# Research Log — DonkeyCar RL Autoresearch

> Chronological research findings, discoveries, bugs, and decisions.
> Every significant observation is recorded here for scientific reproducibility and future reference.
> Format: date, finding, evidence, action taken.

---

## 2026-04-12 — Project Kickoff and Initial Infrastructure

### Finding: Grid Sweep as Research Baseline

**Observation:** Before any autoresearch, we ran an 18-config grid sweep across:

- `n_steer`: [3, 5, 7]
- `n_throttle`: [2, 3]
- `learning_rate`: [0.001, 0.0005, 0.0001]
- 3 repeats each
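
The sweep above is a plain Cartesian product (3 × 2 × 3 = 18 configs, 3 repeats each). A minimal sketch of the enumeration, assuming nothing beyond the axes listed above (`enumerate_sweep` is a hypothetical name, not the project's actual runner):

```python
from itertools import product

# Sweep axes from the grid above (3 * 2 * 3 = 18 configs).
N_STEER = [3, 5, 7]
N_THROTTLE = [2, 3]
LEARNING_RATE = [0.001, 0.0005, 0.0001]
REPEATS = 3

def enumerate_sweep():
    """Yield one dict per (config, repeat) trial."""
    for n_steer, n_throttle, lr in product(N_STEER, N_THROTTLE, LEARNING_RATE):
        for repeat in range(REPEATS):
            yield {"n_steer": n_steer, "n_throttle": n_throttle,
                   "learning_rate": lr, "repeat": repeat}

trials = list(enumerate_sweep())
print(len(trials))  # 18 configs * 3 repeats = 54 trial runs
```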

**Important caveat discovered later:** This sweep used a **random action policy** (a bug — the model-training code had been removed). The rewards reflect how well a random policy can stumble through different action discretizations.

**Valid insight from this data:** Action discretization matters even for a random policy. `n_steer=7, n_throttle=2` outperformed `n_steer=3, n_throttle=2` with random actions — more steering granularity helps even without learning.

**Data location:** `outerloop-results/clean_sweep_results.jsonl` (18 records)

---
## 2026-04-12 — Discovery: Random Policy Bug (Critical)

### Finding: Inner Loop Was Never Training

**Observation:** `donkeycar_sb3_runner.py` was calling `env.action_space.sample()` instead of `model.learn()`. This was introduced when we removed the broken `model.save()` call that caused `NameError: name 'model' is not defined`.

**Root cause:** Legacy code-path removal was too aggressive — it removed training along with the broken save call.

**Impact:**

- All 300 autoresearch trials (two overnight runs) used a random policy
- The `learning_rate` parameter was passed but completely ignored
- `mean_reward` values reflect random-walk quality, not RL training quality
- The GP+UCB found the best *action space for random walking*, not the best *hyperparameters for learning*

**Valid salvage:** The `n_steer=8, n_throttle=5` finding is valid as a discretization insight.

**Invalid:** All `learning_rate` optimization in the 300-trial autoresearch runs.

**Fix:** Completely rebuilt the runner with real `PPO.learn()` + `evaluate_policy()` + `model.save()`.

**Decision record:** ADR-005 — Never call `model.save()` before the model is defined.
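
The rebuilt inner-loop control flow can be sketched as below. `run_trial` and the stub callables are hypothetical names for illustration; the real stable-baselines3 calls are noted in comments. The point is the ordering guaranteed by ADR-005: the model must exist before anything evaluates or saves it.

```python
def run_trial(train_fn, eval_fn, save_fn, params):
    """Inner-loop skeleton: train BEFORE evaluating or saving (ADR-005).

    In the real runner the callables would be stable-baselines3 calls:
      train_fn -> PPO("CnnPolicy", env).learn(total_timesteps=...)
      eval_fn  -> evaluate_policy(model, env, n_eval_episodes=3)
      save_fn  -> model.save(path)
    Stubs are used here so the control flow is testable standalone.
    """
    model = train_fn(params)                  # model exists from here on
    mean_reward, std_reward = eval_fn(model)  # evaluate the trained model
    save_fn(model)                            # safe: model is defined by now
    return {"mean_reward": mean_reward, "std_reward": std_reward, **params}

# Stub example showing the call order:
result = run_trial(
    train_fn=lambda p: {"trained_with": p},
    eval_fn=lambda m: (100.0, 5.0),
    save_fn=lambda m: None,
    params={"learning_rate": 0.0005},
)
print(result["mean_reward"])  # 100.0
```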

---

## 2026-04-12 — Autoresearch Infrastructure Proven

### Finding: GP+UCB Autoresearch Works Correctly

**Observation:** The GP+UCB meta-controller correctly:

- Loads prior results and fits a Gaussian Process
- Uses UCB acquisition to balance exploration and exploitation
- Proposes parameters outside the original grid (e.g., `n_steer=6` was never in the grid)
- Converges toward higher-reward regions with each trial

**Evidence:** After 300 trials, the top 5 consistently clustered around `n_steer=7-9, n_throttle=4-5, lr≈0.002` — a coherent high-reward region.
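
The UCB selection rule itself is small enough to show inline. A minimal sketch, assuming `mu`/`sigma` come from a Gaussian Process fitted to prior trial results (supplied directly here so the acquisition step is visible on its own; `ucb_select` is an illustrative name, not the project's actual function):

```python
def ucb_select(candidates, mu, sigma, kappa=2.0):
    """Pick the candidate maximizing the UCB acquisition: mu + kappa * sigma.

    High sigma (unexplored region) can beat high mu (known-good region),
    which is how the controller proposes points outside the original grid.
    """
    scores = [m + kappa * s for m, s in zip(mu, sigma)]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]

# A well-explored point (low sigma) loses to an unexplored one (high sigma):
cands = [{"n_steer": 7}, {"n_steer": 6}]
pick, score = ucb_select(cands, mu=[1.0, 0.8], sigma=[0.05, 0.5], kappa=2.0)
print(pick)  # {'n_steer': 6}: exploration wins at kappa=2
```

With `kappa=0` the same call collapses to pure exploitation and returns the highest-mean candidate instead.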

**Conclusion:** The infrastructure is sound. The data came from the wrong experiments, but the meta-loop works exactly as designed.

---
## 2026-04-13 — Phase 1 Launch: First Real Training Attempt

### Finding: Timeout — PPO+CNN is Too Slow on CPU for Large Timesteps

**Observation:** The first Phase 1 run with real PPO training proposed 20k-30k timesteps. At ~5-10 steps/sec (PPO+CNN on CPU), this requires 2000-6000 seconds per trial — far exceeding the 600-second timeout.

**Evidence:** Trials 1-6 all timed out at exactly 600 seconds.

**Fix:** Reduced the timestep search space from [5000, 30000] to [1000, 5000]. At ~15-30 steps/sec (raw DonkeyCar sim speed), 5000 steps ≈ 170-330 seconds, which fits within the 480 s timeout.

**Lesson:** Always calibrate the timeout to actual sim + training speed before launching sweeps.
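
The calibration in that lesson is one line of arithmetic; a sketch follows, where the 0.7 safety factor is an assumed headroom margin (not a value from the log):

```python
def max_feasible_timesteps(steps_per_sec, timeout_sec, safety=0.7):
    """Largest timestep budget that fits the trial timeout with headroom.

    safety=0.7 is an assumed margin for env startup, eval episodes, and
    model saving; tune it against measured trial wall-clock times.
    """
    return int(steps_per_sec * timeout_sec * safety)

# At the slow end of the observed speed (~15 steps/sec) with a 480 s
# timeout, ~5000 steps is the practical ceiling, matching the new
# [1000, 5000] search space.
print(max_feasible_timesteps(15, 480))  # 5040
```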

---

## 2026-04-13 — Discovery: Car Not Moving (PPO Throttle Problem)

**Observation:** During early Phase 1 training, the car's steering values changed but the car did not move.

**Root cause:** PPO with a continuous action space outputs actions in `[-1, 1]` for all dimensions, but DonkeyCar expects `throttle ∈ [0, 1]`. When PPO's randomly initialized policy outputs throttle ≈ -0.5, it gets clipped to 0 — the car sits still.

**Fix:** Added a `ThrottleClampWrapper` that ensures throttle ∈ [0.2, 1.0]. This guarantees the car always moves forward, even before any learning.

**Impact:** Without this fix the car never moves, the health check detects it as a stuck sim, and training is killed prematurely.
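
The clamp logic is a one-liner; a minimal standalone sketch follows. In the project this would live in the `action()` method of a gym-style `ActionWrapper` (`clamp_throttle` and the `[steering, throttle]` action layout are illustrative assumptions):

```python
THROTTLE_MIN, THROTTLE_MAX = 0.2, 1.0

def clamp_throttle(action):
    """Force the throttle component of a [steering, throttle] action
    into [0.2, 1.0], so the car moves forward even when the untrained
    policy emits negative throttle in [-1, 1]."""
    steering, throttle = action
    throttle = max(THROTTLE_MIN, min(THROTTLE_MAX, throttle))
    return [steering, throttle]

print(clamp_throttle([0.1, -0.5]))  # [0.1, 0.2] (throttle floored, car still moves)
print(clamp_throttle([0.0, 0.7]))   # [0.0, 0.7] (in-range throttle untouched)
```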

---

## 2026-04-13 — Critical Discovery: Reward Hacking via SpeedRewardWrapper 🚨

### Finding: Model Learned to Exploit Speed Reward by Oscillating at the Track Boundary

**Observation:** After fixing the throttle and timestep issues, Phase 1 trials ran successfully. Some trials produced suspiciously high rewards:

| Trial | mean_reward | n_throttle | lr      | Verdict |
|-------|-------------|------------|---------|---------|
| 8     | **1936.9**  | 2          | 0.00145 | 🚨 HACKED |
| 13    | **1139.4**  | 2          | 0.00058 | 🚨 HACKED |
| 11    | 439.9       | 3          | 0.00048 | ⚠️ Suspicious |
| 2     | 398.9       | 2          | 0.00236 | ⚠️ Suspicious |

**Root cause:** The `SpeedRewardWrapper` computed:

```
reward = speed × (1 - abs(cte) / max_cte)
```

The model discovered a policy that **maximizes this formula without genuine track driving**:

1. Drive fast toward the track boundary
2. Return to the track center (momentarily low CTE = high reward)
3. Repeat — "oscillation farming"

The crash penalty (`-10`) was insufficient to deter this, because thousands of oscillation steps accumulate far more positive reward.

**Physical impossibility check:** A car driving at max speed (≈5 m/s) perfectly centered for 3429 steps would accumulate ≈ `5.0 × 1.0 × 3429 = 17,145`. The observed max was 1937 — so technically possible, but the high variance (`std_reward=34`) across only 3 eval episodes, together with the user's direct observation, confirms hacking.

**User observation (direct visual confirmation):** "The model found a way to rig the reward by just going left — it was off the track and then back on the track."

**Impact:** The entire Phase 1 dataset with `reward_shaping=True` is corrupted. The GP fitted on these rewards was optimizing for hacking parameters, not driving parameters.

**Action taken:**

- Archived all Phase 1 results: `autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl`
- Archived the hacked models: `models/ARCHIVED_reward_hacking/`
- Redesigned the reward function entirely

---

## 2026-04-13 — Fix: Hack-Proof Reward Shaping Design

### Finding: Multiplicative Speed Bonus Prevents Reward Hacking

**Problem with the v1 formula:** `reward = speed × f(cte)` pays for raw speed whenever `f(cte)` is positive, so the model can maximize speed largely independently of how well it tracks the centerline.

**Solution — multiplicative on-track bonus:**

```python
if original_reward > 0:
    shaped = original_reward * (1 + speed_scale * speed)
else:
    shaped = original_reward  # no speed bonus when off track
```

**Why this is hack-proof:**

- `original_reward > 0` is ONLY true when the car is on track AND centered (DonkeyCar's own CTE signal)
- When off track, `original_reward ≤ 0` — no speed reward possible
- The model cannot increase reward by going fast off-track
- The formula is bounded: `shaped ≤ original_reward × (1 + speed_scale × max_speed)`

**Author's insight:** "Speed should only be rewarded if you are progressing down the track."

**Implementation:** `agent/reward_wrapper.py` — `SpeedRewardWrapper` v2.
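
The v2 shaping reduces to a pure function of two inputs, sketched below. The `speed_scale=0.5` default is an illustrative assumption, not the project's configured value:

```python
def shaped_reward_v2(original_reward, speed, speed_scale=0.5):
    """v2 shaping: speed bonus gated on the sim's own on-track signal.

    original_reward > 0 only when the car is on track and centered, so
    speed can never add reward while the car is off track.
    """
    if original_reward > 0:
        return original_reward * (1 + speed_scale * speed)
    return original_reward  # off track: pass the (non-positive) reward through

# On track: the bonus scales with speed. Off track: untouched.
print(shaped_reward_v2(1.0, 4.0))    # 3.0
print(shaped_reward_v2(-10.0, 4.0))  # -10.0
```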

---

## 2026-04-13 — Lesson: Reward Function Design Principles

From this experience, we derived the following principles for DonkeyCar RL reward shaping:

1. **Never reward speed unconditionally.** Speed reward must be gated on track presence.
2. **The original DonkeyCar reward is the ground truth.** Any shaping must respect it, not replace it.
3. **Multiplicative bonuses are safer than additive.** They can't be maximized independently.
4. **High variance in eval reward is a red flag.** `std_reward=34` on 3 episodes suggests instability.
5. **Physically impossible reward values signal hacking.** Establish theoretical reward bounds before training.
6. **Low `n_throttle` (=2) may enable hacking.** With only 2 throttle values, the model may discover degenerate oscillation policies more easily. Investigate.

---

## Next Research Questions

1. **Does `n_throttle=2` uniquely enable hacking?** The hacked models all had `n_throttle=2`. With only 2 throttle states (stop/full throttle), oscillation may be easier to exploit.
2. **What is the minimum timestep count for genuine learning?** The low-reward trials (5-22) may not have trained long enough. Is 3000 steps sufficient for any real driving behavior?
3. **Does the multiplicative reward fix change the optimal hyperparameter region?** Re-run autoresearch with the fixed reward and compare top configurations.
4. **Can we detect reward hacking automatically?** A reward-per-step threshold (e.g., flag if mean > 2.0 per step) could auto-detect hacking during training.
5. **What does a genuinely good reward look like?** After completing Phase 1 cleanly, characterize the reward distribution of a car that drives one full lap.

---

## 2026-04-13 — Critical Discovery: Circular Driving Exploit (v2 Reward Still Hackable)

### Finding: Car Learns to Circle at the Starting Line

**User observation (direct visual):** "The model found a way to rig the reward by going left in circles — it was off the track and then back on track, but detected as failure. Model uses this as best way to maximize reward."

**Data confirmation:**

| Trial | mean_reward | std_reward | cv%   | r/step | Verdict |
|-------|-------------|------------|-------|--------|---------|
| 1     | 270.56      | 0.143      | 0.1%  | 0.086  | ⚠️ CIRCULAR (suspiciously low std) |
| 5     | **4582.80** | **0.485**  | **0.0%** | **0.957** | 🚨 CIRCULAR (confirmed) |
| 10    | 682.74      | 420.91     | 61.7% | 0.153  | ⚠️ UNSTABLE (sometimes circles, sometimes crashes) |

**Statistical signature of circular motion:**

- cv (coefficient of variation = std/mean) < 1% with high reward → very consistent behavior
- Circular driving IS very consistent: every circle is the same
- Legitimate driving is stochastic: different obstacles, curves, luck
- Trial 5: cv = 0.0% over 3 eval episodes → textbook circling

**Why v2 reward still allowed this:**

- The v2 fix applied `reward = original × (1 + speed_scale × speed)` ONLY when on track
- A car circling at the starting line HAS low CTE (on the track centerline) AND positive speed
- Result: full speed bonus for circling → 4582 reward over 4787 steps
- CTE and raw speed cannot distinguish forward motion from circular motion

### Root Cause: Missing Dimension — Track Progress

The fundamental issue: **neither CTE nor speed captures PROGRESS along the track.**

- CTE measures: am I near the centerline? (yes for circles)
- Speed measures: am I moving? (yes for circles)
- Progress measures: am I getting anywhere new? (NO for circles)

### Fix: Path Efficiency Reward (v3)

**Formula:**

```
efficiency = net_displacement / total_path_length   (over a sliding window of 30 steps)
shaped_reward = original_reward × (1 + speed_scale × speed × efficiency)
```

**Why this works:**

- Forward driving: `efficiency ≈ 1.0` (all movement is productive)
- Circular driving: `efficiency ≈ 0.0` (many steps, but the car returns to its start position)
- The speed bonus disappears when circling → the car is incentivized to go FORWARD
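
The efficiency term can be sketched as a small sliding-window tracker over recent positions. The window size of 30 matches the formula above; the class and method names are illustrative, not the project's actual API:

```python
from collections import deque
from math import dist

class PathEfficiency:
    """Sliding-window path efficiency: net displacement / total path length.

    Positions are (x, z) tuples; a full circle yields net displacement near
    zero while path length stays large, driving efficiency toward 0.
    """
    def __init__(self, window=30):
        self.positions = deque(maxlen=window)

    def update(self, pos):
        self.positions.append(pos)
        if len(self.positions) < 2:
            return 1.0  # not enough history yet: assume efficient
        pts = list(self.positions)
        path = sum(dist(a, b) for a, b in zip(pts, pts[1:]))
        net = dist(pts[0], pts[-1])
        return net / path if path > 0 else 0.0

eff = PathEfficiency()
# Straight line: every step adds net displacement.
for x in range(10):
    e = eff.update((float(x), 0.0))
print(round(e, 2))  # 1.0
```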

**Proof (tests):**

- `test_efficiency_near_zero_for_circular_driving`: efficiency < 0.2 after a full circle
- `test_efficiency_near_one_for_straight_driving`: efficiency > 0.90 for a straight line
- `test_straight_driving_gets_higher_reward_than_circular`: the key guarantee test

**Data archived:**

- `autoresearch_results_phase1_CORRUPTED_circular_driving.jsonl` (12 records, circular)
- `models/ARCHIVED_circular_driving/` (trial-0001 through trial-0013)

### Lesson: cv% is a Reward Hacking Indicator

| cv%  | Interpretation |
|------|----------------|
| < 1% + high reward | Likely reward hacking (very consistent exploit) |
| 1-10% | Normal RL variance |
| > 50% | Unstable policy, inconsistent behavior |

This metric will be added to the autoresearch result logging and summary.
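
A sketch of how that indicator could be computed from logged results. The `high_reward=1000.0` threshold is an assumed placeholder that would need tuning per track; function names are illustrative:

```python
def cv_percent(mean_reward, std_reward):
    """Coefficient of variation as a percentage: std / mean * 100."""
    return abs(std_reward / mean_reward) * 100 if mean_reward else float("inf")

def hacking_verdict(mean_reward, std_reward, high_reward=1000.0):
    """Heuristic flag derived from the cv% table above.

    Very low cv with very high reward suggests a consistent exploit;
    very high cv suggests an unstable policy.
    """
    cv = cv_percent(mean_reward, std_reward)
    if cv < 1.0 and mean_reward > high_reward:
        return "LIKELY_HACKING"
    if cv > 50.0:
        return "UNSTABLE"
    return "OK"

print(hacking_verdict(4582.80, 0.485))   # LIKELY_HACKING (trial 5 signature)
print(hacking_verdict(682.74, 420.91))   # UNSTABLE (trial 10 signature)
```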

---
## 2026-04-13 — 🏆 PHASE 1 MILESTONE: Genuine Track Driving Confirmed!

### Finding: Champion Model Drives the Track — Real RL Behaviour Proven

**This is the first confirmed genuine driving result from the autoresearch pipeline.**

**Visual confirmation (user):** "It is definitely driving! The donkeycar is driving along the track!"

**Evaluation data — 3 episodes, 1500 max steps:**

| Episode  | Steps   | Total Reward | Std      | Efficiency |
|----------|---------|--------------|----------|------------|
| 1        | 599     | 1022.73      | —        | 96-100%    |
| 2        | 598     | 1023.35      | —        | 96-100%    |
| 3        | 599     | 1022.25      | —        | 96-100%    |
| **Mean** | **599** | **1022.78**  | **0.45** | **~99%**   |

**Champion model parameters:**

- Agent: PPO, n_steer=7, n_throttle=3, lr=0.000680, timesteps=4787
- Path: `agent/models/champion/model.zip`

### Track Trajectory Analysis

```
Start:    Pos(6.25, 6.30)    → Starting line
Step 300: Pos(22.80, 2.09)   → Long straight, approaching first corner
Step 400: Pos(18.80, -6.96)  → Negotiating first right-hand curve ✅
Step 500: Pos(28.12, -5.61)  → Continuing along second straight
Step 560: Pos(33.12, -6.55)  → Approaching second corner
Step 599: CRASH, CTE=8.26    → Off track at second corner ❌
```

The car successfully:

- Accelerates from 0 → 2.3 m/s along the straight
- Navigates the first right-hand curve
- Follows the track for ~600 steps, covering ~30+ position units

### Failure Analysis: The S-Curve Crash

**User observation:** "The spot where the donkeycar goes off the track is during a right hand curve which quickly turns into a left hand curve. It doesn't even look like it sees the left hand curve."

**What the data shows:**

- Steps 540-560: CTE briefly near zero (0.24) — the car approaches the corner well
- Steps 570+: CTE explodes 1.4 → 3.8 → 5.9 → 8.3 — the car overshoots
- Speed at crash: 2.23-2.30 m/s — too fast for the S-curve

**Root cause:** Only 4787 training timesteps — insufficient to learn:

1. Speed reduction approaching corners
2. Left-turn recovery after a right-hand overshoot
3. S-curve geometry (right → quick left transition)

**Key insight: the model never sees the left-hand curve** because it has always crashed at the right-hand part first during training. This is an exploration problem — the car needs more timesteps to get past this point and discover what lies beyond.

### Reward Shaping Victory

All three reward hacking fixes proved necessary and correct:

- v1 additive → boundary oscillation exploit
- v2 multiplicative → circular driving exploit
- v3 path efficiency → genuine forward driving ✅

The path efficiency metric (96-100% throughout the entire run) confirms the car is making continuous forward progress — not circling, not oscillating.

### Phase 1 → Phase 2 Transition

**Phase 1 objective achieved:** A real PPO model drives the DonkeyCar track with genuine forward motion, consistent behaviour (std=0.45), and a correct trajectory.

**Next objective (targeted autoresearch):** Learn corner handling and speed modulation.

- Increase timesteps to 10,000-50,000 per trial
- The model needs to see the S-curve many times to learn the transition
- Consider adding a CTE-rate-of-change penalty to discourage high speed at high CTE

### This is Research!

The reward hacking discovery and the progression from random walk → boundary oscillation → circular exploit → genuine driving represent real empirical RL research. Each failure mode revealed a fundamental property of reward design. The path efficiency fix was an original contribution: it solves the circular driving problem without requiring track-shape knowledge.

---
## 2026-04-13 — Reward v4: Full Sim Bypass (base × efficiency × speed)

### Finding: v3 Still Allowed Circling — Base Reward Not Gated by Efficiency

**Observation (user):** The car turned left or right in circles from the start in Phase 2 runs (47k-timestep trials).

**Root cause discovered in `donkey_sim.py`:**

```python
# sim's own reward (lines 478-498):
if self.forward_vel > 0.0:
    return (1.0 - abs(cte) / max_cte) * self.forward_vel
```

`forward_vel` = dot(car_heading, velocity). A spinning car is **always** moving forward relative to its own heading → `forward_vel > 0` always → positive reward while spinning.

**Why v3 was insufficient:**

- v3 multiplied only the SPEED BONUS by efficiency: `original × (1 + scale × speed × eff)`
- But `original` (from the sim) was already exploitable: CTE ≈ 0 while spinning → `original = 1.0`
- Efficiency killed the speed bonus but NOT the base reward
- A spinning car at CTE=0 earns 1.0/step × 47k steps = 47k total reward (and never crashes in a circle!)

**Fix — v4 formula:**

```
reward = base_CTE × efficiency × (1 + speed_scale × speed)
```

Where `base_CTE = 1 - abs(cte)/max_cte` is computed from the info dict, completely bypassing the sim.

- Spinning (eff ≈ 0): reward ≈ 0 regardless of CTE or speed ✅
- Forward driving (eff ≈ 1): reward = base × (1 + scale × speed) ✅
- All three terms must be high simultaneously to earn reward ✅
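
The v4 formula as a pure function, for reference. The `max_cte=8.0` and `speed_scale=0.5` defaults are illustrative placeholders, not the project's configured values:

```python
def shaped_reward_v4(cte, speed, efficiency, max_cte=8.0, speed_scale=0.5):
    """v4: every term gates the others. The base CTE term is computed from
    the info dict, bypassing the sim's forward_vel-based reward entirely.
    """
    base = max(0.0, 1.0 - abs(cte) / max_cte)
    return base * efficiency * (1 + speed_scale * speed)

# Spinning at the centerline: perfect CTE, positive speed, but efficiency ~0.
print(shaped_reward_v4(cte=0.0, speed=3.0, efficiency=0.0))  # 0.0
# Forward driving: all three terms high simultaneously.
print(shaped_reward_v4(cte=0.0, speed=3.0, efficiency=1.0))  # 2.5
```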

**Key test added:** `test_circling_at_zero_cte_gives_near_zero_reward` — confirms the core v4 guarantee that the worst-case exploit (spinning at CTE=0) earns near-zero reward.

**The lesson:** When efficiency is applied only to the SPEED BONUS, the base reward from the sim can still be gamed. The efficiency multiplier must apply to the ENTIRE reward.

---
## 2026-04-14 — 🏆 PHASE 2 MILESTONE: All Top Models Complete the Track!

### Finding: Track Completion Achieved — Multiple Distinct Driving Styles

**User visual confirmation:** All 3 top Phase 2 models successfully complete the entire track!

**Model comparison at 3000 steps:**

| Model | Steps | Reward | Std | Driving Style |
|-------|-------|--------|-----|---------------|
| Trial 20 (n_steer=3, n_throttle=5, lr=0.000225, 13k steps) | **2874** | 2297 | 5.7 | Right lane, very stable ⭐ |
| Trial 8 (n_steer=4, n_throttle=3, lr=0.00117, 34k steps)   | 2258 | 2072 | 0.4 | Left/center, oscillating |
| Trial 18 (n_steer=3, n_throttle=5, lr=0.000288, 16k steps) | 2256 | 2072 | 0.4 | Right shoulder, very accurate |

**Key insight — the track ENDS!** The runs don't time out — the car genuinely completes the full track. The CTE spike at the end is the car reaching the track boundary/finish.

### Why Different Driving Styles Emerged

**Action space discretization is the dominant factor:**

- `n_steer=3`: only LEFT/STRAIGHT/RIGHT → decisive, committed steering → clean lane following
- `n_steer=4`: 4 steering positions → an oscillating correction policy (still completes the track)
- `n_throttle=5`: more speed granularity → smoother corner negotiation

**CTE reward symmetry creates multiple valid solutions:**

The reward `base_CTE × efficiency × speed` is symmetric — driving 0.5 m left of center scores the same as driving 0.5 m right of center (same |CTE|). PPO's random initialization determines which symmetric solution the model converges to. This is why Trials 20 and 18 drive on opposite sides of the road despite similar hyperparameters.

**Emergent counterintuitive finding: FEWER steering bins → BETTER driving**

Trial 20 (n_steer=3) outperforms Trial 8 (n_steer=4) in both distance and smoothness. With only 3 steering bins, the model is forced to commit to decisive actions, developing a cleaner driving policy. More action granularity introduced oscillation without improving performance.

### Can We Control Driving Behaviour?

Yes! Through targeted reward shaping:

1. **Lane position targeting**: `reward = 1 - abs(cte - target_offset)/max_cte` → bias toward a specific lane position
2. **Anti-oscillation penalty**: penalize rapid steering changes → eliminates the Model 2 oscillation
3. **Asymmetric CTE**: penalize left-of-center more → enforces a right-lane driving rule
4. **Speed zones**: reward deceleration before corners (future work)
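
Variant 1 above is the smallest change to the existing base term; a sketch, assuming the same `max_cte` normalization as v4 (`max_cte=8.0` is an illustrative placeholder):

```python
def lane_target_reward(cte, target_offset, max_cte=8.0):
    """Peak reward at a chosen lateral offset rather than the centerline,
    breaking the left/right symmetry of the plain |CTE| term."""
    return max(0.0, 1.0 - abs(cte - target_offset) / max_cte)

# Bias toward driving 1.0 m right of center:
print(lane_target_reward(cte=1.0, target_offset=1.0))   # 1.0  (on target)
print(lane_target_reward(cte=-1.0, target_offset=1.0))  # 0.75 (left of center penalized)
```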

### Phase 2 → Phase 3 Transition

**Phase 2 objective ACHIEVED:** Models complete the full track with genuine learned driving behaviour.

**Phase 3 objectives:**

- Behavioral control (lane position, oscillation suppression)
- Speed optimization (fastest lap time)
- Multi-track generalization
- Fine-tuning from the Phase 2 champion

**Phase 2 Champion:** Trial 20 — n_steer=3, n_throttle=5, lr=0.000225, 13k steps