# Research Log — DonkeyCar RL Autoresearch
> Chronological research findings, discoveries, bugs, and decisions.
> Every significant observation is recorded here for scientific reproducibility and future reference.
> Format: date, finding, evidence, action taken.
---
## 2026-04-12 — Project Kickoff and Initial Infrastructure
### Finding: Grid Sweep as Research Baseline
**Observation:** Before any autoresearch, we ran an 18-config grid sweep across:
- `n_steer`: [3, 5, 7]
- `n_throttle`: [2, 3]
- `learning_rate`: [0.001, 0.0005, 0.0001]
- 3 repeats each
**Important caveat discovered later:** This sweep used a **random action policy** (bug — model training code had been removed). The rewards reflect how well a random policy can stumble through different action discretizations.
**Valid insight from this data:** Action discretization matters even for random policy.
`n_steer=7, n_throttle=2` outperformed `n_steer=3, n_throttle=2` with random actions — more steering granularity helps even without learning.
**Data location:** `outerloop-results/clean_sweep_results.jsonl` (18 records)
---
## 2026-04-12 — Discovery: Random Policy Bug (Critical)
### Finding: Inner Loop Was Never Training
**Observation:** The `donkeycar_sb3_runner.py` was calling `env.action_space.sample()` instead of `model.learn()`. This was introduced when we removed the broken `model.save()` call that caused `NameError: name 'model' is not defined`.
**Root cause:** Legacy code path removal was too aggressive — removed training along with the broken save call.
**Impact:**
- All 300 autoresearch trials (two overnight runs) used random policy
- `learning_rate` parameter was passed but completely ignored
- `mean_reward` values reflect random-walk quality, not RL training quality
- The GP+UCB found the best *action space for random walking*, not the best *hyperparameters for learning*
**Valid salvage:** The `n_steer=8, n_throttle=5` finding is valid as a discretization insight.
**Invalid:** All learning_rate optimization in the 300-trial autoresearch runs.
**Fix:** Completely rebuilt runner with real `PPO.learn()` + `evaluate_policy()` + `model.save()`.
**Decision record:** ADR-005 — Never call model.save() before model is defined.
---
## 2026-04-12 — Autoresearch Infrastructure Proven
### Finding: GP+UCB Autoresearch Works Correctly
**Observation:** The GP+UCB meta-controller correctly:
- Loads prior results and fits a Gaussian Process
- Uses UCB acquisition to balance exploration/exploitation
- Proposes parameters outside the original grid (e.g., `n_steer=6` was never in grid)
- Converges toward higher-reward regions with each trial
**Evidence:** After 300 trials, the top-5 consistently clustered around `n_steer=7-9, n_throttle=4-5, lr≈0.002` — a coherent high-reward region.
**Conclusion:** The infrastructure is sound. The data was from wrong experiments, but the meta-loop works exactly as designed.
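The acquisition step can be illustrated in isolation. A minimal sketch of UCB candidate selection, assuming the GP posterior mean and standard deviation at each candidate are already available (`ucb_select` and the toy numbers below are illustrative, not the project's code):

```python
import numpy as np

def ucb_select(candidates, mu, sigma, kappa=2.0):
    """Pick the candidate with the highest Upper Confidence Bound.

    mu/sigma are the GP posterior mean and std at each candidate;
    kappa trades exploration (high) against exploitation (low).
    """
    scores = mu + kappa * sigma
    return candidates[int(np.argmax(scores))]

# Toy example: three n_steer candidates.
candidates = np.array([3, 6, 9])
mu = np.array([10.0, 12.0, 11.0])   # predicted mean reward
sigma = np.array([0.5, 0.1, 2.0])   # predictive uncertainty
print(ucb_select(candidates, mu, sigma))  # 9: scores are [11.0, 12.2, 15.0]
```

With `kappa=0` the same helper degenerates to pure exploitation and would pick the highest-mean candidate instead; this is the exploration/exploitation dial the log refers to.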
---
## 2026-04-13 — Phase 1 Launch: First Real Training Attempt
### Finding: Timeout — PPO+CNN is Too Slow on CPU for Large Timesteps
**Observation:** First Phase 1 run with real PPO training proposed 20k-30k timesteps.
At roughly 5-10 training steps/sec (PPO+CNN on CPU), this requires 2000-6000 seconds per trial, far exceeding the 600-second timeout.
**Evidence:** Trials 1-6 all timed out at exactly 600 seconds.
**Fix:** Reduced timestep search space from [5000, 30000] to [1000, 5000].
At ~15-30 steps/sec (DonkeyCar sim speed), 5000 steps ≈ 170-330 seconds. Fits within 480s timeout.
**Lesson:** Always calibrate timeout to actual sim + training speed before launching sweeps.
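That calibration can be captured in a pre-flight check before launching a sweep. A minimal sketch, assuming a simple timesteps-over-throughput model of trial duration (`fits_timeout` and the 0.8 safety margin are hypothetical, not from the runner):

```python
def trial_seconds(timesteps, steps_per_sec):
    """Estimated wall-clock seconds for one training trial."""
    return timesteps / steps_per_sec

def fits_timeout(timesteps, steps_per_sec, timeout_s, margin=0.8):
    """True if the estimated trial duration fits within `margin` of the timeout."""
    return trial_seconds(timesteps, steps_per_sec) <= margin * timeout_s

# Numbers from this log entry: 5000 steps at the slow end (15 steps/s)
# take ~333 s, comfortably inside a 480 s timeout.
print(fits_timeout(5000, 15, 480))   # True
print(fits_timeout(30000, 15, 600))  # False: ~2000 s >> 600 s
```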
---
## 2026-04-13 — Discovery: Car Not Moving (PPO Throttle Problem)
**Observation:** During early Phase 1 training, the car's steering values changed but the car did not move.
**Root cause:** PPO with continuous action space outputs actions in `[-1, 1]` for all dimensions.
DonkeyCar expects `throttle ∈ [0, 1]`. When PPO's random initial policy outputs throttle ≈ -0.5, it gets clipped to 0 — the car sits still.
**Fix:** Added `ThrottleClampWrapper` that ensures throttle ∈ [0.2, 1.0].
This guarantees the car always moves forward, even before any learning.
**Impact:** Without this fix, the car never moves and the health check detects it as a stuck sim, prematurely killing training.
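The clamping logic itself is small. A pure-function sketch of what `ThrottleClampWrapper` presumably applies to each action (the real wrapper subclasses a Gym-style action wrapper; `clamp_throttle` here is an illustrative stand-in):

```python
def clamp_throttle(action, lo=0.2, hi=1.0):
    """Map a [steering, throttle] action so throttle stays in [lo, hi].

    PPO's initial policy emits both dimensions in [-1, 1]; without this,
    a negative throttle is clipped to 0 by the sim and the car never moves.
    """
    steering, throttle = action
    throttle = max(lo, min(hi, throttle))
    return [steering, throttle]

print(clamp_throttle([0.1, -0.5]))  # [0.1, 0.2]: the car still creeps forward
```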
---
## 2026-04-13 — Critical Discovery: Reward Hacking via SpeedRewardWrapper 🚨
### Finding: Model Learned to Exploit Speed Reward by Oscillating at Track Boundary
**Observation:** After fixing throttle and timestep issues, Phase 1 trials ran successfully.
Some trials produced suspiciously high rewards:
| Trial | mean_reward | n_throttle | lr | verdict |
|-------|-------------|------------|--------|---------|
| 8 | **1936.9** | 2 | 0.00145 | 🚨 HACKED |
| 13 | **1139.4** | 2 | 0.00058 | 🚨 HACKED |
| 11 | 439.9 | 3 | 0.00048 | ⚠️ Suspicious |
| 2 | 398.9 | 2 | 0.00236 | ⚠️ Suspicious |
**Root cause:** The `SpeedRewardWrapper` computed:
```
reward = speed × (1 - abs(cte) / max_cte)
```
The model discovered a policy that **maximizes this formula without genuine track driving**:
1. Drive fast toward the track boundary
2. Return to track center (momentarily low CTE = high reward)
3. Repeat — "oscillation farming"
The crash penalty (`-10`) was insufficient to deter this because thousands of oscillation steps accumulate far more positive reward.
**Physical impossibility check:** A car driving at max speed (≈5 m/s) perfectly centered for 3429 steps would accumulate ≈ `5.0 × 1.0 × 3429 = 17,145`. Observed max was 1937 — so technically possible but the high variance (`std_reward=34`) across only 3 eval episodes and the user's direct observation confirm hacking.
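The bound arithmetic above generalizes to a reusable plausibility check. A sketch, assuming the v1 formula `speed × (1 - |cte|/max_cte)`, whose per-step maximum is `max_speed` when perfectly centered (function names are hypothetical):

```python
def reward_upper_bound(max_speed, max_steps):
    """Theoretical maximum for reward = speed * (1 - |cte|/max_cte):
    max speed, perfectly centered (factor 1.0), every step."""
    return max_speed * 1.0 * max_steps

def is_physically_plausible(observed, max_speed, max_steps):
    """False means the observed reward exceeds what physics allows."""
    return observed <= reward_upper_bound(max_speed, max_steps)

print(reward_upper_bound(5.0, 3429))               # 17145.0, as computed above
print(is_physically_plausible(1936.9, 5.0, 3429))  # True: below the bound
```

As the log notes, a value below the bound is necessary but not sufficient evidence of genuine driving; the check only catches outright impossibilities.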
**User observation (direct visual confirmation):** "The model found a way to rig the reward by just going left — it was off the track and then back on the track."
**Impact:** The entire Phase 1 dataset with `reward_shaping=True` is corrupted.
The GP fitted on these rewards was optimizing for hacking parameters, not driving parameters.
**Action taken:**
- Archived all Phase 1 results: `autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl`
- Archived hacked models: `models/ARCHIVED_reward_hacking/`
- Redesigned reward function entirely
---
## 2026-04-13 — Fix: Hack-Proof Reward Shaping Design
### Finding: Multiplicative Speed Bonus Prevents Reward Hacking
**Problem with the v1 formula:** `reward = speed × f(cte)` replaces the simulator's own reward outright, so the agent can farm it by combining high speed with only momentary low-CTE states, independent of genuine track progress.
**Solution — multiplicative on-track bonus:**
```python
if original_reward > 0:
    shaped = original_reward * (1 + speed_scale * speed)
else:
    shaped = original_reward  # no speed bonus when off track
```
**Why this is hack-proof:**
- `original_reward > 0` is ONLY true when the car is on track AND centered (DonkeyCar's own CTE signal)
- When off track, `original_reward ≤ 0` — no speed reward possible
- The model cannot increase reward by going fast off-track
- The formula is bounded: `shaped ≤ original_reward × (1 + speed_scale × max_speed)`
**Author's insight:** "Speed should only be rewarded if you are progressing down the track."
**Implementation:** `SpeedRewardWrapper` v2 in `agent/reward_wrapper.py`.
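As a self-contained check of the gating behaviour, the v2 rule can be written as a standalone function (the `speed_scale=0.5` default and the function name are illustrative, not taken from `reward_wrapper.py`):

```python
def shaped_reward_v2(original_reward, speed, speed_scale=0.5):
    """v2 shaping: multiplicative speed bonus, gated on the sim's own reward sign."""
    if original_reward > 0:           # on track and centered, per DonkeyCar's CTE signal
        return original_reward * (1 + speed_scale * speed)
    return original_reward            # off track: speed is ignored entirely

print(shaped_reward_v2(1.0, 4.0))    # 3.0: on-track and fast earns the bonus
print(shaped_reward_v2(-10.0, 4.0))  # -10.0: off-track, speed cannot help
```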
---
## 2026-04-13 — Lesson: Reward Function Design Principles
From this experience, we derived the following principles for DonkeyCar RL reward shaping:
1. **Never reward speed unconditionally.** Speed reward must be gated on track presence.
2. **The original DonkeyCar reward is the ground truth.** Any shaping must respect it, not replace it.
3. **Multiplicative bonuses are safer than additive.** They can't be maximized independently.
4. **High variance in eval reward is a red flag.** `std_reward=34` on 3 episodes suggests instability.
5. **Physically impossible reward values signal hacking.** Establish theoretical reward bounds before training.
6. **Low `n_throttle` (=2) may enable hacking.** With only 2 throttle values, the model may discover degenerate oscillation policies more easily. Investigate.
---
## Next Research Questions
1. **Does `n_throttle=2` uniquely enable hacking?** The hacked models all had `n_throttle=2`. With only 2 throttle states (stop/full-throttle), oscillation may be easier to exploit.
2. **What is the minimum timestep for genuine learning?** The low-reward trials (5-22) may not have trained long enough. Is 3000 steps sufficient for any real driving behavior?
3. **Does the multiplicative reward fix change the optimal hyperparameter region?** Re-run autoresearch with fixed reward and compare top configurations.
4. **Can we detect reward hacking automatically?** A reward-per-step threshold (e.g., flag if mean > 2.0 per step) could auto-detect hacking during training.
5. **What does a genuinely good reward look like?** After completing Phase 1 cleanly, characterize the reward distribution of a car that drives one full lap.
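Question 4 can be prototyped cheaply. A sketch of a reward-per-step detector, assuming the 2.0-per-step threshold suggested above; the function, its name, and the episode lengths in the examples are illustrative:

```python
def flag_hacking(mean_reward, episode_steps, per_step_threshold=2.0):
    """Flag a trial whose mean reward per step exceeds a plausibility threshold.

    Returns (flagged, reward_per_step) so the raw ratio can be logged too.
    """
    per_step = mean_reward / max(episode_steps, 1)
    return per_step > per_step_threshold, per_step

# Trial 8 from the Phase 1 table, assuming ~600-step episodes (hypothetical):
flagged, per_step = flag_hacking(1936.9, 600)
print(flagged)  # True: ~3.2 reward/step is implausibly high
```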
---
## 2026-04-13 — Critical Discovery: Circular Driving Exploit (v2 Reward Still Hackable)
### Finding: Car Learns to Circle at Starting Line
**User observation (direct visual):** "The model found a way to rig the reward by going left in circles — it was off the track and then back on track, but detected as failure. Model uses this as best way to maximize reward."
**Data confirmation:**
| Trial | mean_reward | std_reward | cv% | r/step | verdict |
|-------|-------------|------------|-------|--------|---------|
| 1 | 270.56 | 0.143 | 0.1% | 0.086 | ⚠️ CIRCULAR (suspiciously low std) |
| 5 | **4582.80** | **0.485** | **0.0%** | **0.957** | 🚨 CIRCULAR (confirmed) |
| 10 | 682.74 | 420.91 | 61.7% | 0.153 | ⚠️ UNSTABLE (sometimes circles, sometimes crashes) |
**Statistical signature of circular motion:**
- cv (coefficient of variation = std/mean) < 1% combined with high reward means suspiciously consistent behavior
- Circular driving IS very consistent: every circle is the same
- Legitimate driving is stochastic: different obstacles, curves, luck
- Trial 5: cv = 0.0% over 3 eval episodes, textbook circling
**Why v2 reward still allowed this:**
- v2 fix: `reward = original × (1 + speed_scale × speed)` ONLY when on track
- Car circling at the starting line HAS: low CTE (on track centerline) + positive speed
- Result: full speed bonus while circling, 4582 reward over 4787 steps
- CTE and raw speed cannot distinguish forward from circular motion
### Root Cause: Missing Dimension — Track Progress
The fundamental issue: **neither CTE nor speed captures PROGRESS along the track.**
- CTE measures: am I near the centerline? (yes for circles)
- Speed measures: am I moving? (yes for circles)
- Progress measures: am I getting anywhere new? (NO for circles)
### Fix: Path Efficiency Reward (v3)
**Formula:**
```
efficiency = net_displacement / total_path_length (over sliding window of 30 steps)
shaped_reward = original_reward × (1 + speed_scale × speed × efficiency)
```
**Why this works:**
- Forward driving: `efficiency ≈ 1.0` (all movement is productive)
- Circular driving: `efficiency ≈ 0.0` (lots of steps, car returns to start position)
- The speed bonus disappears when circling, so the car is incentivized to drive FORWARD
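The efficiency computation is simple enough to verify in isolation. A sketch over a window of (x, y) positions, assuming plain Euclidean geometry (`path_efficiency` here is an illustrative reimplementation, not the wrapper's code):

```python
import math

def path_efficiency(positions):
    """Net displacement / total path length over a window of (x, y) points.

    ~1.0 for straight driving (all movement is productive),
    ~0.0 when the car loops back onto its own starting position.
    """
    total = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    if total == 0:
        return 0.0
    net = math.dist(positions[0], positions[-1])
    return net / total

straight = [(i, 0.0) for i in range(30)]
circle = [(math.cos(t), math.sin(t))
          for t in (2 * math.pi * i / 30 for i in range(31))]  # one full loop
print(path_efficiency(straight))           # 1.0
print(round(path_efficiency(circle), 3))   # 0.0: car returns to its start
```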
**Proof (tests):**
- `test_efficiency_near_zero_for_circular_driving`: efficiency < 0.2 after full circle
- `test_efficiency_near_one_for_straight_driving`: efficiency > 0.90 for straight line
- `test_straight_driving_gets_higher_reward_than_circular`: key guarantee test
**Data archived:**
- `autoresearch_results_phase1_CORRUPTED_circular_driving.jsonl` (12 records, circular)
- `models/ARCHIVED_circular_driving/` (trial-0001 through trial-0013)
### Lesson: cv% is a Reward Hacking Indicator
| cv% | Interpretation |
|------|----------------|
| < 1% + high reward | Likely reward hacking (very consistent exploit) |
| 1-10% | Normal RL variance |
| > 50% | Unstable policy, inconsistent behavior |
This metric will be added to the autoresearch result logging and summary.
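The table above translates directly into a classifier for the result logging. A sketch, assuming a hypothetical 500-reward cutoff for "high reward" (neither the threshold nor the function name comes from the codebase):

```python
def classify_cv(mean_reward, std_reward, high_reward_threshold=500.0):
    """Classify eval behaviour from the coefficient of variation (std/mean, in %)."""
    cv = 100.0 * std_reward / mean_reward if mean_reward else float("inf")
    if cv < 1.0 and mean_reward > high_reward_threshold:
        return "suspect-hacking", cv   # very consistent exploit
    if cv > 50.0:
        return "unstable", cv          # inconsistent policy
    return "normal", cv

print(classify_cv(4582.80, 0.485)[0])  # suspect-hacking (trial 5)
print(classify_cv(682.74, 420.91)[0])  # unstable (trial 10)
```

Note that the cv rule is a red flag, not a verdict: a genuinely good, deterministic policy can also score a very low cv, so flagged trials still need visual inspection.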
---
## 2026-04-13 — 🏆 PHASE 1 MILESTONE: Genuine Track Driving Confirmed!
### Finding: Champion Model Drives the Track — Real RL Behaviour Proven
**This is the first confirmed genuine driving result from the autoresearch pipeline.**
**Visual confirmation (user):** "It is definitely driving! The donkeycar is driving along the track!"
**Evaluation data — 3 episodes, 1500 max steps:**
| Episode | Steps | Total Reward | Std | Efficiency |
|---------|-------|-------------|-------|------------|
| 1 | 599 | 1022.73 | — | 96-100% |
| 2 | 598 | 1023.35 | — | 96-100% |
| 3 | 599 | 1022.25 | — | 96-100% |
| **Mean** | **599** | **1022.78** | **0.45** | **~99%** |
**Champion Model Parameters:**
- agent: PPO, n_steer=7, n_throttle=3, lr=0.000680, timesteps=4787
- Path: `agent/models/champion/model.zip`
### Track Trajectory Analysis
```
Start: Pos(6.25, 6.30) → Starting line
Step 300: Pos(22.80, 2.09) → Long straight, approaching first corner
Step 400: Pos(18.80, -6.96) → Negotiating first right-hand curve ✅
Step 500: Pos(28.12, -5.61) → Continuing along second straight
Step 560: Pos(33.12, -6.55) → Approaching second corner
Step 599: CRASH CTE=8.26 → Off track at second corner ❌
```
The car successfully:
- Accelerates from 0 → 2.3 m/s along the straight
- Navigates the first right-hand curve
- Follows the track for ~600 steps covering ~30+ position units
### Failure Analysis: The S-Curve Crash
**User observation:** "The spot where the donkeycar goes off the track is during a right hand curve which quickly turns into a left hand curve. It doesn't even look like it sees the left hand curve."
**What the data shows:**
- Steps 540-560: CTE briefly near zero (0.24) — car approaches corner well
- Steps 570+: CTE explodes 1.4 → 3.8 → 5.9 → 8.3 — car overshoots
- Speed at crash: 2.23-2.30 m/s — too fast for the S-curve
**Root cause:** Only 4787 training timesteps — insufficient to learn:
1. Speed reduction approaching corners
2. Left-turn recovery after right-hand overshoot
3. S-curve geometry (right → quick left transition)
**Key insight: The model never sees the left-hand curve** because it has always crashed at the right-hand part first during training. This is an exploration problem — the car needs more timesteps to get past this point and discover what's beyond.
### Reward Shaping Victory
All 3 reward hacking fixes proved necessary and correct:
- v1 additive → boundary oscillation exploit
- v2 multiplicative → circular driving exploit
- v3 path efficiency → genuine forward driving ✅
The path efficiency metric (96-100% throughout entire run) confirms the car is making continuous forward progress — not circling, not oscillating.
### Phase 1 → Phase 2 Transition
**Phase 1 objective achieved:** A real PPO model drives the DonkeyCar track with genuine forward motion, consistent behaviour (std=0.45), and correct trajectory.
**Next objective (targeted autoresearch):** Learn corner handling and speed modulation.
- Increase timesteps to 10,000-50,000 per trial
- The model needs to see the S-curve many times to learn the transition
- Consider adding a CTE-rate-of-change penalty to discourage high speed at high CTE
### This is Research!
The reward hacking discovery and the progression from random walk → boundary oscillation → circular exploit → genuine driving represent real empirical RL research. Each failure mode revealed a fundamental property of reward design. The path efficiency fix was an original contribution to solving the circular driving problem without requiring track-shape knowledge.