# Test History — DonkeyCar RL Autoresearch
Last updated: 2026-04-18
This document records every significant training experiment, what was
changed, what was observed, and what was learned. Use this to make
methodical decisions rather than random changes.
---
## Baseline Models (Phase 1 & 2)
### Phase 2 Champion
- **Model:** `models/champion/model.zip`
- **Track trained on:** generated_road only
- **Steps:** 13,328
- **Hyperparams:** lr=0.000225, PPO continuous actions, ThrottleClamp(0.2), v4 reward
- **Result:** ✅ Drives generated_road perfectly, stays in right lane
- **Zero-shot:** ❌ Fails on generated_track (confirmed), ❌ Fails on mini_monaco
- **Notes:** Single track, simple road, model converged cleanly. Final model = best model (no divergence in 13k steps)
---
## Mountain Track Experiments
All experiments: mountain_track only, lr=0.000725, throttle_min varies, 90k steps unless noted otherwise
### Exp 1 — Mountain track, old v4 reward, throttle_min=0.2
- **Reward:** v4 (CTE × efficiency × speed)
- **throttle_min:** 0.2
- **Key observation:** Car gets partway up hill, slows, stops, rolls back. Always crashes at roughly the same step (~153-166). Logged throttle at the hill was 0.200 = not enough power
- **Root cause:** v4 reward gives zero gradient signal on hill (efficiency→0, speed→0, reward→0 simultaneously, no direction for "apply more throttle")
- **Learned:** v4 reward is broken for inclined terrain
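The zero-gradient failure can be illustrated with a minimal sketch (the individual factor definitions are assumptions; the point is that a pure product collapses when any factor hits zero):

```python
def v4_reward(cte_quality: float, efficiency: float, speed: float) -> float:
    """Illustrative sketch of the v4 'CTE x efficiency x speed' shape.

    On a hill the car decelerates toward a stop, so efficiency and
    speed approach 0 together. The product vanishes, and so does its
    partial derivative with respect to each single factor, leaving
    PPO no signal that applying more throttle would help.
    """
    return cte_quality * efficiency * speed
```

Even a perfect line (cte_quality=1.0) earns nothing once speed reaches 0, and nudging efficiency alone changes nothing while speed stays 0.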
### Exp 2 — Mountain track, old v4 reward, throttle_min=0.2, continued to 200k
- **Reward:** v4
- **throttle_min:** 0.2
- **Key observation:** Only 2 behaviors: turn left and hit barrier, or go straight and hit barrier at turn
- **Result:** ❌ Killed early — no improvement
- **Learned:** More steps alone cannot fix a broken reward signal
### Exp 3 — Mountain track, old v4 reward, throttle_min=0.5
- **Reward:** v4
- **throttle_min:** 0.5 (increased to overcome hill)
- **Key observation:** Circle exploit dominated entire run — 0.5-1.75 second laps throughout
- **Lap times logged:** All short (exploit)
- **Result:** ❌ Model useless (reward=4.99 after 90k steps)
- **Learned:** Higher throttle got car over hill but circle exploit took over because v4 has no efficiency penalty when throttle is high
### Exp 4 — Continued from Exp 3 (200k total), old v4 reward, throttle_min=0.5
- **Reward:** v4
- **throttle_min:** 0.5
- **Key observation:** Killed early — same 2 behaviors (left into barrier, straight into barrier)
- **Result:** ❌ Killed
- **Learned:** Continuing bad training does not help
### Exp 5 — Mountain track, v5 reward, throttle_min=0.5 ⭐ KEY EXPERIMENT
- **Reward:** v5 (speed × CTE-quality) — NEW reward that directly incentivises throttle on hills
- **throttle_min:** 0.5
- **Method:** Direct model.learn() — NO train_multitrack(), ONE connection throughout
- **Key observation:** Genuine 20-22 second laps appearing from step ~30,000 onward
- **Lap times:** 19-22 seconds (genuine), consistently for 60k steps
- **Result:** ❌ Final model poor — best model was at step ~30k but we only saved final (step 90k) model
- **Root cause of failure:** No best-model saving. Policy peaked at 30k, diverged by 90k
- **Learned:**
1. v5 reward WORKS for mountain track
2. throttle_min=0.5 WORKS for hill
3. Direct model.learn() (no track switching) avoids phantom car issues
4. MUST save best model during training, not just final
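Lesson 4 can be sketched as follows (`save_fn` and the class name are illustrative, not the repo's API): evaluate after each segment and overwrite the best model only on a new best score.

```python
class BestModelTracker:
    """Keep best_model.zip at the peak segment, not the final one.

    save_fn stands in for something like model.save(path); the exact
    call depends on the training framework.
    """

    def __init__(self, save_fn):
        self.best_score = float("-inf")
        self.best_step = None
        self.save_fn = save_fn

    def update(self, step, eval_score):
        """Call after each segment's evaluation; returns True on a new best."""
        if eval_score > self.best_score:
            self.best_score = eval_score
            self.best_step = step
            self.save_fn("best_model.zip")
            return True
        return False
```

Had this run during Exp 5, the step-30k peak would have been preserved even though training continued to 90k.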
### Exp 6 — Mountain track, v5 reward, throttle_min=0.5, train_multitrack (1 segment)
- **Reward:** v5
- **throttle_min:** 0.5 (first segment only — close_and_switch used 0.2 for subsequent segments)
- **Method:** train_multitrack() with steps_per_switch=90000 (one giant segment = one checkpoint)
- **Key observation:** Circle exploit dominated — only 0.5-1.75 second laps throughout
- **Result:** ❌ Only 1 checkpoint saved (at step 90k). Best reward=4.99
- **Root cause:** Using steps_per_switch=TOTAL_STEPS defeated checkpointing (one segment = one save). Circle exploit reappeared (different from Exp5 — random seed variation)
- **Learned:** steps_per_switch=TOTAL_STEPS is WRONG for single-track training with checkpointing
### Exp 7 — Mountain track, v5 reward + episode termination on short lap, throttle_min mixed
- **Reward:** v5 + short-lap now TERMINATES episode (not just penalty)
- **throttle_min:** 0.5 initial, 0.2 after segment 1 (bug: close_and_switch used module default)
- **Method:** train_multitrack() with steps_per_switch=6000 (15 segments)
- **Key observation:** Car in LEFT lane, sitting doing nothing. Not normal spawn position.
- **Hypothesis:** Phantom car from Exp6's ghost car still in sim. Two TCP connections spawned two cars. User watched phantom (left lane, no commands). Training went to different car.
- **Result:** ❌ Killed — phantom car issue
- **Learned:**
1. close_and_switch() between segments creates phantom car risk for single-track training
2. throttle_min MUST be passed consistently — module default is 0.2, not 0.5
3. For single-track training: do NOT use close_and_switch() at all
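The short-lap rule from Exp 7 can be sketched like this (the 5-second threshold is an assumption; exploit laps were 0.5-1.75 s while genuine laps were ~20 s):

```python
MIN_LAP_SECONDS = 5.0  # assumed cutoff between exploit laps and honest laps

def on_lap_complete(lap_time):
    """Sketch of the short-lap termination rule.

    A lap far faster than any honest lap can only be the circle
    exploit, so terminate the episode (with a penalty) rather than
    penalise and continue; termination makes circling strictly
    unprofitable. Returns (terminated, reward_adjustment).
    """
    if lap_time < MIN_LAP_SECONDS:
        return True, -1.0
    return False, 0.0
```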
### Exp 8 — Mountain track, v5 reward + episode termination, throttle_min=0.5 consistently (RUNNING NOW)
- **Reward:** v5 + short-lap terminates episode
- **throttle_min:** 0.5 throughout (no close_and_switch = no module default override)
- **Method:** Direct model.learn() in loop — ONE connection throughout entire run
- **Checkpoints:** 15 numbered saves (every 6,000 steps) + best_model.zip
- **PID:** 2941877, log: /tmp/exp8.log
- **Status:** Running since 11:17, ~1h45m total
- **Watch:** `tail -f /tmp/exp8.log`
- **Success criteria:** Genuine 19-22 second laps appearing during training AND best_model.zip drives cleanly in deterministic eval
---
## Wave 4 Multi-Track Experiments (generated_track + mountain_track)
### Trial 9 ⭐ BEST OVERALL MODEL
- **Model:** `models/wave4-trial-0009/model.zip`
- **Tracks:** generated_track + mountain_track (round-robin, switch every 6,851 steps)
- **Steps:** 89,893 total (~45k per track)
- **Hyperparams:** lr=0.000725, switch=6,851
- **Reward:** v4 (old — before exploit patches)
- **Result:**
- ✅ Drives generated_track (3/3 episodes, 13-16 second genuine laps)
- ✅ Drives mini_monaco zero-shot (2000 steps, 40-second genuine laps — never seen in training)
- ❌ Crashes on mountain_track (~200 steps — hill + corner)
- ❌ Crashes on generated_road (~46 steps — turns right immediately)
- **Notes:** Only 1 of 25 Wave 4 trials succeeded. Suspected random seed luck. Same hyperparameters repeated in Exp2 (overnight) produced useless model.
### Wave 4 Other Trials (1-25 except Trial 9)
- **Result:** All crashed on mini_monaco within 20-265 steps
- **Median mini_monaco score:** ~112 (crashes at ~130 steps)
- **Trials 14, 25:** Scored 1573, 1543 — suspected shuttle exploit (car going back and forth on straight)
- **Learned:** Multi-track training is highly sensitive to random seed. GP+UCB did not converge reliably.
---
## Key Decisions Made (What We Keep)
| Decision | Reason |
|---|---|
| v5 reward: `speed × CTE-quality` | Directly incentivises throttle on hills. v4 gave zero gradient on inclines. |
| throttle_min=0.5 for mountain_track | Overcomes hill. Car can now reach first corner. |
| Short-lap penalty + EPISODE TERMINATION | Penalty alone insufficient — model stayed alive and accumulated rewards between laps. Termination makes circling strictly unprofitable. |
| Numbered checkpoints every segment | Never lose a good mid-training model again (ADR-017) |
| best_model.zip updated on new best segment score | Final model ≠ best model. Peak can be at 30k even if final is at 90k. |
| Single TCP connection for single-track training | Avoids phantom car problem from close_and_switch() |
| lr=0.000725 | From Trial 9 (best model). Consistent with good results. |
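The v5 shape in the first row could be sketched as follows (`max_cte` and the linear falloff are assumptions; the repo's exact formula may differ):

```python
def v5_reward(speed: float, cte: float, max_cte: float = 2.0) -> float:
    """Sketch of v5: speed x CTE-quality.

    cte_quality is 1.0 on the centre line, falling linearly to 0.0 at
    max_cte. Because speed enters as a plain factor, the derivative of
    reward with respect to speed stays positive anywhere on the road,
    so throttle is rewarded even on an incline, unlike v4, which
    collapsed to zero there.
    """
    cte_quality = max(0.0, 1.0 - abs(cte) / max_cte)
    return speed * cte_quality
```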
## Key Problems Still Open
| Problem | Status |
|---|---|
| Mountain track circle exploit | Partially fixed — episode termination added. Exp8 will show if it holds. |
| Mountain track — car can't navigate first corner reliably | Still being investigated. Exp5 showed genuine laps so it IS solvable. |
| Multi-track generalization is random-seed dependent | No reliable solution yet. Trial 9 was lucky. |
| Mountain track model doesn't generalise to other tracks | Expected — single track training generalises poorly. Next step after Exp8 succeeds. |
---
## Next Steps (Proposed, Not Yet Run)
1. **Exp 8 result:** If best_model.zip drives mountain_track reliably → proceed to Step 2
2. **Combine mountain_track + generated_track** using v5 reward, throttle_min=0.5, proper checkpointing
3. **Test combined model** on all 4 tracks — can it generalise to mini_monaco like Trial 9 did?
4. **If yes:** We have reproduced Trial 9 reliably with a better reward function
### Exp 8 — Mountain track, v5 reward, throttle_min=0.5, CORRECT checkpointing ✅ COMPLETED
- **Reward:** v5 (speed × CTE-quality)
- **throttle_min:** 0.5
- **Method:** Direct model.learn() loop, single TCP connection, NO close_and_switch
- **Steps:** 90,000 total | 6,000 per segment | 15 checkpoints
- **Circle exploit fix:** Short-lap terminates episode immediately
- **Peak segment:** Seg 10 (step 60,000) — 567 reward / 2000 steps (FULL EVAL on mountain_track!)
- **Policy diverged:** Seg 11-15 (31, 20 reward) — best_model.zip captured the peak correctly
- **Checkpoints saved:** checkpoint_0006000.zip through checkpoint_0090000.zip + best_model.zip
- **Final eval results using best_model.zip (step 60k weights):**
| Track | Ep1 | Ep2 | Ep3 | Mean steps | Result |
|---|---|---|---|---|---|
| mountain_track (training) | 382 | 529 | 182 | 364 | ❌ crashes |
| generated_track (zero-shot) | 63 | 61 | 61 | 62 | ❌ crashes |
| mini_monaco (zero-shot) | 154 | 155 | 104 | 138 | ❌ crashes at one corner |
| generated_road (zero-shot) | 41 | 42 | 41 | 41 | ❌ crashes |
- **Throttle test:** mini_monaco at throttle_min=0.5 over 5 episodes: 93/94/79/95/94 steps (mean=91, very consistent = same corner every time). throttle_min=0.2 test impossible — action space baked in at training time.
- **Key findings:**
1. ✅ Circle exploit fully eliminated — no short laps observed
2. ✅ Best model saving worked — captured step 60k peak, not step 90k drift
3. ✅ Genuine 20-22 second laps during training from step ~18k onward
4. ❌ Model crashes at exactly the same corner on mini_monaco every time (too fast)
5. ❌ throttle_min=0.5 baked into action space — model cannot output throttle < 0.5, cannot slow for corners
6. 🔑 INSIGHT: v4 + 0.2 failed because v4 gradient = 0 on hill. v5 gradient is non-zero → model CAN learn to apply high throttle when needed, even with the 0.2 floor
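Finding 5 follows from how a throttle floor is typically baked into a continuous action space (the affine rescaling below is an assumption about the wrapper; the repo's exact mapping may differ):

```python
def scale_throttle(raw_action, throttle_min, throttle_max=1.0):
    """Map the policy's raw output in [-1, 1] onto [throttle_min, throttle_max].

    With throttle_min=0.5 the smallest throttle the model can ever
    command is 0.5, so it cannot slow for corners; retraining with
    throttle_min=0.2 restores the usable range [0.2, 1.0].
    """
    raw_action = max(-1.0, min(1.0, raw_action))
    return throttle_min + (raw_action + 1.0) * (throttle_max - throttle_min) / 2.0
```

This is why the throttle_min=0.2 test was impossible post hoc: the mapping is part of the trained policy's action space, not an inference-time knob.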
### Exp 9 — Mountain track, v5 reward, throttle_min=0.2 (RUNNING)
- **Change from Exp8:** throttle_min: 0.5 → **0.2** (only change)
- **Reward:** v5 (speed × CTE-quality) UNCHANGED
- **Hypothesis:** v5 reward provides non-zero gradient signal on hill (∂reward/∂speed is non-zero).
Model CAN learn to output high throttle on hill. With 0.2 floor, model has full range [0.2, 1.0]
and can apply lower throttle on corners, potentially solving the mini_monaco corner crash.
- **What we never tested:** (0.2, v4) failed. (0.5, v5) worked. (0.2, v5) was never tried.
- **Risk:** Model may still stall on hill if gradient convergence is slow in early training.
StuckTermination (-1.0) + v5 speed gradient together should push toward higher throttle.
- **Next test (Exp10):** Add track_progress bonus to reward (v6), changing one variable at a time.
- **Save dir:** models/exp9-mountain-v5-throttle02/
- **Watch:** tail -f /tmp/exp9.log
### Exp 9 — Evaluation Results (3-set test, 1 run per track per set)
**Model tested:** `models/exp9-mountain-v5-throttle02/best_model.zip`
**Date:** 2026-04-18
**Test setup:** 3 independent sets, lighting randomises each run (no fixed seed)
| Track | Set 1 | Set 2 | Set 3 | Mean | Pattern |
|---|---|---|---|---|---|
| mountain_track (trained) | 2000 | 2000 | 2000 | **2000** | Rock solid |
| generated_track (zero-shot) | 79 | 61 | 82 | **74** | Always fails: can't make first corner |
| generated_road (zero-shot) | 651 | 2000 | 1203 | **1285** | Highly variable: lighting-dependent |
| mini_monaco (zero-shot) | 32 | 60 | 34 | **42** | Always fails: veers right immediately |
**User observations:**
- mountain_track: 80-90% of time on or near centre yellow line. Solid driving.
- generated_road: Driving looks good when it works, but goes off course. Lighting variation causes inconsistency.
- generated_track: Cannot make first corner at all. Model sees nothing it recognises.
- mini_monaco: Veers right immediately at start before any visible driving. Crashes before reaching the road.
**Key finding — Lighting effect confirmed:**
Generated_road varies 651 → 2000 → 1203 across sets with identical model and track. ONLY lighting changes.
Mountain_track is immune because it trained under many random lighting conditions.
Generated_track and mini_monaco fail regardless of lighting: the visual domain is too different.
**What this tells us about next steps:**
Train on mountain_track + generated_track together (v5 reward, throttle_min=0.2).
Both tracks have random lighting each episode, so the model is forced to learn lighting-invariant features.
Goal: model that is reliable on both training tracks, then test generalisation to generated_road and mini_monaco.
### Exp 10 — Two tracks: generated_track + mountain_track, v5 reward, throttle_min=0.2
- **Change from Exp9:** Added generated_track as second training track
- **Reward:** v5 (speed × CTE) unchanged
- **throttle_min:** 0.2 unchanged from Exp9
- **Training tracks:** generated_track + mountain_track (round-robin, switch every 6,000 steps)
- **Total steps:** 90,000 | Steps per switch: 6,000 | ~7.5 rotations through both tracks
- **lr:** 0.000725 unchanged
- **Hypothesis:** Adding generated_track visual diversity forces model to learn
lighting-invariant road-following features. Mountain_track teaches hill throttle.
Together should generalise better to generated_road and potentially mini_monaco.
- **Expected results:** mountain_track reliable, generated_track reliable,
generated_road improved, mini_monaco TBD
- **This is essentially Trial 9 repeated with:** v5 reward + throttle_min=0.2 + proper checkpointing + exploit fix
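As a sanity check on the segment arithmetic above, the round-robin plan can be sketched (function name and shape are illustrative, not the repo's train_multitrack API):

```python
def round_robin_schedule(tracks, total_steps, steps_per_switch):
    """Build the (track, steps) segment list for round-robin training.

    Alternates through the track list every steps_per_switch steps
    until total_steps is reached; the last segment is truncated if
    total_steps is not a multiple of steps_per_switch.
    """
    schedule, done, i = [], 0, 0
    while done < total_steps:
        seg = min(steps_per_switch, total_steps - done)
        schedule.append((tracks[i % len(tracks)], seg))
        done += seg
        i += 1
    return schedule

plan = round_robin_schedule(["generated_track", "mountain_track"], 90_000, 6_000)
# 15 segments of 6,000 steps, alternating tracks (7.5 rotations through the pair)
```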