Research Log — DonkeyCar RL Autoresearch
Chronological research findings, discoveries, bugs, and decisions. Every significant observation is recorded here for scientific reproducibility and future reference. Format: date, finding, evidence, action taken.
2026-04-12 — Project Kickoff and Initial Infrastructure
Finding: Grid Sweep as Research Baseline
Observation: Before any autoresearch, we ran an 18-config grid sweep across:
- `n_steer`: [3, 5, 7]
- `n_throttle`: [2, 3]
- `learning_rate`: [0.001, 0.0005, 0.0001]
- 3 repeats each
Important caveat discovered later: This sweep used a random action policy (bug — model training code had been removed). The rewards reflect how well a random policy can stumble through different action discretizations.
Valid insight from this data: Action discretization matters even for random policy.
n_steer=7, n_throttle=2 outperformed n_steer=3, n_throttle=2 with random actions — more steering granularity helps even without learning.
Data location: `outerloop-results/clean_sweep_results.jsonl` (18 records)
2026-04-12 — Discovery: Random Policy Bug (Critical)
Finding: Inner Loop Was Never Training
Observation: `donkeycar_sb3_runner.py` was calling `env.action_space.sample()` instead of `model.learn()`. This was introduced when we removed the broken `model.save()` call that caused `NameError: name 'model' is not defined`.
Root cause: Legacy code path removal was too aggressive — removed training along with the broken save call.
Impact:
- All 300 autoresearch trials (two overnight runs) used random policy
- The `learning_rate` parameter was passed but completely ignored
- `mean_reward` values reflect random-walk quality, not RL training quality
- The GP+UCB found the best action space for random walking, not the best hyperparameters for learning
Valid salvage: The n_steer=8, n_throttle=5 finding is valid as a discretization insight.
Invalid: All learning_rate optimization in the 300-trial autoresearch runs.
Fix: Completely rebuilt the runner with real `PPO.learn()` + `evaluate_policy()` + `model.save()`.
Decision record: ADR-005 — Never call `model.save()` before the model is defined.
2026-04-12 — Autoresearch Infrastructure Proven
Finding: GP+UCB Autoresearch Works Correctly
Observation: The GP+UCB meta-controller correctly:
- Loads prior results and fits a Gaussian Process
- Uses UCB acquisition to balance exploration/exploitation
- Proposes parameters outside the original grid (e.g., `n_steer=6` was never in the grid)
- Converges toward higher-reward regions with each trial
Evidence: After 300 trials, the top-5 consistently clustered around n_steer=7-9, n_throttle=4-5, lr≈0.002 — a coherent high-reward region.
Conclusion: The infrastructure is sound. The data was from wrong experiments, but the meta-loop works exactly as designed.
2026-04-13 — Phase 1 Launch: First Real Training Attempt
Finding: Timeout — PPO+CNN is Too Slow on CPU for Large Timesteps
Observation: First Phase 1 run with real PPO training proposed 20k-30k timesteps.
At ~5-10 steps/sec (PPO+CNN training on CPU), this requires 2000-6000 seconds per trial — far exceeding the 600-second timeout.
Evidence: Trials 1-6 all timed out at exactly 600 seconds.
Fix: Reduced timestep search space from [5000, 30000] to [1000, 5000].
At ~15-30 steps/sec (DonkeyCar sim speed), 5000 steps ≈ 170-330 seconds. Fits within 480s timeout.
Lesson: Always calibrate timeout to actual sim + training speed before launching sweeps.
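The calibration lesson can be encoded as a simple pre-launch check. This is a sketch, not project code: the helper name and the 60-second overhead buffer (env startup, evaluation, model saving) are assumptions.

```python
def fits_timeout(timesteps, steps_per_sec, timeout_s, overhead_s=60.0):
    """Estimate whether a trial's training wall time fits its timeout.

    Wall time is approximated as timesteps / throughput plus a fixed
    overhead buffer (assumed 60 s) for setup, evaluation, and saving.
    """
    return timesteps / steps_per_sec + overhead_s <= timeout_s

# 5000 steps at the sim's ~15 steps/sec fits the 480 s timeout;
# 30000 steps at a slow CPU training rate does not fit 600 s.
print(fits_timeout(5000, 15.0, 480.0))
print(fits_timeout(30000, 10.0, 600.0))
```

Running a check like this before every sweep would have caught trials 1-6 without burning an hour of timeouts.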
2026-04-13 — Discovery: Car Not Moving (PPO Throttle Problem)
Observation: During early Phase 1 training, the car's steering values changed but the car did not move.
Root cause: PPO with continuous action space outputs actions in [-1, 1] for all dimensions.
DonkeyCar expects throttle ∈ [0, 1]. When PPO's random initial policy outputs throttle ≈ -0.5, it gets clipped to 0 — the car sits still.
Fix: Added ThrottleClampWrapper that ensures throttle ∈ [0.2, 1.0].
This guarantees the car always moves forward, even before any learning.
Impact: Without this fix, the car never moves and the health check detects it as a stuck sim, prematurely killing training.
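The clamp logic can be sketched as a plain function. The `[steer, throttle]` action layout and the function itself are assumptions for illustration; the project's actual wrapper lives in its env-wrapper code.

```python
def clamp_throttle(action, lo=0.2, hi=1.0):
    """Sketch of the ThrottleClampWrapper idea: steering passes through
    unchanged, throttle is forced into [lo, hi] so an untrained policy's
    negative throttle outputs no longer clip to zero (a stuck car)."""
    steer, throttle = action
    return (steer, min(max(throttle, lo), hi))

# An untrained PPO policy emitting throttle=-0.5 still moves the car:
print(clamp_throttle((-0.3, -0.5)))
```

The lower bound of 0.2 (rather than 0.0) is what guarantees forward motion before any learning.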
2026-04-13 — Critical Discovery: Reward Hacking via SpeedRewardWrapper 🚨
Finding: Model Learned to Exploit Speed Reward by Oscillating at Track Boundary
Observation: After fixing throttle and timestep issues, Phase 1 trials ran successfully.
Some trials produced suspiciously high rewards:
| Trial | mean_reward | n_throttle | lr | verdict |
|---|---|---|---|---|
| 8 | 1936.9 | 2 | 0.00145 | 🚨 HACKED |
| 13 | 1139.4 | 2 | 0.00058 | 🚨 HACKED |
| 11 | 439.9 | 3 | 0.00048 | ⚠️ Suspicious |
| 2 | 398.9 | 2 | 0.00236 | ⚠️ Suspicious |
Root cause: The SpeedRewardWrapper computed:

```python
reward = speed * (1 - abs(cte) / max_cte)
```
The model discovered a policy that maximizes this formula without genuine track driving:
- Drive fast toward the track boundary
- Return to track center (momentarily low CTE = high reward)
- Repeat — "oscillation farming"
The crash penalty (-10) was insufficient to deter this because thousands of oscillation steps accumulate far more positive reward.
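A back-of-the-envelope sketch of the exploit's economics, using the v1 formula as logged. The speed (5 m/s) and crash penalty (-10) come from this entry; `max_cte=8.0` and the alternating CTE values are assumptions for illustration.

```python
def v1_reward(speed, cte, max_cte=8.0):
    """The v1 SpeedRewardWrapper formula from the log (max_cte assumed)."""
    return speed * (1 - abs(cte) / max_cte)

# Oscillation farming: swing to the boundary, snap back to center,
# always at full speed. Each centered step pays almost full reward.
total = 0.0
for i in range(2000):
    cte = 7.5 if i % 2 == 0 else 0.5  # boundary / center alternation
    total += v1_reward(5.0, cte)
total -= 10  # a single crash penalty barely dents the haul
```

Under these assumed numbers the policy banks roughly 2.5 reward per step, so the -10 penalty is noise, which is exactly the imbalance the log describes.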
Physical impossibility check: a car driving at max speed (≈5 m/s) perfectly centered for 3429 steps would accumulate ≈ 5.0 × 1.0 × 3429 = 17,145. The observed max of 1937 is technically possible, but the high variance (std_reward=34 over only 3 eval episodes) together with the user's direct observation confirms hacking.
User observation (direct visual confirmation): "The model found a way to rig the reward by just going left — it was off the track and then back on the track."
Impact: The entire Phase 1 dataset with reward_shaping=True is corrupted.
The GP fitted on these rewards was optimizing for hacking parameters, not driving parameters.
Action taken:
- Archived all Phase 1 results: `autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl`
- Archived hacked models: `models/ARCHIVED_reward_hacking/`
- Redesigned reward function entirely
2026-04-13 — Fix: Hack-Proof Reward Shaping Design
Finding: Multiplicative Speed Bonus Prevents Reward Hacking
Problem with the v1 formula: `reward = speed × f(cte)` pays out step by step, so momentary returns to the centerline at high speed farm reward regardless of genuine progress — speed can be held at maximum independently of sustained low CTE.
Solution — multiplicative on-track bonus:

```python
if original_reward > 0:
    shaped = original_reward * (1 + speed_scale * speed)
else:
    shaped = original_reward  # No speed bonus when off track
```
Why this is hack-proof:
- `original_reward > 0` is ONLY true when the car is on track AND centered (DonkeyCar's own CTE signal)
- When off track, `original_reward ≤ 0` — no speed reward possible
- The model cannot increase reward by going fast off-track
- The formula is bounded: `shaped ≤ original_reward × (1 + speed_scale × max_speed)`
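The branch above can be packaged as a function to make the gating explicit. A minimal sketch: the function name and `speed_scale=0.5` are assumptions, not the project's actual wrapper.

```python
def v2_shaped(original_reward, speed, speed_scale=0.5):
    """v2 multiplicative on-track bonus (speed_scale value assumed).

    Speed only amplifies a reward that is already positive; off track
    the original (zero or negative) reward passes through untouched.
    """
    if original_reward > 0:
        return original_reward * (1 + speed_scale * speed)
    return original_reward  # off track: no speed bonus possible
```

On track, `v2_shaped(1.0, 4.0)` triples the base reward; off track, any amount of speed changes nothing.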
Author's insight: "Speed should only be rewarded if you are progressing down the track."
Implementation: agent/reward_wrapper.py — SpeedRewardWrapper v2.
2026-04-13 — Lesson: Reward Function Design Principles
From this experience, we derived the following principles for DonkeyCar RL reward shaping:
- Never reward speed unconditionally. Speed reward must be gated on track presence.
- The original DonkeyCar reward is the ground truth. Any shaping must respect it, not replace it.
- Multiplicative bonuses are safer than additive. They can't be maximized independently.
- High variance in eval reward is a red flag. `std_reward=34` on 3 episodes suggests instability.
- Physically impossible reward values signal hacking. Establish theoretical reward bounds before training.
- Low `n_throttle` (=2) may enable hacking. With only 2 throttle values, the model may discover degenerate oscillation policies more easily. Investigate.
Next Research Questions
- Does `n_throttle=2` uniquely enable hacking? The hacked models all had `n_throttle=2`. With only 2 throttle states (stop/full-throttle), oscillation may be easier to exploit.
- What is the minimum timestep count for genuine learning? The low-reward trials (5-22) may not have trained long enough. Is 3000 steps sufficient for any real driving behavior?
- Does the multiplicative reward fix change the optimal hyperparameter region? Re-run autoresearch with fixed reward and compare top configurations.
- Can we detect reward hacking automatically? A reward-per-step threshold (e.g., flag if mean > 2.0 per step) could auto-detect hacking during training.
- What does a genuinely good reward look like? After completing Phase 1 cleanly, characterize the reward distribution of a car that drives one full lap.
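One of the questions above proposes automatic hack detection via a per-step reward threshold. A minimal sketch: the threshold 2.0 comes from the log, but the function and its interface are assumptions.

```python
def reward_per_step_flag(mean_reward, steps, threshold=2.0):
    """Flag a trial whose mean per-step reward exceeds a plausibility
    threshold (2.0 is the value suggested in the log). A flag is a
    prompt for visual inspection, not proof of hacking."""
    return steps > 0 and mean_reward / steps > threshold

# A 600-step episode earning 1500 reward (2.5/step) gets flagged;
# the champion-style 1022.78 over 599 steps (~1.7/step) does not.
print(reward_per_step_flag(1500.0, 600))
print(reward_per_step_flag(1022.78, 599))
```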
2026-04-13 — Critical Discovery: Circular Driving Exploit (v2 Reward Still Hackable)
Finding: Car Learns to Circle at Starting Line
User observation (direct visual): "The model found a way to rig the reward by going left in circles — it was off the track and then back on track, but detected as failure. Model uses this as best way to maximize reward."
Data confirmation:
| Trial | mean_reward | std_reward | cv% | r/step | verdict |
|---|---|---|---|---|---|
| 1 | 270.56 | 0.143 | 0.1% | 0.086 | ⚠️ CIRCULAR (suspiciously low std) |
| 5 | 4582.80 | 0.485 | 0.0% | 0.957 | 🚨 CIRCULAR (confirmed) |
| 10 | 682.74 | 420.91 | 61.7% | 0.153 | ⚠️ UNSTABLE (sometimes circles, sometimes crashes) |
Statistical signature of circular motion:
- cv (coefficient of variation = std/mean) < 1% with high reward → very consistent behavior
- Circular driving IS very consistent: every circle is the same
- Legitimate driving is stochastic: different obstacles, curves, luck
- Trial 5: cv=0.0% over 3 eval episodes → textbook circling
Why v2 reward still allowed this:
- v2 fix: `reward = original × (1 + speed_scale × speed)` ONLY when on track
- Car circling at the starting line HAS: low CTE (on track centerline) + positive speed
- Result: full speed bonus for circling → 4582 reward over 4787 steps
- CTE and raw speed cannot distinguish forward from circular motion
Root Cause: Missing Dimension — Track Progress
The fundamental issue: neither CTE nor speed captures PROGRESS along the track.
- CTE measures: am I near the centerline? (yes for circles)
- Speed measures: am I moving? (yes for circles)
- Progress measures: am I getting anywhere new? (NO for circles)
Fix: Path Efficiency Reward (v3)
Formula:

```python
# efficiency is computed over a sliding window of 30 steps
efficiency = net_displacement / total_path_length
shaped_reward = original_reward * (1 + speed_scale * speed * efficiency)
```
Why this works:
- Forward driving: `efficiency ≈ 1.0` (all movement is productive)
- Circular driving: `efficiency ≈ 0.0` (lots of steps, car returns to start position)
- The speed bonus disappears when circling → car incentivized to go FORWARD
Proof (tests):
- `test_efficiency_near_zero_for_circular_driving`: efficiency < 0.2 after full circle
- `test_efficiency_near_one_for_straight_driving`: efficiency > 0.90 for straight line
- `test_straight_driving_gets_higher_reward_than_circular`: key guarantee test
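The window-efficiency computation those tests exercise can be sketched as follows. The window size of 30 comes from the log; the incremental class API and the return-1.0-on-short-history choice are assumptions, not the project's actual `agent/reward_wrapper.py`.

```python
import math
from collections import deque

class PathEfficiency:
    """Sliding-window ratio of net displacement to total path length.
    ~1.0 for straight forward driving, ~0.0 for circling in place."""

    def __init__(self, window=30):
        self.positions = deque(maxlen=window)

    def update(self, x, y):
        """Record the car's (x, y) position and return current efficiency."""
        self.positions.append((x, y))
        if len(self.positions) < 2:
            return 1.0  # too little history: assume productive motion
        pts = list(self.positions)
        path = sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))
        if path == 0:
            return 0.0  # standing still earns no speed bonus
        return math.dist(pts[0], pts[-1]) / path
```

A straight-line trajectory scores 1.0 (net displacement equals path length); a full circle scores near zero because the window's endpoints nearly coincide while the path length is the whole circumference.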
Data archived:
- `autoresearch_results_phase1_CORRUPTED_circular_driving.jsonl` (12 records, circular)
- `models/ARCHIVED_circular_driving/` (trial-0001 through trial-0013)
Lesson: cv% is a Reward Hacking Indicator
| cv% | Interpretation |
|---|---|
| < 1% + high reward | Likely reward hacking (very consistent exploit) |
| 1-10% | Normal RL variance |
| > 50% | Unstable policy, inconsistent behavior |
This metric will be added to the autoresearch result logging and summary.
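The table's heuristic can be sketched as a verdict function for the result logger. The cv bands come from the table; the high-reward cutoff of 1000 is an assumption, and low cv alone is not proof of hacking (a genuinely consistent policy is also low-cv, so flags warrant visual inspection).

```python
def cv_verdict(mean_reward, std_reward, high_reward=1000.0):
    """Classify an eval run by coefficient of variation (std/mean, %).

    Bands follow the log's table; high_reward threshold is assumed.
    A 'likely reward hacking' verdict is a flag, not a conviction.
    """
    if mean_reward == 0:
        return "undefined"
    cv = abs(std_reward / mean_reward) * 100.0
    if cv < 1.0 and mean_reward >= high_reward:
        return "likely reward hacking"
    if cv > 50.0:
        return "unstable policy"
    return "normal RL variance"

# Trial 5 (4582.80 ± 0.485) and trial 10 (682.74 ± 420.91) from the log:
print(cv_verdict(4582.80, 0.485))
print(cv_verdict(682.74, 420.91))
```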
2026-04-13 — 🏆 PHASE 1 MILESTONE: Genuine Track Driving Confirmed!
Finding: Champion Model Drives the Track — Real RL Behaviour Proven
This is the first confirmed genuine driving result from the autoresearch pipeline.
Visual confirmation (user): "It is definitely driving! The donkeycar is driving along the track!"
Evaluation data — 3 episodes, 1500 max steps:
| Episode | Steps | Total Reward | Std | Efficiency |
|---|---|---|---|---|
| 1 | 599 | 1022.73 | — | 96-100% |
| 2 | 598 | 1023.35 | — | 96-100% |
| 3 | 599 | 1022.25 | — | 96-100% |
| Mean | 599 | 1022.78 | 0.45 | ~99% |
Champion Model Parameters:
- agent: PPO, `n_steer=7`, `n_throttle=3`, `lr=0.000680`, `timesteps=4787`
- Path: `agent/models/champion/model.zip`
Track Trajectory Analysis
Start: Pos(6.25, 6.30) → Starting line
Step 300: Pos(22.80, 2.09) → Long straight, approaching first corner
Step 400: Pos(18.80, -6.96) → Negotiating first right-hand curve ✅
Step 500: Pos(28.12, -5.61) → Continuing along second straight
Step 560: Pos(33.12, -6.55) → Approaching second corner
Step 599: CRASH CTE=8.26 → Off track at second corner ❌
The car successfully:
- Accelerates from 0 → 2.3 m/s along the straight
- Navigates the first right-hand curve
- Follows the track for ~600 steps covering ~30+ position units
Failure Analysis: The S-Curve Crash
User observation: "The spot where the donkeycar goes off the track is during a right hand curve which quickly turns into a left hand curve. It doesn't even look like it sees the left hand curve."
What the data shows:
- Steps 540-560: CTE briefly near zero (0.24) — car approaches corner well
- Steps 570+: CTE explodes 1.4 → 3.8 → 5.9 → 8.3 — car overshoots
- Speed at crash: 2.23-2.30 m/s — too fast for the S-curve
Root cause: Only 4787 training timesteps — insufficient to learn:
- Speed reduction approaching corners
- Left-turn recovery after right-hand overshoot
- S-curve geometry (right → quick left transition)
Key insight: The model never sees the left-hand curve because it has always crashed at the right-hand part first during training. This is an exploration problem — the car needs more timesteps to get past this point and discover what's beyond.
Reward Shaping Victory
All 3 reward hacking fixes proved necessary and correct:
- v1 additive → boundary oscillation exploit
- v2 multiplicative → circular driving exploit
- v3 path efficiency → genuine forward driving ✅
The path efficiency metric (96-100% throughout entire run) confirms the car is making continuous forward progress — not circling, not oscillating.
Phase 1 → Phase 2 Transition
Phase 1 objective achieved: A real PPO model drives the DonkeyCar track with genuine forward motion, consistent behaviour (std=0.45), and correct trajectory.
Next objective (targeted autoresearch): Learn corner handling and speed modulation.
- Increase timesteps to 10,000-50,000 per trial
- The model needs to see the S-curve many times to learn the transition
- Consider adding a CTE-rate-of-change penalty to discourage high speed at high CTE
This is Research!
The reward hacking discovery and the progression from random walk → boundary oscillation → circular exploit → genuine driving represent real empirical RL research. Each failure mode revealed a fundamental property of reward design. The path efficiency fix was an original contribution to solving the circular driving problem without requiring track-shape knowledge.