Research Log — DonkeyCar RL Autoresearch
Chronological research findings, discoveries, bugs, and decisions. Every significant observation is recorded here for scientific reproducibility and future reference. Format: date, finding, evidence, action taken.
2026-04-12 — Project Kickoff and Initial Infrastructure
Finding: Grid Sweep as Research Baseline
Observation: Before any autoresearch, we ran an 18-config grid sweep across:
- n_steer: [3, 5, 7]
- n_throttle: [2, 3]
- learning_rate: [0.001, 0.0005, 0.0001]
- 3 repeats each
Important caveat discovered later: This sweep used a random action policy (bug — model training code had been removed). The rewards reflect how well a random policy can stumble through different action discretizations.
Valid insight from this data: Action discretization matters even for random policy.
n_steer=7, n_throttle=2 outperformed n_steer=3, n_throttle=2 with random actions — more steering granularity helps even without learning.
Data location: outerloop-results/clean_sweep_results.jsonl (18 records)
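For reference, the 3 × 2 × 3 = 18 configurations can be enumerated in a few lines of Python. This is an illustrative sketch only; run_trial is a hypothetical placeholder, not the actual sweep driver.

```python
# Illustrative enumeration of the 18-config grid (3 x 2 x 3); run_trial is a placeholder.
from itertools import product

grid = {
    "n_steer": [3, 5, 7],
    "n_throttle": [2, 3],
    "learning_rate": [0.001, 0.0005, 0.0001],
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
assert len(configs) == 18

for config in configs:
    for repeat in range(3):  # 3 repeats each
        pass  # run_trial(config, seed=repeat) -- placeholder for the real sweep driver
```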
2026-04-12 — Discovery: Random Policy Bug (Critical)
Finding: Inner Loop Was Never Training
Observation: The donkeycar_sb3_runner.py was calling env.action_space.sample() instead of model.learn(). This was introduced when we removed the broken model.save() call that caused NameError: name 'model' is not defined.
Root cause: Legacy code path removal was too aggressive — removed training along with the broken save call.
Impact:
- All 300 autoresearch trials (two overnight runs) used random policy
- The learning_rate parameter was passed but completely ignored
- mean_reward values reflect random-walk quality, not RL training quality
- The GP+UCB found the best action space for random walking, not the best hyperparameters for learning
Valid salvage: The n_steer=8, n_throttle=5 finding is valid as a discretization insight.
Invalid: All learning_rate optimization in the 300-trial autoresearch runs.
Fix: Completely rebuilt runner with real PPO.learn() + evaluate_policy() + model.save().
Decision record: ADR-005 — Never call model.save() before model is defined.
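A minimal sketch of what the rebuilt inner loop now does, assuming a make_env() factory and hyperparameters supplied by the meta-controller (the authoritative code is donkeycar_sb3_runner.py):

```python
# Minimal sketch of the rebuilt runner; make_env() stands in for the project's
# wrapped DonkeyCar env factory and is an assumption, not the real code.
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def run_trial(make_env, learning_rate, n_timesteps, model_path="models/trial_model"):
    env = make_env()  # DonkeyCar env with discretization / throttle wrappers applied
    model = PPO("CnnPolicy", env, learning_rate=learning_rate, verbose=0)
    model.learn(total_timesteps=n_timesteps)          # real training, not action_space.sample()
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=3)
    model.save(model_path)                            # only called after model exists (ADR-005)
    return mean_reward, std_reward
```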
2026-04-12 — Autoresearch Infrastructure Proven
Finding: GP+UCB Autoresearch Works Correctly
Observation: The GP+UCB meta-controller correctly:
- Loads prior results and fits a Gaussian Process
- Uses UCB acquisition to balance exploration/exploitation
- Proposes parameters outside the original grid (e.g., n_steer=6 was never in the grid)
- Converges toward higher-reward regions with each trial
Evidence: After 300 trials, the top-5 consistently clustered around n_steer=7-9, n_throttle=4-5, lr≈0.002 — a coherent high-reward region.
Conclusion: The infrastructure is sound. The data was from wrong experiments, but the meta-loop works exactly as designed.
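For illustration, a minimal GP+UCB proposal step might look like the sketch below, using scikit-learn's GaussianProcessRegressor. The kernel choice, kappa value, and candidate-grid handling are assumptions, not the meta-controller's actual code.

```python
# Illustrative GP+UCB proposal step; not the project's meta-controller implementation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def propose_next(X_observed, y_observed, candidates, kappa=2.0):
    """Fit a GP to (params, reward) pairs and return the candidate with the
    highest UCB score mu + kappa * sigma."""
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.asarray(X_observed), np.asarray(y_observed))
    mu, sigma = gp.predict(np.asarray(candidates), return_std=True)
    ucb = mu + kappa * sigma
    return candidates[int(np.argmax(ucb))]
```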
2026-04-13 — Phase 1 Launch: First Real Training Attempt
Finding: Timeout — PPO+CNN is Too Slow on CPU for Large Timesteps
Observation: First Phase 1 run with real PPO training proposed 20k-30k timesteps.
At roughly 5-10 training steps/sec (PPO+CNN on CPU with the sim in the loop), 20k-30k timesteps require 2000-6000 seconds per trial — far exceeding the 600-second timeout.
Evidence: Trials 1-6 all timed out at exactly 600 seconds.
Fix: Reduced timestep search space from [5000, 30000] to [1000, 5000].
At ~15-30 steps/sec (DonkeyCar sim speed), 5000 steps ≈ 170-330 seconds. Fits within 480s timeout.
Lesson: Always calibrate timeout to actual sim + training speed before launching sweeps.
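One cheap way to do this calibration is to time a short rollout before launching the sweep and extrapolate. The sketch below is illustrative (probe length is an assumption) and measures raw env step speed only; PPO's update overhead adds time on top, so pad the estimate.

```python
# Illustrative throughput probe for calibrating per-trial timeouts.
import time

def estimate_trial_seconds(env, n_probe_steps=500, planned_timesteps=5000):
    """Run a short random rollout, measure steps/sec, and extrapolate to a full trial."""
    obs = env.reset()
    start = time.time()
    for _ in range(n_probe_steps):
        obs, reward, done, info = env.step(env.action_space.sample())
        if done:
            obs = env.reset()
    steps_per_sec = n_probe_steps / (time.time() - start)
    # Note: excludes PPO update overhead; budget extra headroom for the real timeout.
    return planned_timesteps / steps_per_sec
```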
2026-04-13 — Discovery: Car Not Moving (PPO Throttle Problem)
Observation: During early Phase 1 training, the car's steering values changed but the car did not move.
Root cause: PPO with continuous action space outputs actions in [-1, 1] for all dimensions.
DonkeyCar expects throttle ∈ [0, 1]. When PPO's random initial policy outputs throttle ≈ -0.5, it gets clipped to 0 — the car sits still.
Fix: Added ThrottleClampWrapper that ensures throttle ∈ [0.2, 1.0].
This guarantees the car always moves forward, even before any learning.
Impact: Without this fix, the car never moves and the health check detects it as a stuck sim, prematurely killing training.
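A minimal sketch of such a wrapper, assuming a gym-style env with a continuous [steering, throttle] action; the actual implementation in the project may differ.

```python
# Illustrative sketch of a throttle-clamping action wrapper (gym-style API).
import gym
import numpy as np

class ThrottleClampWrapper(gym.ActionWrapper):
    """Clamp the throttle component into [min_throttle, 1.0] so the car always
    moves forward, even under an untrained policy."""
    def __init__(self, env, min_throttle=0.2):
        super().__init__(env)
        self.min_throttle = min_throttle

    def action(self, action):
        steering, throttle = float(action[0]), float(action[1])
        throttle = float(np.clip(throttle, self.min_throttle, 1.0))
        return np.array([steering, throttle], dtype=np.float32)
```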
2026-04-13 — Critical Discovery: Reward Hacking via SpeedRewardWrapper 🚨
Finding: Model Learned to Exploit Speed Reward by Oscillating at Track Boundary
Observation: After fixing throttle and timestep issues, Phase 1 trials ran successfully.
Some trials produced suspiciously high rewards:
| Trial | mean_reward | n_throttle | lr | verdict |
|---|---|---|---|---|
| 8 | 1936.9 | 2 | 0.00145 | 🚨 HACKED |
| 13 | 1139.4 | 2 | 0.00058 | 🚨 HACKED |
| 11 | 439.9 | 3 | 0.00048 | ⚠️ Suspicious |
| 2 | 398.9 | 2 | 0.00236 | ⚠️ Suspicious |
Root cause: The SpeedRewardWrapper computed:
reward = speed × (1 - abs(cte) / max_cte)
The model discovered a policy that maximizes this formula without genuine track driving:
- Drive fast toward the track boundary
- Return to track center (momentarily low CTE = high reward)
- Repeat — "oscillation farming"
The crash penalty (-10) was insufficient to deter this because thousands of oscillation steps accumulate far more positive reward.
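For a rough sense of scale (the speed and CTE figures here are illustrative assumptions, not measurements): at an average speed of 3 m/s and an average |cte|/max_cte of 0.5 during the oscillation, the v1 formula pays about 3 × (1 − 0.5) = 1.5 per step, so 1000 oscillation steps earn ≈1500 reward — far more than a handful of −10 crash penalties can offset.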
Physical impossibility check: a car driving at max speed (≈5 m/s) while perfectly centered for 3429 steps would accumulate ≈ 5.0 × 1.0 × 3429 = 17,145 reward. The observed max was 1937 — technically within physical bounds, but the eval variance (std_reward=34 over only 3 episodes) together with the user's direct observation confirms hacking.
User observation (direct visual confirmation): "The model found a way to rig the reward by just going left — it was off the track and then back on the track."
Impact: The entire Phase 1 dataset with reward_shaping=True is corrupted.
The GP fitted on these rewards was optimizing for hacking parameters, not driving parameters.
Action taken:
- Archived all Phase 1 results: autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl
- Archived hacked models: models/ARCHIVED_reward_hacking/
- Redesigned the reward function entirely
2026-04-13 — Fix: Hack-Proof Reward Shaping Design
Finding: Multiplicative Speed Bonus Prevents Reward Hacking
Problem with the v1 formula: reward = speed × f(cte) acts as a standalone speed reward that replaces the env's own signal, so the speed term can be maximized largely independently of genuine progress — f(cte) is positive whenever the car is merely within max_cte of the center line.
Solution — multiplicative on-track bonus:
if original_reward > 0:
    shaped = original_reward * (1 + speed_scale * speed)
else:
    shaped = original_reward  # No speed bonus when off track
Why this is hack-proof:
- original_reward > 0 is ONLY true when the car is on track AND centered (DonkeyCar's own CTE signal)
- When off track, original_reward ≤ 0 — no speed reward is possible
- The model cannot increase reward by going fast off-track
- The formula is bounded: shaped ≤ original_reward × (1 + speed_scale × max_speed)
Author's insight: "Speed should only be rewarded if you are progressing down the track."
Implementation: agent/reward_wrapper.py — SpeedRewardWrapper v2.
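A minimal sketch of the v2 wrapper, assuming a gym-style env whose info dict exposes the car's speed; the class name, constructor arguments, and info key are assumptions, and the authoritative code is agent/reward_wrapper.py.

```python
# Illustrative sketch of the multiplicative speed bonus (v2); not the exact
# contents of agent/reward_wrapper.py.
import gym

class SpeedRewardWrapperV2(gym.Wrapper):
    def __init__(self, env, speed_scale=0.1):
        super().__init__(env)
        self.speed_scale = speed_scale

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        speed = float(info.get("speed", 0.0))  # assumed info key
        if reward > 0:
            # On track and centered per the base env: scale reward up with speed.
            reward = reward * (1.0 + self.speed_scale * speed)
        # reward <= 0 (off track / crash) passes through unchanged: no speed bonus.
        return obs, reward, done, info
```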
2026-04-13 — Lesson: Reward Function Design Principles
From this experience, we derived the following principles for DonkeyCar RL reward shaping:
- Never reward speed unconditionally. Speed reward must be gated on track presence.
- The original DonkeyCar reward is the ground truth. Any shaping must respect it, not replace it.
- Multiplicative bonuses are safer than additive. They can't be maximized independently.
- High variance in eval reward is a red flag. std_reward=34 on 3 episodes suggests instability.
- Physically impossible reward values signal hacking. Establish theoretical reward bounds before training.
- Low n_throttle (=2) may enable hacking. With only 2 throttle values, the model may discover degenerate oscillation policies more easily. Investigate.
Next Research Questions
- Does n_throttle=2 uniquely enable hacking? The hacked models all had n_throttle=2. With only 2 throttle states (stop/full throttle), oscillation may be easier to exploit.
- What is the minimum number of timesteps for genuine learning? The low-reward trials (5-22) may not have trained long enough. Is 3000 steps sufficient for any real driving behavior?
- Does the multiplicative reward fix change the optimal hyperparameter region? Re-run autoresearch with fixed reward and compare top configurations.
- Can we detect reward hacking automatically? A reward-per-step threshold (e.g., flag if mean reward > 2.0 per step) could auto-detect hacking during training; see the sketch after this list.
- What does a genuinely good reward look like? After completing Phase 1 cleanly, characterize the reward distribution of a car that drives one full lap.
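As a starting point for the auto-detection question above, a per-step reward check could be a simple post-trial (or callback) test. This is an untested sketch; the 2.0 threshold is the provisional value from the question, not a validated bound.

```python
# Illustrative reward-hacking detector: flag trials whose mean reward per step
# exceeds a threshold derived from the theoretical per-step bound.
def looks_hacked(episode_rewards, episode_lengths, per_step_threshold=2.0):
    """Return True if the average reward per step is suspiciously high."""
    total_reward = sum(episode_rewards)
    total_steps = sum(episode_lengths)
    if total_steps == 0:
        return False
    return (total_reward / total_steps) > per_step_threshold
```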