Research Log — DonkeyCar RL Autoresearch
Chronological research findings, discoveries, bugs, and decisions. Every significant observation is recorded here for scientific reproducibility and future reference. Format: date, finding, evidence, action taken.
2026-04-12 — Project Kickoff and Initial Infrastructure
Finding: Grid Sweep as Research Baseline
Observation: Before any autoresearch, we ran an 18-config grid sweep across:
- n_steer: [3, 5, 7]
- n_throttle: [2, 3]
- learning_rate: [0.001, 0.0005, 0.0001]
- 3 repeats each
Important caveat discovered later: This sweep used a random action policy (bug — model training code had been removed). The rewards reflect how well a random policy can stumble through different action discretizations.
Valid insight from this data: Action discretization matters even for random policy.
n_steer=7, n_throttle=2 outperformed n_steer=3, n_throttle=2 with random actions — more steering granularity helps even without learning.
Data location: outerloop-results/clean_sweep_results.jsonl (18 records)
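For reference, the 3 × 2 × 3 = 18 configurations can be enumerated in a few lines of Python. This is an illustrative sketch only; run_trial is a hypothetical placeholder, not the actual sweep driver.

```python
# Illustrative enumeration of the 18-config grid (3 x 2 x 3); run_trial is a placeholder.
from itertools import product

grid = {
    "n_steer": [3, 5, 7],
    "n_throttle": [2, 3],
    "learning_rate": [0.001, 0.0005, 0.0001],
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
assert len(configs) == 18

for config in configs:
    for repeat in range(3):  # 3 repeats each
        pass  # run_trial(config, seed=repeat) -- placeholder for the real sweep driver
```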
2026-04-12 — Discovery: Random Policy Bug (Critical)
Finding: Inner Loop Was Never Training
Observation: The donkeycar_sb3_runner.py was calling env.action_space.sample() instead of model.learn(). This was introduced when we removed the broken model.save() call that caused NameError: name 'model' is not defined.
Root cause: Legacy code path removal was too aggressive — removed training along with the broken save call.
Impact:
- All 300 autoresearch trials (two overnight runs) used random policy
- The learning_rate parameter was passed but completely ignored
- mean_reward values reflect random-walk quality, not RL training quality
- The GP+UCB found the best action space for random walking, not the best hyperparameters for learning
Valid salvage: The n_steer=8, n_throttle=5 finding is valid as a discretization insight.
Invalid: All learning_rate optimization in the 300-trial autoresearch runs.
Fix: Completely rebuilt runner with real PPO.learn() + evaluate_policy() + model.save().
Decision record: ADR-005 — Never call model.save() before model is defined.
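A minimal sketch of what the rebuilt inner loop now does, assuming a make_env() factory and hyperparameters supplied by the meta-controller (the authoritative code is donkeycar_sb3_runner.py):

```python
# Minimal sketch of the rebuilt runner; make_env() stands in for the project's
# wrapped DonkeyCar env factory and is an assumption, not the real code.
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def run_trial(make_env, learning_rate, n_timesteps, model_path="models/trial_model"):
    env = make_env()  # DonkeyCar env with discretization / throttle wrappers applied
    model = PPO("CnnPolicy", env, learning_rate=learning_rate, verbose=0)
    model.learn(total_timesteps=n_timesteps)          # real training, not action_space.sample()
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=3)
    model.save(model_path)                            # only called after model exists (ADR-005)
    return mean_reward, std_reward
```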
2026-04-12 — Autoresearch Infrastructure Proven
Finding: GP+UCB Autoresearch Works Correctly
Observation: The GP+UCB meta-controller correctly:
- Loads prior results and fits a Gaussian Process
- Uses UCB acquisition to balance exploration/exploitation
- Proposes parameters outside the original grid (e.g., n_steer=6 was never in the grid)
- Converges toward higher-reward regions with each trial
Evidence: After 300 trials, the top-5 consistently clustered around n_steer=7-9, n_throttle=4-5, lr≈0.002 — a coherent high-reward region.
Conclusion: The infrastructure is sound. The data was from wrong experiments, but the meta-loop works exactly as designed.
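For illustration, a minimal GP+UCB proposal step might look like the sketch below, using scikit-learn's GaussianProcessRegressor. The kernel choice, kappa value, and candidate-grid handling are assumptions, not the meta-controller's actual code.

```python
# Illustrative GP+UCB proposal step; not the project's meta-controller implementation.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def propose_next(X_observed, y_observed, candidates, kappa=2.0):
    """Fit a GP to (params, reward) pairs and return the candidate with the
    highest UCB score mu + kappa * sigma."""
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.asarray(X_observed), np.asarray(y_observed))
    mu, sigma = gp.predict(np.asarray(candidates), return_std=True)
    ucb = mu + kappa * sigma
    return candidates[int(np.argmax(ucb))]
```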
2026-04-13 — Phase 1 Launch: First Real Training Attempt
Finding: Timeout — PPO+CNN is Too Slow on CPU for Large Timesteps
Observation: First Phase 1 run with real PPO training proposed 20k-30k timesteps.
At roughly 5-10 training steps/sec (PPO+CNN on CPU with the sim in the loop), 20k-30k timesteps require 2000-6000 seconds per trial — far exceeding the 600-second timeout.
Evidence: Trials 1-6 all timed out at exactly 600 seconds.
Fix: Reduced timestep search space from [5000, 30000] to [1000, 5000].
At ~15-30 steps/sec (DonkeyCar sim speed), 5000 steps ≈ 170-330 seconds. Fits within 480s timeout.
Lesson: Always calibrate timeout to actual sim + training speed before launching sweeps.
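One cheap way to do this calibration is to time a short rollout before launching the sweep and extrapolate. The sketch below is illustrative (probe length is an assumption) and measures raw env step speed only; PPO's update overhead adds time on top, so pad the estimate.

```python
# Illustrative throughput probe for calibrating per-trial timeouts.
import time

def estimate_trial_seconds(env, n_probe_steps=500, planned_timesteps=5000):
    """Run a short random rollout, measure steps/sec, and extrapolate to a full trial."""
    obs = env.reset()
    start = time.time()
    for _ in range(n_probe_steps):
        obs, reward, done, info = env.step(env.action_space.sample())
        if done:
            obs = env.reset()
    steps_per_sec = n_probe_steps / (time.time() - start)
    # Note: excludes PPO update overhead; budget extra headroom for the real timeout.
    return planned_timesteps / steps_per_sec
```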
2026-04-13 — Discovery: Car Not Moving (PPO Throttle Problem)
Observation: During early Phase 1 training, the car's steering values changed but the car did not move.
Root cause: PPO with continuous action space outputs actions in [-1, 1] for all dimensions.
DonkeyCar expects throttle ∈ [0, 1]. When PPO's random initial policy outputs throttle ≈ -0.5, it gets clipped to 0 — the car sits still.
Fix: Added ThrottleClampWrapper that ensures throttle ∈ [0.2, 1.0].
This guarantees the car always moves forward, even before any learning.
Impact: Without this fix, the car never moves and the health check detects it as a stuck sim, prematurely killing training.
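A minimal sketch of such a wrapper, assuming a gym-style env with a continuous [steering, throttle] action; the actual implementation in the project may differ.

```python
# Illustrative sketch of a throttle-clamping action wrapper (gym-style API).
import gym
import numpy as np

class ThrottleClampWrapper(gym.ActionWrapper):
    """Clamp the throttle component into [min_throttle, 1.0] so the car always
    moves forward, even under an untrained policy."""
    def __init__(self, env, min_throttle=0.2):
        super().__init__(env)
        self.min_throttle = min_throttle

    def action(self, action):
        steering, throttle = float(action[0]), float(action[1])
        throttle = float(np.clip(throttle, self.min_throttle, 1.0))
        return np.array([steering, throttle], dtype=np.float32)
```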
2026-04-13 — Critical Discovery: Reward Hacking via SpeedRewardWrapper 🚨
Finding: Model Learned to Exploit Speed Reward by Oscillating at Track Boundary
Observation: After fixing throttle and timestep issues, Phase 1 trials ran successfully.
Some trials produced suspiciously high rewards:
| Trial | mean_reward | n_throttle | lr | verdict |
|---|---|---|---|---|
| 8 | 1936.9 | 2 | 0.00145 | 🚨 HACKED |
| 13 | 1139.4 | 2 | 0.00058 | 🚨 HACKED |
| 11 | 439.9 | 3 | 0.00048 | ⚠️ Suspicious |
| 2 | 398.9 | 2 | 0.00236 | ⚠️ Suspicious |
Root cause: The SpeedRewardWrapper computed:
reward = speed × (1 - abs(cte) / max_cte)
The model discovered a policy that maximizes this formula without genuine track driving:
- Drive fast toward the track boundary
- Return to track center (momentarily low CTE = high reward)
- Repeat — "oscillation farming"
The crash penalty (-10) was insufficient to deter this because thousands of oscillation steps accumulate far more positive reward.
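For a rough sense of scale (the speed and CTE figures here are illustrative assumptions, not measurements): at an average speed of 3 m/s and an average |cte|/max_cte of 0.5 during the oscillation, the v1 formula pays about 3 × (1 − 0.5) = 1.5 per step, so 1000 oscillation steps earn ≈1500 reward — far more than a handful of −10 crash penalties can offset.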
Physical impossibility check: a car driving at max speed (≈5 m/s) while perfectly centered for 3429 steps would accumulate ≈ 5.0 × 1.0 × 3429 = 17,145 reward. The observed max was 1937 — technically within physical bounds, but the eval variance (std_reward=34 over only 3 episodes) together with the user's direct observation confirms hacking.
User observation (direct visual confirmation): "The model found a way to rig the reward by just going left — it was off the track and then back on the track."
Impact: The entire Phase 1 dataset with reward_shaping=True is corrupted.
The GP fitted on these rewards was optimizing for hacking parameters, not driving parameters.
Action taken:
- Archived all Phase 1 results: autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl
- Archived hacked models: models/ARCHIVED_reward_hacking/
- Redesigned the reward function entirely
2026-04-13 — Fix: Hack-Proof Reward Shaping Design
Finding: Multiplicative Speed Bonus Prevents Reward Hacking
Problem with the v1 formula: reward = speed × f(cte) acts as a standalone speed reward that replaces the env's own signal, so the speed term can be maximized largely independently of genuine progress — f(cte) is positive whenever the car is merely within max_cte of the center line.
Solution — multiplicative on-track bonus:
if original_reward > 0:
    shaped = original_reward * (1 + speed_scale * speed)
else:
    shaped = original_reward  # No speed bonus when off track
Why this is hack-proof:
- original_reward > 0 is ONLY true when the car is on track AND centered (DonkeyCar's own CTE signal)
- When off track, original_reward ≤ 0 — no speed reward is possible
- The model cannot increase reward by going fast off-track
- The formula is bounded: shaped ≤ original_reward × (1 + speed_scale × max_speed)
Author's insight: "Speed should only be rewarded if you are progressing down the track."
Implementation: agent/reward_wrapper.py — SpeedRewardWrapper v2.
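A minimal sketch of the v2 wrapper, assuming a gym-style env whose info dict exposes the car's speed; the class name, constructor arguments, and info key are assumptions, and the authoritative code is agent/reward_wrapper.py.

```python
# Illustrative sketch of the multiplicative speed bonus (v2); not the exact
# contents of agent/reward_wrapper.py.
import gym

class SpeedRewardWrapperV2(gym.Wrapper):
    def __init__(self, env, speed_scale=0.1):
        super().__init__(env)
        self.speed_scale = speed_scale

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        speed = float(info.get("speed", 0.0))  # assumed info key
        if reward > 0:
            # On track and centered per the base env: scale reward up with speed.
            reward = reward * (1.0 + self.speed_scale * speed)
        # reward <= 0 (off track / crash) passes through unchanged: no speed bonus.
        return obs, reward, done, info
```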
2026-04-13 — Lesson: Reward Function Design Principles
From this experience, we derived the following principles for DonkeyCar RL reward shaping:
- Never reward speed unconditionally. Speed reward must be gated on track presence.
- The original DonkeyCar reward is the ground truth. Any shaping must respect it, not replace it.
- Multiplicative bonuses are safer than additive. They can't be maximized independently.
- High variance in eval reward is a red flag. std_reward=34 on 3 episodes suggests instability.
- Physically impossible reward values signal hacking. Establish theoretical reward bounds before training.
- Low n_throttle (=2) may enable hacking. With only 2 throttle values, the model may discover degenerate oscillation policies more easily. Investigate.
Next Research Questions
- Does n_throttle=2 uniquely enable hacking? The hacked models all had n_throttle=2. With only 2 throttle states (stop/full throttle), oscillation may be easier to exploit.
- What is the minimum number of timesteps for genuine learning? The low-reward trials (5-22) may not have trained long enough. Is 3000 steps sufficient for any real driving behavior?
- Does the multiplicative reward fix change the optimal hyperparameter region? Re-run autoresearch with fixed reward and compare top configurations.
- Can we detect reward hacking automatically? A reward-per-step threshold (e.g., flag if mean reward > 2.0 per step) could auto-detect hacking during training; see the sketch after this list.
- What does a genuinely good reward look like? After completing Phase 1 cleanly, characterize the reward distribution of a car that drives one full lap.
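As a starting point for the auto-detection question above, a per-step reward check could be a simple post-trial (or callback) test. This is an untested sketch; the 2.0 threshold is the provisional value from the question, not a validated bound.

```python
# Illustrative reward-hacking detector: flag trials whose mean reward per step
# exceeds a threshold derived from the theoretical per-step bound.
def looks_hacked(episode_rewards, episode_lengths, per_step_threshold=2.0):
    """Return True if the average reward per step is suspiciously high."""
    total_reward = sum(episode_rewards)
    total_steps = sum(episode_lengths)
    if total_steps == 0:
        return False
    return (total_reward / total_steps) > per_step_threshold
```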