# Research Log — DonkeyCar RL Autoresearch
> Chronological research findings, discoveries, bugs, and decisions.
> Every significant observation is recorded here for scientific reproducibility and future reference.
> Format: date, finding, evidence, action taken.
---
## 2026-04-12 — Project Kickoff and Initial Infrastructure
### Finding: Grid Sweep as Research Baseline
**Observation:** Before any autoresearch, we ran an 18-config grid sweep across:
- `n_steer`: [3, 5, 7]
- `n_throttle`: [2, 3]
- `learning_rate`: [0.001, 0.0005, 0.0001]
- 3 repeats each
**Important caveat discovered later:** This sweep used a **random action policy** (bug — model training code had been removed). The rewards reflect how well a random policy can stumble through different action discretizations.
**Valid insight from this data:** Action discretization matters even for random policy.
`n_steer=7, n_throttle=2` outperformed `n_steer=3, n_throttle=2` with random actions — more steering granularity helps even without learning.
**Data location:** `outerloop-results/clean_sweep_results.jsonl` (18 records)
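
For reference, a minimal sketch of how this 3 × 2 × 3 grid (18 configs, 3 repeats each) expands; the enumeration style and the `run_trial` name are illustrative, not the actual sweep script:

```python
# Enumerate the 3 x 2 x 3 = 18 baseline configurations, 3 repeats each.
# `run_trial` is a hypothetical stand-in for the real runner entry point.
from itertools import product

grid = {
    "n_steer": [3, 5, 7],
    "n_throttle": [2, 3],
    "learning_rate": [0.001, 0.0005, 0.0001],
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
assert len(configs) == 18

for config in configs:
    for repeat in range(3):
        print(config, "repeat", repeat)  # a real sweep would call run_trial(**config, seed=repeat)
```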
---
## 2026-04-12 — Discovery: Random Policy Bug (Critical)
### Finding: Inner Loop Was Never Training
**Observation:** The `donkeycar_sb3_runner.py` was calling `env.action_space.sample()` instead of `model.learn()`. This was introduced when we removed the broken `model.save()` call that caused `NameError: name 'model' is not defined`.
**Root cause:** Legacy code path removal was too aggressive — removed training along with the broken save call.
**Impact:**
- All 300 autoresearch trials (two overnight runs) used random policy
- `learning_rate` parameter was passed but completely ignored
- `mean_reward` values reflect random-walk quality, not RL training quality
- The GP+UCB found the best *action space for random walking*, not the best *hyperparameters for learning*
**Valid salvage:** The `n_steer=8, n_throttle=5` finding is valid as a discretization insight.
**Invalid:** All learning_rate optimization in the 300-trial autoresearch runs.
**Fix:** Completely rebuilt runner with real `PPO.learn()` + `evaluate_policy()` + `model.save()`.
**Decision record:** ADR-005 — Never call model.save() before model is defined.
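
As a reference for the fix, a minimal sketch of the rebuilt inner loop using stable-baselines3; the function name, argument names, and `CnnPolicy` choice are illustrative assumptions, not the actual `donkeycar_sb3_runner.py` code:

```python
# Illustrative inner-loop sketch (not the actual donkeycar_sb3_runner.py):
# train, evaluate, then save, so `model` always exists before model.save()
# is called (see ADR-005).
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def run_trial(env, learning_rate, total_timesteps, model_path):
    model = PPO("CnnPolicy", env, learning_rate=learning_rate, verbose=0)
    model.learn(total_timesteps=total_timesteps)  # real training, not env.action_space.sample()
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=3)
    model.save(model_path)  # only reached after training succeeds
    return {"mean_reward": mean_reward, "std_reward": std_reward}
```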
---
## 2026-04-12 — Autoresearch Infrastructure Proven
### Finding: GP+UCB Autoresearch Works Correctly
**Observation:** The GP+UCB meta-controller correctly:
- Loads prior results and fits a Gaussian Process
- Uses UCB acquisition to balance exploration/exploitation
- Proposes parameters outside the original grid (e.g., `n_steer=6` was never in grid)
- Converges toward higher-reward regions with each trial
**Evidence:** After 300 trials, the top-5 consistently clustered around `n_steer=7-9, n_throttle=4-5, lr≈0.002` — a coherent high-reward region.
**Conclusion:** The infrastructure is sound. The data was from wrong experiments, but the meta-loop works exactly as designed.
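
For context, a minimal sketch of a GP+UCB proposal step of this kind; the Matern kernel, `kappa`, and random candidate sampling are illustrative assumptions about the meta-controller, not its actual code:

```python
# Illustrative GP+UCB proposal step: fit a GP to (params, reward) history,
# score random candidates with UCB = mean + kappa * std, return the best.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def propose_next(history_X, history_y, bounds, kappa=2.0, n_candidates=1000, seed=0):
    rng = np.random.default_rng(seed)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.asarray(history_X), np.asarray(history_y))

    # bounds: array of shape (n_params, 2) with [low, high] per parameter.
    candidates = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_candidates, bounds.shape[0]))

    mean, std = gp.predict(candidates, return_std=True)
    ucb = mean + kappa * std            # larger kappa favours exploration
    return candidates[np.argmax(ucb)]   # continuous proposal, may fall outside the grid
```

Because candidates are drawn from continuous bounds rather than the original grid, off-grid proposals like `n_steer=6` are expected behavior.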
---
## 2026-04-13 — Phase 1 Launch: First Real Training Attempt
### Finding: Timeout — PPO+CNN is Too Slow on CPU for Large Timesteps
**Observation:** First Phase 1 run with real PPO training proposed 20k-30k timesteps.
At ~5-10 steps/sec (PPO+CNN training on CPU), this requires 2000-6000 seconds per trial — far exceeding the 600-second timeout.
**Evidence:** Trials 1-6 all timed out at exactly 600 seconds.
**Fix:** Reduced timestep search space from [5000, 30000] to [1000, 5000].
At ~15-30 steps/sec (DonkeyCar sim speed), 5000 steps ≈ 170-330 seconds. Fits within 480s timeout.
**Lesson:** Always calibrate timeout to actual sim + training speed before launching sweeps.
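
The lesson above boils down to a pre-launch arithmetic check along these lines (the helper itself is illustrative):

```python
# Pre-launch sanity check: estimate trial wall time from measured steps/sec.
def fits_timeout(total_timesteps, steps_per_sec, timeout_sec):
    return total_timesteps / steps_per_sec <= timeout_sec

assert not fits_timeout(20_000, 5.0, 600)  # old search space at CPU training speed: ~4000 s
assert fits_timeout(5_000, 15.0, 480)      # new search space at sim speed: ~333 s
```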
---
## 2026-04-13 — Discovery: Car Not Moving (PPO Throttle Problem)
**Observation:** During early Phase 1 training, the car's steering values changed but the car did not move.
**Root cause:** PPO with continuous action space outputs actions in `[-1, 1]` for all dimensions.
DonkeyCar expects `throttle ∈ [0, 1]`. When PPO's random initial policy outputs throttle ≈ -0.5, it gets clipped to 0 — the car sits still.
**Fix:** Added `ThrottleClampWrapper` that ensures throttle ∈ [0.2, 1.0].
This guarantees the car always moves forward, even before any learning.
**Impact:** Without this fix, the car never moves and the health check detects it as a stuck sim, prematurely killing training.
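
A minimal sketch of what such a wrapper can look like, assuming the classic `gym` API and an action vector laid out as `[steering, throttle]`; the real `ThrottleClampWrapper` may differ in details:

```python
# Illustrative throttle-clamping action wrapper. Assumes actions are
# [steering, throttle]; clamping throttle to [0.2, 1.0] guarantees forward
# motion even under PPO's untrained policy, which outputs values in [-1, 1].
import gym
import numpy as np

class ThrottleClampWrapper(gym.ActionWrapper):
    def __init__(self, env, min_throttle=0.2, max_throttle=1.0):
        super().__init__(env)
        self.min_throttle = min_throttle
        self.max_throttle = max_throttle

    def action(self, action):
        action = np.array(action, dtype=np.float32, copy=True)
        action[1] = np.clip(action[1], self.min_throttle, self.max_throttle)
        return action
```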
---
## 2026-04-13 — Critical Discovery: Reward Hacking via SpeedRewardWrapper 🚨
### Finding: Model Learned to Exploit Speed Reward by Oscillating at Track Boundary
**Observation:** After fixing throttle and timestep issues, Phase 1 trials ran successfully.
Some trials produced suspiciously high rewards:
| Trial | mean_reward | n_throttle | lr | verdict |
|-------|-------------|------------|--------|---------|
| 8 | **1936.9** | 2 | 0.00145 | 🚨 HACKED |
| 13 | **1139.4** | 2 | 0.00058 | 🚨 HACKED |
| 11 | 439.9 | 3 | 0.00048 | ⚠️ Suspicious |
| 2 | 398.9 | 2 | 0.00236 | ⚠️ Suspicious |
**Root cause:** The `SpeedRewardWrapper` computed:
```
reward = speed × (1 - abs(cte) / max_cte)
```
The model discovered a policy that **maximizes this formula without genuine track driving**:
1. Drive fast toward the track boundary
2. Return to track center (momentarily low CTE = high reward)
3. Repeat — "oscillation farming"
The crash penalty (`-10`) was insufficient to deter this because thousands of oscillation steps accumulate far more positive reward.
**Physical impossibility check:** A car driving at max speed (≈5 m/s) perfectly centered for 3429 steps would accumulate ≈ `5.0 × 1.0 × 3429 = 17,145`. Observed max was 1937 — so technically possible but the high variance (`std_reward=34`) across only 3 eval episodes and the user's direct observation confirm hacking.
**User observation (direct visual confirmation):** "The model found a way to rig the reward by just going left — it was off the track and then back on the track."
**Impact:** The entire Phase 1 dataset with `reward_shaping=True` is corrupted.
The GP fitted on these rewards was optimizing for hacking parameters, not driving parameters.
**Action taken:**
- Archived all Phase 1 results: `autoresearch_results_phase1_CORRUPTED_reward_hacking.jsonl`
- Archived hacked models: `models/ARCHIVED_reward_hacking/`
- Redesigned reward function entirely
---
## 2026-04-13 — Fix: Hack-Proof Reward Shaping Design
### Finding: Multiplicative Speed Bonus Prevents Reward Hacking
**Problem with the v1 formula:** `reward = speed × f(cte)` rewards speed whenever CTE is momentarily low, so it can be maximized without genuine progress down the track.
**Solution — multiplicative on-track bonus:**
```python
if original_reward > 0:
    # On track and centered: scale the ground-truth reward up with speed
    shaped = original_reward * (1 + speed_scale * speed)
else:
    shaped = original_reward  # No speed bonus when off track
```
**Why this is hack-proof:**
- `original_reward > 0` is ONLY true when the car is on track AND centered (DonkeyCar's own CTE signal)
- When off track, `original_reward ≤ 0` — no speed reward possible
- The model cannot increase reward by going fast off-track
- The formula is bounded: `shaped ≤ original_reward × (1 + speed_scale × max_speed)`
**Author's insight:** "Speed should only be rewarded if you are progressing down the track."
**Implementation:** `SpeedRewardWrapper` v2 in `agent/reward_wrapper.py`.
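
Put together as a wrapper, the v2 design looks roughly like the sketch below; reading speed from the sim's `info` dict and the default `speed_scale` are assumptions about the env interface, so the actual `agent/reward_wrapper.py` may differ:

```python
# Illustrative sketch of the v2 multiplicative on-track speed bonus
# (not the exact agent/reward_wrapper.py). Assumes the classic gym API
# and that the sim reports speed via the `info` dict.
import gym

class SpeedRewardWrapper(gym.Wrapper):
    def __init__(self, env, speed_scale=0.1):
        super().__init__(env)
        self.speed_scale = speed_scale

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        speed = info.get("speed", 0.0)
        if reward > 0:
            # On track and centered: scale the ground-truth reward with speed.
            reward = reward * (1.0 + self.speed_scale * speed)
        # Off track (reward <= 0): original reward passes through unchanged,
        # so driving fast off-track can never pay.
        return obs, reward, done, info
```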
---
## 2026-04-13 — Lesson: Reward Function Design Principles
From this experience, we derived the following principles for DonkeyCar RL reward shaping:
1. **Never reward speed unconditionally.** Speed reward must be gated on track presence.
2. **The original DonkeyCar reward is the ground truth.** Any shaping must respect it, not replace it.
3. **Bonuses that multiply the ground-truth reward are safer than standalone terms.** They cannot be maximized independently of genuine driving.
4. **High variance in eval reward is a red flag.** `std_reward=34` on 3 episodes suggests instability.
5. **Physically impossible reward values signal hacking.** Establish theoretical reward bounds before training.
6. **Low `n_throttle` (=2) may enable hacking.** With only 2 throttle values, the model may discover degenerate oscillation policies more easily. Investigate.
---
## Next Research Questions
1. **Does `n_throttle=2` uniquely enable hacking?** The hacked models all had `n_throttle=2`. With only 2 throttle states (stop/full-throttle), oscillation may be easier to exploit.
2. **What is the minimum timestep for genuine learning?** The low-reward trials (5-22) may not have trained long enough. Is 3000 steps sufficient for any real driving behavior?
3. **Does the multiplicative reward fix change the optimal hyperparameter region?** Re-run autoresearch with fixed reward and compare top configurations.
4. **Can we detect reward hacking automatically?** A reward-per-step threshold (e.g., flag if mean > 2.0 per step) could auto-detect hacking during training; a minimal sketch follows this list.
5. **What does a genuinely good reward look like?** After completing Phase 1 cleanly, characterize the reward distribution of a car that drives one full lap.
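
For question 4, a minimal sketch of the proposed reward-per-step check; the 2.0 threshold comes from the question above, and the episode-statistics interface is an assumption:

```python
# Illustrative auto-detection check for reward hacking: flag a trial whose
# mean reward per step exceeds a threshold (2.0 here, per question 4).
def flag_possible_hacking(episode_rewards, episode_lengths, threshold=2.0):
    total_steps = sum(episode_lengths)
    if total_steps == 0:
        return False
    return sum(episode_rewards) / total_steps > threshold
```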