# Project Specification — DonkeyCar RL Autoresearch
**Version:** 1.0.0
**Date:** 2026-04-13
**Owner:** paulh
**Status:** Active

---
## 1. Project Overview
### What are we building?
An end-to-end autonomous research and training system for DonkeyCar reinforcement learning agents. The system:
1. Trains DQN/PPO RL agents in the DonkeyCar simulator using Stable-Baselines3
2. Saves the best-performing models to disk after every training run
3. Uses a Gaussian Process + UCB Bayesian autoresearch controller to intelligently propose and evaluate new hyperparameter configurations — learning from every run
4. Produces a champion model capable of driving a DonkeyCar on any track at maximum speed with minimum cross-track error
The project replaces manual hyperparameter tuning and random grid sweeps with a self-directing autoresearch loop that gets smarter with each trial.
### Why does it matter?
Manual hyperparameter search for RL is slow, expensive, and non-systematic. The DonkeyCar task (fast, stable lap driving generalizable across tracks) requires careful tuning of the action space, reward function, and learning parameters. A Bayesian autoresearch loop:
- Finds better configurations than grid search with fewer trials
- Discovers non-obvious parameter regions (e.g., n_steer=8, n_throttle=5 emerged from autoresearch, not from the grid)
- Creates a reproducible, logged, version-controlled research artifact
- Enables unattended overnight experimentation with full observability
### Success Criteria
- [ ] Inner loop trains a real PPO/DQN model for a configurable number of timesteps and saves the best model to disk
- [ ] Autoresearch controller proposes hyperparameters using GP+UCB and evaluates trained models (not random policy)
- [ ] Champion model (highest eval reward across all trials) is saved separately and can be loaded for demonstration
- [ ] Champion model can complete at least one lap on the training track with mean_reward > 100
- [ ] Champion model generalizes to at least one unseen track (mean_reward > 50 on eval track)
- [ ] All results are logged, versioned, and pushed to Gitea automatically
- [ ] System can run unattended overnight with zero hangs or zombie processes
- [ ] Full documentation exists: PRD, architecture, decisions, implementation plan, evals
---
## 2. Technical Foundation
### Tech stack
- **Language:** Python 3.10
- **RL Framework:** Stable-Baselines3 (SB3) — PPO and DQN
- **Simulator:** DonkeyCar Gym (gym_donkeycar) running locally on port 9091
- **Gym Interface:** Gymnasium (gymnasium)
- **Surrogate Model:** Pure numpy Gaussian Process (TinyGP — no sklearn required)
- **Action Wrapper:** Custom DiscretizedActionWrapper (discretize_action.py)
- **Version Control:** Git + Gitea (https://paje.ca/git/paulh/donkeycar-rl-autoresearch)
- **Test Framework:** pytest
- **Logging:** JSON Lines (JSONL) + human-readable log files
### Project Structure
```
donkeycar-rl-autoresearch/
├── AGENT.md                     ← Agent instructions (this harness)
├── PROJECT-SPEC.md              ← This file
├── DECISIONS.md                 ← Architecture Decision Records
├── IMPLEMENTATION_PLAN.md       ← Master task backlog
├── README.md                    ← Project overview
├── .gitignore
├── .harness/
│   ├── EXECUTION_MASTER.md      ← Wave/stream dashboard
│   ├── templates/               ← Harness templates
│   ├── regression-baselines/    ← Saved eval baselines
│   └── <stream-name>/
│       ├── execution-board.md
│       ├── process-eval.md
│       └── validation/
├── agent/
│   ├── autoresearch_controller.py   ← GP+UCB autoresearch loop
│   ├── donkeycar_sb3_runner.py      ← Inner loop: real training + model save
│   ├── donkeycar_outer_loop.py      ← Grid sweep (legacy baseline)
│   ├── discretize_action.py         ← Action space wrapper
│   ├── outerloop-results/
│   │   ├── clean_sweep_results.jsonl    ← Base sweep data (18 records)
│   │   ├── autoresearch_results.jsonl   ← Autoresearch trial results
│   │   └── autoresearch_log.txt         ← Human-readable autoresearch log
│   └── models/
│       ├── champion/            ← Best model across all trials
│       └── trial-<N>/           ← Per-trial saved models
└── tests/
    ├── test_discretize_action.py
    ├── test_autoresearch_controller.py
    └── test_runner_integration.py
```
### Build & Test Commands
```bash
# Run all tests
cd /home/paulh/projects/donkeycar-rl-autoresearch
python3 -m pytest tests/ -v
# Run autoresearch controller (requires sim running on port 9091)
cd agent && python3 autoresearch_controller.py --trials 50
# Run single training trial manually
cd agent && python3 donkeycar_sb3_runner.py --agent ppo --timesteps 10000 --eval-episodes 5
# Check Gitea push
cd /home/paulh/projects/donkeycar-rl-autoresearch && git push
```
### Coding Standards
- All output uses `flush=True` for real-time log visibility
- Every process must call `env.close()` and `time.sleep(2)` before exit (proven zombie prevention)
- All results are appended to JSONL files — never overwritten
- Model saves use `model.save(path)` from SB3 standard API
- Champion model tracking: autoresearch writes `champion_model_path` to results JSONL
- No `model.save()` calls on undefined variables — always check model exists before saving
- Python only — no TypeScript, no Node
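The close-and-cooldown standard above fits in one helper. A minimal sketch (the `shutdown` helper name is illustrative, not existing project code):

```python
import sys
import time

def shutdown(env, exit_code=0):
    """Close the sim connection, cool down, then exit with the given code."""
    try:
        env.close()  # release the simulator socket (zombie prevention)
    finally:
        time.sleep(2)  # give the sim time to tear down before the next job
    sys.exit(exit_code)
```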
---
## 3. Requirements
### Functional Requirements
#### FR-001: Real RL Training in Inner Loop
**Description:** The inner RL runner (`donkeycar_sb3_runner.py`) must actually train a PPO or DQN model using `model.learn(total_timesteps=N)`, not run random actions.
**Acceptance criteria:**
- [ ] Given `--agent ppo --timesteps 10000`, the runner trains a PPO model for 10000 steps
- [ ] Training uses the `learning_rate` argument passed from the autoresearch controller
- [ ] Training uses the discretized action space (n_steer, n_throttle) when DQN is used
- [ ] PPO runs with continuous actions (no discretization needed)
- [ ] Training completes without hanging and exits with code 0
#### FR-002: Model Saving
**Description:** After each training run, the trained model is saved to disk.
**Acceptance criteria:**
- [ ] Model saved to `agent/models/trial-<N>/model.zip` after every successful run
- [ ] If eval reward is the best seen so far, model is also copied to `agent/models/champion/model.zip`
- [ ] Save path is logged to the JSONL results file
- [ ] Model can be loaded with `PPO.load()` or `DQN.load()` for subsequent evaluation
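A minimal sketch of this save flow using the standard SB3 `model.save()` API; the helper name and champion-promotion details are illustrative:

```python
import shutil
from pathlib import Path

def save_trial_model(model, trial_id: int, mean_reward: float, best_reward: float) -> str:
    """Save the trial model; promote a copy to champion if it beats the best so far."""
    trial_dir = Path(f"agent/models/trial-{trial_id:03d}")
    trial_dir.mkdir(parents=True, exist_ok=True)
    model_path = trial_dir / "model.zip"
    model.save(model_path)  # SB3 serializes policy + weights to a zip

    if mean_reward > best_reward:  # FR-005 promotion rule
        champion_dir = Path("agent/models/champion")
        champion_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(model_path, champion_dir / "model.zip")
    return str(model_path)
```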
#### FR-003: Real Policy Evaluation
**Description:** After training, the model is evaluated using the learned policy (not random actions).
**Acceptance criteria:**
- [ ] `evaluate_policy(model, env, n_eval_episodes=N)` is used for evaluation
- [ ] Mean reward and std reward are both recorded
- [ ] Evaluation uses the same action wrapper as training
- [ ] Per-episode rewards are printed for full observability
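SB3's `evaluate_policy` supports the per-episode observability criterion directly via its `return_episode_rewards` flag. A sketch, assuming `model` and `env` are the trained model and the wrapped eval environment:

```python
import statistics
from stable_baselines3.common.evaluation import evaluate_policy

# return_episode_rewards=True yields per-episode reward/length lists
episode_rewards, episode_lengths = evaluate_policy(
    model, env, n_eval_episodes=5, return_episode_rewards=True
)
for i, r in enumerate(episode_rewards):
    print(f"eval episode {i}: reward={r:.2f}", flush=True)

mean_reward = statistics.mean(episode_rewards)
std_reward = statistics.pstdev(episode_rewards)
```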
#### FR-004: Autoresearch GP+UCB Controller
**Description:** The autoresearch controller proposes hyperparameters using Gaussian Process + UCB acquisition, learning from prior results.
**Acceptance criteria:**
- [ ] Controller loads ALL prior results (base sweep + autoresearch history) at startup
- [ ] GP is fit on encoded (normalized) parameter vectors and corresponding eval rewards
- [ ] UCB acquisition = GP mean + kappa * GP std
- [ ] Next trial parameters maximize UCB over N_CANDIDATES random samples
- [ ] Controller logs top-5 UCB candidates before each trial
- [ ] Controller correctly handles first 2 trials (insufficient data for GP — uses random sampling)
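A sketch of the acquisition step described above. The function names are illustrative, and `gp.predict` is assumed to return the posterior mean and standard deviation:

```python
import numpy as np

def propose_next(gp, sample_candidates, kappa=2.0, n_candidates=5000):
    """Pick the candidate that maximizes UCB = mu + kappa * sigma."""
    X_cand = sample_candidates(n_candidates)  # [n_candidates, n_params], normalized
    mu, sigma = gp.predict(X_cand)            # GP posterior over candidates
    ucb = mu + kappa * sigma                  # explore/exploit trade-off
    top5 = np.argsort(ucb)[-5:][::-1]         # logged before each trial (FR-004)
    return X_cand[np.argmax(ucb)], X_cand[top5]
```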
#### FR-005: Champion Model Tracking
**Description:** The system maintains a single "champion" model — the best-performing model across all trials.
**Acceptance criteria:**
- [ ] After each trial, if `mean_reward > current_best`, the model is saved as champion
- [ ] Champion metadata (params, reward, trial number, timestamp) saved to `agent/models/champion/manifest.json` (schema in §4)
- [ ] Champion model path is stable: `agent/models/champion/model.zip`
- [ ] Champion can be loaded and demonstrated without retraining
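A sketch of the manifest write (helper name illustrative; fields match the schema in §4):

```python
import json
from datetime import datetime
from pathlib import Path

def write_champion_manifest(trial_id, params, mean_reward):
    """Record the new champion's metadata next to the champion model."""
    manifest = {
        "trial": trial_id,
        "timestamp": datetime.now().isoformat(),
        "params": params,
        "mean_reward": mean_reward,
        "model_path": "agent/models/champion/model.zip",  # stable path per FR-005
    }
    Path("agent/models/champion/manifest.json").write_text(json.dumps(manifest, indent=2))
```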
#### FR-006: Speed-Aware Reward Shaping
**Description:** The reward function incentivizes speed, not just staying on track.
**Acceptance criteria:**
- [ ] Custom reward wrapper computes: `reward = speed * (1 - abs(cte) / max_cte)`
- [ ] Speed and CTE values are accessible from the DonkeyCar info dict
- [ ] Reward wrapper is optional (enabled via `--reward-shaping` flag)
- [ ] Without flag, default DonkeyCar reward is used unchanged
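A sketch of the shaping wrapper, assuming the DonkeyCar info dict exposes `speed` and `cte` keys; the class name is illustrative, and `max_cte=8.0` matches the episode-termination threshold noted in §6:

```python
import gymnasium as gym

class SpeedCteRewardWrapper(gym.Wrapper):
    """reward = speed * (1 - |cte| / max_cte): rewards driving fast AND centered."""

    def __init__(self, env, max_cte=8.0):
        super().__init__(env)
        self.max_cte = max_cte

    def step(self, action):
        obs, _, terminated, truncated, info = self.env.step(action)
        # High speed with low cross-track error maximizes the shaped reward
        shaped = info["speed"] * (1.0 - abs(info["cte"]) / self.max_cte)
        return obs, shaped, terminated, truncated, info
```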
#### FR-007: Multi-Track Generalization Evaluation
**Description:** The champion model is evaluated on at least one track it was NOT trained on.
**Acceptance criteria:**
- [ ] Evaluation script accepts `--track` argument to specify evaluation track
- [ ] Champion model is loaded and evaluated for N episodes on the specified track
- [ ] Results (mean_reward, per-episode rewards) are logged
- [ ] Generalization gap (train_reward - eval_reward) is reported
#### FR-008: Autoresearch Results Logging
**Description:** Every trial produces a complete, structured result record.
**Acceptance criteria:**
- [ ] JSONL record includes: trial_id, timestamp, params, mean_reward, std_reward, model_path, champion_flag, elapsed_sec, run_status
- [ ] Autoresearch log (human-readable) is updated after every trial
- [ ] Results file is never truncated — only appended
- [ ] Results are pushed to Gitea after every N trials (configurable, default 10)
#### FR-009: Unattended Overnight Operation
**Description:** The system runs for 100+ trials without hanging, zombie processes, or data loss.
**Acceptance criteria:**
- [ ] Every job calls `env.close()` before exit
- [ ] 2-second cooldown between jobs prevents race conditions
- [ ] Stale process kill (`pkill -9 -f donkeycar_sb3_runner.py`) before each new job
- [ ] 6-minute timeout per job — killed and logged if exceeded
- [ ] System auto-resumes from existing results if restarted mid-sweep
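A sketch of the per-job isolation above (function name illustrative); `subprocess.run` kills the child process when the timeout expires:

```python
import subprocess
import time

def run_trial(cmd, timeout_sec=360):
    """Run one training job with stale-process kill, hard timeout, and cooldown."""
    # Kill any stale runner left over from a previous job
    subprocess.run(["pkill", "-9", "-f", "donkeycar_sb3_runner.py"], check=False)
    try:
        result = subprocess.run(cmd, timeout=timeout_sec)  # 6-minute hard cap
        status = "ok" if result.returncode == 0 else f"exit_{result.returncode}"
    except subprocess.TimeoutExpired:
        status = "timeout"  # logged; the autoresearch loop continues
    time.sleep(2)  # cooldown before the next job (MUST NOT remove)
    return status
```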
#### FR-010: Test Suite
**Description:** Core logic is covered by automated tests that don't require the simulator.
**Acceptance criteria:**
- [ ] `test_discretize_action.py` — tests action space wrapping correctness
- [ ] `test_autoresearch_controller.py` — tests GP fitting, UCB computation, param encoding/decoding
- [ ] `test_runner_integration.py` — mocked simulator test of training + save + eval cycle
- [ ] All tests pass with `pytest tests/ -v`
- [ ] No tests require a running simulator
### Non-Functional Requirements
#### NFR-001: Performance
- [ ] Each training trial completes in < 6 minutes for 10000 timesteps
- [ ] GP fitting on 300 data points completes in < 2 seconds
- [ ] System does not consume > 8GB RAM per trial
#### NFR-002: Robustness
- [ ] Zero hanging jobs across 100 consecutive trials
- [ ] All errors are caught, logged, and do not crash the autoresearch loop
- [ ] System correctly handles sim disconnection and logs the failure
#### NFR-003: Reproducibility
- [ ] All results are version-controlled in Gitea
- [ ] Every trial records the exact parameters used
- [ ] Results are deterministic given the same seed (seed support in runner)
#### NFR-004: Observability
- [ ] Real-time per-step reward printing during training and evaluation
- [ ] Per-trial summary logged to both console and file
- [ ] Running champion summary printed after every trial
---
## 4. Data Model
### Trial Result Record (JSONL)
```json
{
  "trial": 42,
  "timestamp": "2026-04-13T03:14:15.926535",
  "params": {
    "agent": "ppo",
    "n_steer": 7,
    "n_throttle": 3,
    "learning_rate": 0.0003,
    "timesteps": 10000,
    "eval_episodes": 5,
    "reward_shaping": false
  },
  "mean_reward": 127.45,
  "std_reward": 18.3,
  "model_path": "agent/models/trial-042/model.zip",
  "champion": true,
  "elapsed_sec": 187.4,
  "run_status": "ok"
}
```
### Champion Manifest (`agent/models/champion/manifest.json`)
```json
{
  "trial": 42,
  "timestamp": "2026-04-13T03:14:15.926535",
  "params": { "..." },
  "mean_reward": 127.45,
  "model_path": "agent/models/champion/model.zip"
}
```
### GP State (in-memory, rebuilt each iteration from JSONL)
```
X: [N, n_params] normalized parameter vectors
y: [N] normalized mean rewards
GP: TinyGP fitted to (X, y)
```
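For reference, a minimal pure-numpy GP consistent with this state; the project's actual TinyGP may differ in kernel choice and hyperparameters:

```python
import numpy as np

class TinyGP:
    """Minimal GP regression sketch with a unit-variance RBF kernel."""

    def __init__(self, length_scale=0.3, noise=1e-3):
        self.ls, self.noise = length_scale, noise

    def _kernel(self, A, B):
        # RBF kernel on normalized parameter vectors
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / self.ls**2)

    def fit(self, X, y):
        self.X, self.y = X, y
        K = self._kernel(X, X) + self.noise * np.eye(len(X))
        self.K_inv = np.linalg.inv(K)

    def predict(self, Xs):
        Ks = self._kernel(Xs, self.X)
        mu = Ks @ self.K_inv @ self.y
        # Posterior variance: k(x,x) - Ks K^{-1} Ks^T (diagonal only)
        var = 1.0 - np.einsum("ij,jk,ik->i", Ks, self.K_inv, Ks)
        return mu, np.sqrt(np.maximum(var, 0.0))
```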
---
## 5. Interface Design
### Runner CLI (`donkeycar_sb3_runner.py`)
```bash
python3 donkeycar_sb3_runner.py \
--agent ppo|dqn \
--env donkey-generated-roads-v0 \
--timesteps 10000 \
--eval-episodes 5 \
--n-steer 7 \
--n-throttle 3 \
--learning-rate 0.0003 \
--save-dir agent/models/trial-042 \
--seed 42 \
--reward-shaping
```
### Autoresearch Controller CLI
```bash
python3 autoresearch_controller.py \
--trials 100 \
--explore 2.0 \
--agent ppo \
--min-timesteps 5000 \
--max-timesteps 20000 \
--push-every 10
```
### Evaluation / Demo CLI (`evaluate_champion.py`)
```bash
python3 evaluate_champion.py \
--model agent/models/champion/model.zip \
--env donkey-mountain-track-v0 \
--episodes 10
```
---
## 6. Architecture Decisions
### Constraints
- **MUST:** Always call `env.close()` before process exit
- **MUST:** Save every trained model — never discard
- **MUST:** Use `evaluate_policy()` from SB3 for evaluation — not a custom loop
- **MUST:** Append to JSONL results — never overwrite
- **MUST:** All tests run without a live simulator
- **MUST NOT:** Use `model.save()` before `model` is defined
- **MUST NOT:** Run random actions in production inner loop (this was the original bug)
- **MUST NOT:** Remove the 2-second cooldown between jobs
- **PREFER:** PPO over DQN for continuous driving tasks (PPO handles continuous actions natively)
- **PREFER:** Pure numpy GP over sklearn to avoid dependency issues
- **PREFER:** Reward shaping enabled by default for speed optimization
- **ESCALATE:** If DonkeyCar gym API changes break env.reset() or env.step() signatures
- **ESCALATE:** If simulator port 9091 is unavailable at test time
- **ESCALATE:** If SB3 model save/load API changes between versions
### Known Challenges
1. **Simulator must be running:** All live training requires the DonkeyCar sim on port 9091. Tests must mock this.
2. **Episode length variance:** Episodes end at 100 steps or CTE > 8. Mean reward has high variance across episodes.
3. **Random seed handling:** DonkeyCar gym.reset() signature differs between Gym and Gymnasium versions.
4. **Model size:** PPO models with CNN policy on 120x160x3 images can be large (>100MB). Consider git LFS or exclude from git.
### Rejected Approaches
| Rejected option | Why rejected | Scope |
|-----------------|-------------|-------|
| Random action inner loop | Produces meaningless reward signal — cannot optimize for trained driving | project |
| sklearn GP | Adds sklearn dependency, compatibility issues found previously | project |
| DQN for continuous actions | DQN requires discretized actions, PPO handles continuous natively | project |
| Grid sweep as primary search | Fixed grid misses best regions; GP+UCB finds n_steer=8, n_throttle=5 which was not in grid | project |
| 100/200 trial arbitrary batches | No principled stopping criterion; should use convergence detection instead | project |
| model.save() from legacy training function | `model` was undefined in that function, causing a NameError crash on every run for the project's entire history | project |
---
## 7. Phasing
### Phase 1: Real Training Foundation (CURRENT — implement first)
Core goal: make the inner loop actually train and save models.
- [ ] Rebuild `donkeycar_sb3_runner.py` with real PPO/DQN training + save
- [ ] Add speed-aware reward shaping wrapper
- [ ] Add proper `evaluate_policy()` evaluation
- [ ] Fix autoresearch controller to pass `learning_rate` to runner
- [ ] Add champion model tracking
- [ ] Write tests for all core logic
- [ ] Re-run autoresearch with real training (50 trials minimum)
### Phase 2: Generalization (after Phase 1 champion exists)
Core goal: the champion model drives ANY track.
- [ ] Multi-track evaluation script
- [ ] Curriculum learning: train on 2+ tracks
- [ ] Domain randomization wrapper
- [ ] Convergence detection in autoresearch (stop when GP uncertainty collapses)
- [ ] Automatic Gitea push every N trials
### Phase 3: Racing (after Phase 2 — generalization proven)
Core goal: fastest possible lap times.
- [ ] Lap time measurement and logging
- [ ] Reward function tuned for pure speed (with safety constraints)
- [ ] Fine-tuning from champion checkpoint on new tracks
- [ ] Head-to-head comparison: autoresearch champion vs human-tuned config
- [ ] Research paper / writeup structure
---
## 8. Reference Materials
### External Docs
- DonkeyCar Gym: https://github.com/tawnkramer/gym-donkeycar
- Stable-Baselines3: https://stable-baselines3.readthedocs.io/
- Gymnasium migration: https://gymnasium.farama.org/introduction/migration_guide/
### Existing Code to Learn From
- `agent/discretize_action.py` — action space wrapper (working, tested in production)
- `agent/autoresearch_controller.py` — GP+UCB loop (working, needs inner loop fix)
- `agent/outerloop-results/clean_sweep_results.jsonl` — 18 records of base data
- `agent/outerloop-results/autoresearch_results.jsonl` — 300 trial records (random policy — useful for discretization insights, NOT for learning_rate tuning)
### Anti-patterns (DO NOT REPEAT)
- Calling `model.save()` before `model` is defined — crashes with NameError
- Using `env.action_space.sample()` in the "training" loop — this is random, not RL
- Ignoring the `learning_rate` argument in the runner (was passed but unused for 300 trials)
- Arbitrary trial count limits — use convergence detection instead
- Not calling `env.close()` — causes simulator zombie/hang
---
## 9. Evaluation Design
### RL Eval Approach
Unlike deterministic software tests, RL reward is stochastic, so evaluation relies on repeated episodes and statistical criteria:
- Run N_EVAL_EPISODES per trial (default 5)
- Record mean ± std reward
- Champion = highest mean reward across all trials
- Convergence = GP uncertainty (sigma) drops below threshold across all candidates
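The convergence rule in the last bullet reduces to a one-line check; a sketch with an illustrative threshold:

```python
import numpy as np

def converged(sigma: np.ndarray, threshold: float = 0.05) -> bool:
    """Stop the autoresearch loop once GP posterior std has collapsed
    across ALL sampled candidates (nothing left worth exploring)."""
    return float(sigma.max()) < threshold
```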
### Test Cases (Simulator-Free)
#### TC-001: Action Space Encoding
**Input:** n_steer=5, n_throttle=3 → action index 7
**Expected:** Decoded to approximately (steer=0.0, throttle=0.5)
**Verification:** `pytest tests/test_discretize_action.py::test_decode_action`
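TC-001 worked by hand, assuming a steer-major index layout (`index = steer_idx * n_throttle + throttle_idx`) and evenly spaced grids, which reproduces the expected output:

```python
import numpy as np

n_steer, n_throttle, action = 5, 3, 7
steer_vals = np.linspace(-1.0, 1.0, n_steer)        # [-1, -0.5, 0, 0.5, 1]
throttle_vals = np.linspace(0.0, 1.0, n_throttle)   # [0, 0.5, 1]
steer = steer_vals[action // n_throttle]            # steer_idx = 2 -> 0.0
throttle = throttle_vals[action % n_throttle]       # throttle_idx = 1 -> 0.5
assert (steer, throttle) == (0.0, 0.5)
```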
#### TC-002: GP Fit and UCB Proposal
**Input:** 18 data points from clean_sweep_results.jsonl
**Expected:** GP proposes params with n_steer ∈ [6,9] and lr ∈ [0.001, 0.004] (the high-reward region identified in 300 trials)
**Verification:** `pytest tests/test_autoresearch_controller.py::test_ucb_proposal_in_high_reward_region`
#### TC-003: Param Encoding Round-Trip
**Input:** `{'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.002}`
**Expected:** encode → decode round-trip reproduces exact values (within int rounding)
**Verification:** `pytest tests/test_autoresearch_controller.py::test_param_roundtrip`
#### TC-004: Champion Tracking
**Input:** Trial sequence with rewards [50, 80, 60, 90, 70]
**Expected:** Champion is updated at trials 1, 2, 4 (rewards 50, 80, 90)
**Verification:** `pytest tests/test_autoresearch_controller.py::test_champion_tracking`
#### TC-005: Runner Exits Cleanly
**Input:** Mocked gym environment, 100 timesteps, PPO
**Expected:** Runner completes, calls env.close(), exits with code 0, model.zip exists
**Verification:** `pytest tests/test_runner_integration.py::test_runner_exits_cleanly`
### Regression Baselines
Saved after Phase 1 completion:
- `best_params_after_300_random_trials.json` — discretization insight baseline
- `champion_reward_phase1.txt` — first real training champion reward