# Project Specification — DonkeyCar RL Autoresearch

**Version:** 1.0.0
**Date:** 2026-04-13
**Owner:** paulh
**Status:** Active

---

## 1. Project Overview

### What are we building?

An end-to-end autonomous research and training system for DonkeyCar reinforcement learning agents. The system:

1. Trains DQN/PPO RL agents in the DonkeyCar simulator using Stable-Baselines3
2. Saves the best-performing models to disk after every training run
3. Uses a Gaussian Process + UCB Bayesian autoresearch controller to intelligently propose and evaluate new hyperparameter configurations — learning from every run
4. Produces a champion model capable of driving a DonkeyCar on any track at maximum speed with minimum cross-track error

The project replaces manual hyperparameter tuning and random grid sweeps with a self-directing autoresearch loop that gets smarter with each trial.

### Why does it matter?

Manual hyperparameter search for RL is slow, expensive, and non-systematic. The DonkeyCar task (fast, stable lap driving that generalizes across tracks) requires careful tuning of the action space, reward function, and learning parameters. A Bayesian autoresearch loop:

- Finds better configurations than grid search with fewer trials
- Discovers non-obvious parameter regions (e.g., n_steer=8, n_throttle=5 emerged from autoresearch, not from the grid)
- Creates a reproducible, logged, version-controlled research artifact
- Enables unattended overnight experimentation with full observability

### Success Criteria

- [ ] Inner loop trains a real PPO/DQN model for a configurable number of timesteps and saves the best model to disk
- [ ] Autoresearch controller proposes hyperparameters using GP+UCB and evaluates trained models (not random policy)
- [ ] Champion model (highest eval reward across all trials) is saved separately and can be loaded for demonstration
- [ ] Champion model can complete at least one lap on the training track with mean_reward > 100
- [ ] Champion model generalizes to at least one unseen track (mean_reward > 50 on eval track)
- [ ] All results are logged, versioned, and pushed to Gitea automatically
- [ ] System can run unattended overnight with zero hangs or zombie processes
- [ ] Full documentation exists: PRD, architecture, decisions, implementation plan, evals

---

## 2. Technical Foundation

### Tech stack

- **Language:** Python 3.10
- **RL Framework:** Stable-Baselines3 (SB3) — PPO and DQN
- **Simulator:** DonkeyCar Gym (gym_donkeycar) running locally on port 9091
- **Gym Interface:** Gymnasium (gymnasium)
- **Surrogate Model:** Pure numpy Gaussian Process (TinyGP — no sklearn required)
- **Action Wrapper:** Custom DiscretizedActionWrapper (discretize_action.py)
- **Version Control:** Git + Gitea (https://paje.ca/git/paulh/donkeycar-rl-autoresearch)
- **Test Framework:** pytest
- **Logging:** JSON Lines (JSONL) + human-readable log files

### Project Structure

```
donkeycar-rl-autoresearch/
├── AGENT.md                         ← Agent instructions (this harness)
├── PROJECT-SPEC.md                  ← This file
├── DECISIONS.md                     ← Architecture Decision Records
├── IMPLEMENTATION_PLAN.md           ← Master task backlog
├── README.md                        ← Project overview
├── .gitignore
├── .harness/
│   ├── EXECUTION_MASTER.md          ← Wave/stream dashboard
│   ├── templates/                   ← Harness templates
│   ├── regression-baselines/        ← Saved eval baselines
│   └── /
│       ├── execution-board.md
│       ├── process-eval.md
│       └── validation/
├── agent/
│   ├── autoresearch_controller.py   ← GP+UCB autoresearch loop
│   ├── donkeycar_sb3_runner.py      ← Inner loop: real training + model save
│   ├── donkeycar_outer_loop.py      ← Grid sweep (legacy baseline)
│   ├── discretize_action.py         ← Action space wrapper
│   ├── outerloop-results/
│   │   ├── clean_sweep_results.jsonl    ← Base sweep data (18 records)
│   │   ├── autoresearch_results.jsonl   ← Autoresearch trial results
│   │   └── autoresearch_log.txt         ← Human-readable autoresearch log
│   └── models/
│       ├── champion/                ← Best model across all trials
│       └── trial-<NNN>/             ← Per-trial saved models
└── tests/
    ├── test_discretize_action.py
    ├── test_autoresearch_controller.py
    └── test_runner_integration.py
```

### Build & Test Commands

```bash
# Run all tests
cd /home/paulh/projects/donkeycar-rl-autoresearch
python3 -m pytest tests/ -v

# Run autoresearch controller (requires sim running on port 9091)
cd agent && python3 autoresearch_controller.py --trials 50

# Run single training trial manually
cd agent && python3 donkeycar_sb3_runner.py --agent ppo --timesteps 10000 --eval-episodes 5

# Check Gitea push
cd /home/paulh/projects/donkeycar-rl-autoresearch && git push
```

### Coding Standards

- All output uses `flush=True` for real-time log visibility
- Every process must call `env.close()` and `time.sleep(2)` before exit (proven zombie prevention; sketched below)
- All results are appended to JSONL files — never overwritten
- Model saves use `model.save(path)` from the SB3 standard API
- Champion model tracking: autoresearch writes `champion_model_path` to the results JSONL
- No `model.save()` calls on undefined variables — always check the model exists before saving
- Python only — no TypeScript, no Node
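The shutdown discipline above is worth pinning down as a shape. Below is a minimal sketch of it; `FakeEnv` is a stand-in for the real gym_donkeycar environment, and the body of `main()` is a placeholder, not the project's runner:

```python
import sys
import time

class FakeEnv:
    """Stand-in for the gym_donkeycar env (built via gym.make() in the real runner)."""
    def close(self) -> None:
        print("env closed", flush=True)

def main() -> int:
    env = FakeEnv()
    try:
        # ... real training / evaluation would happen here ...
        print("trial complete", flush=True)   # flush=True for real-time log visibility
        return 0
    except Exception as exc:
        print(f"trial failed: {exc}", flush=True)
        return 1
    finally:
        env.close()    # MUST: close the env before process exit
        time.sleep(2)  # 2-second cooldown (proven zombie prevention)

if __name__ == "__main__":
    sys.exit(main())
```

The `try/finally` shape guarantees `env.close()` runs on both success and failure paths, which is exactly what the zombie-prevention rule requires.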
---

## 3. Requirements

### Functional Requirements

#### FR-001: Real RL Training in Inner Loop

**Description:** The inner RL runner (`donkeycar_sb3_runner.py`) must actually train a PPO or DQN model using `model.learn(total_timesteps=N)`, not run random actions. (A combined train/save/eval sketch follows FR-003.)

**Acceptance criteria:**

- [ ] Given `--agent ppo --timesteps 10000`, the runner trains a PPO model for 10000 steps
- [ ] Training uses the `learning_rate` argument passed from the autoresearch controller
- [ ] Training uses the discretized action space (n_steer, n_throttle) when DQN is used
- [ ] PPO runs with continuous actions (no discretization needed)
- [ ] Training completes without hanging and exits with code 0

#### FR-002: Model Saving

**Description:** After each training run, the trained model is saved to disk.

**Acceptance criteria:**

- [ ] Model saved to `agent/models/trial-<NNN>/model.zip` after every successful run
- [ ] If eval reward is the best seen so far, model is also copied to `agent/models/champion/model.zip`
- [ ] Save path is logged to the JSONL results file
- [ ] Model can be loaded with `PPO.load()` or `DQN.load()` for subsequent evaluation

#### FR-003: Real Policy Evaluation

**Description:** After training, the model is evaluated using the learned policy (not random actions).

**Acceptance criteria:**

- [ ] `evaluate_policy(model, env, n_eval_episodes=N)` is used for evaluation
- [ ] Mean reward and std reward are both recorded
- [ ] Evaluation uses the same action wrapper as training
- [ ] Per-episode rewards are printed for full observability
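FR-001 through FR-003 compose into a single inner-loop shape. The sketch below uses the real SB3 entry points (`PPO`, `model.learn()`, `model.save()`, `evaluate_policy()`) but runs against a stand-in Gymnasium env (`Pendulum-v1`), since examples must not require the simulator; the function name, save path, and hyperparameter values are illustrative:

```python
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.monitor import Monitor

def train_save_eval(learning_rate: float, timesteps: int, save_path: str):
    # Stand-in env; the real runner targets gym_donkeycar on port 9091.
    env = Monitor(gym.make("Pendulum-v1"))
    try:
        # FR-001: real training, honoring the learning_rate from the controller.
        model = PPO("MlpPolicy", env, learning_rate=learning_rate, verbose=0)
        model.learn(total_timesteps=timesteps)

        # FR-002: save the trained model to disk (SB3 writes a .zip archive).
        model.save(save_path)

        # FR-003: evaluate the learned policy, not random actions.
        mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=5)
        print(f"mean={mean_reward:.2f} std={std_reward:.2f}", flush=True)
        return mean_reward, std_reward
    finally:
        env.close()  # MUST per the coding standards

if __name__ == "__main__":
    train_save_eval(learning_rate=3e-4, timesteps=2048, save_path="/tmp/ppo_trial_demo")
```

Note that the save happens before evaluation, so a crash during eval can never lose a trained model.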
#### FR-004: Autoresearch GP+UCB Controller

**Description:** The autoresearch controller proposes hyperparameters using Gaussian Process + UCB acquisition, learning from prior results.

**Acceptance criteria:**

- [ ] Controller loads ALL prior results (base sweep + autoresearch history) at startup
- [ ] GP is fit on encoded (normalized) parameter vectors and corresponding eval rewards
- [ ] UCB acquisition = GP mean + kappa * GP std
- [ ] Next trial parameters maximize UCB over N_CANDIDATES random samples
- [ ] Controller logs top-5 UCB candidates before each trial
- [ ] Controller correctly handles the first 2 trials (insufficient data for a GP — uses random sampling)

#### FR-005: Champion Model Tracking

**Description:** The system maintains a single "champion" model — the best-performing model across all trials.

**Acceptance criteria:**

- [ ] After each trial, if `mean_reward > current_best`, the model is saved as champion
- [ ] Champion metadata (params, reward, trial number, timestamp) saved to `champion_manifest.json`
- [ ] Champion model path is stable: `agent/models/champion/model.zip`
- [ ] Champion can be loaded and demonstrated without retraining

#### FR-006: Speed-Aware Reward Shaping

**Description:** The reward function incentivizes speed, not just staying on track.

**Acceptance criteria:**

- [ ] Custom reward wrapper computes: `reward = speed * (1 - abs(cte) / max_cte)` (sketched below)
- [ ] Speed and CTE values are accessible from the DonkeyCar info dict
- [ ] Reward wrapper is optional (enabled via `--reward-shaping` flag)
- [ ] Without the flag, the default DonkeyCar reward is used unchanged
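A minimal sketch of the FR-006 wrapper, assuming the Gymnasium five-tuple `step()` API and the `speed`/`cte` info keys described above; `SpeedRewardWrapper` and the `max_cte=8.0` default are illustrative choices, not the project's actual class:

```python
import gymnasium as gym

class SpeedRewardWrapper(gym.Wrapper):
    """FR-006 sketch: reward = speed * (1 - |cte| / max_cte)."""

    def __init__(self, env: gym.Env, max_cte: float = 8.0):
        super().__init__(env)
        self.max_cte = max_cte  # 8.0 matches the CTE cutoff noted in Known Challenges

    def step(self, action):
        obs, _reward, terminated, truncated, info = self.env.step(action)
        speed = info.get("speed", 0.0)  # keys per the DonkeyCar info dict
        cte = info.get("cte", 0.0)
        shaped = speed * (1.0 - min(abs(cte), self.max_cte) / self.max_cte)
        return obs, shaped, terminated, truncated, info
```

Keeping the wrapper behind the `--reward-shaping` flag means the runner only wraps the env when the flag is set; otherwise the default DonkeyCar reward passes through untouched.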
#### FR-007: Multi-Track Generalization Evaluation

**Description:** The champion model is evaluated on at least one track it was NOT trained on.

**Acceptance criteria:**

- [ ] Evaluation script accepts a `--track` argument to specify the evaluation track
- [ ] Champion model is loaded and evaluated for N episodes on the specified track
- [ ] Results (mean_reward, per-episode rewards) are logged
- [ ] Generalization gap (train_reward - eval_reward) is reported

#### FR-008: Autoresearch Results Logging

**Description:** Every trial produces a complete, structured result record.

**Acceptance criteria:**

- [ ] JSONL record includes: trial_id, timestamp, params, mean_reward, std_reward, model_path, champion_flag, elapsed_sec, run_status
- [ ] Autoresearch log (human-readable) is updated after every trial
- [ ] Results file is never truncated — only appended
- [ ] Results are pushed to Gitea after every N trials (configurable, default 10)

#### FR-009: Unattended Overnight Operation

**Description:** The system runs for 100+ trials without hanging, zombie processes, or data loss.

**Acceptance criteria:**

- [ ] Every job calls `env.close()` before exit
- [ ] 2-second cooldown between jobs prevents race conditions
- [ ] Stale process kill (`pkill -9 -f donkeycar_sb3_runner.py`) before each new job
- [ ] 6-minute timeout per job — killed and logged if exceeded
- [ ] System auto-resumes from existing results if restarted mid-sweep

#### FR-010: Test Suite

**Description:** Core logic is covered by automated tests that don't require the simulator.

**Acceptance criteria:**

- [ ] `test_discretize_action.py` — tests action space wrapping correctness
- [ ] `test_autoresearch_controller.py` — tests GP fitting, UCB computation, param encoding/decoding
- [ ] `test_runner_integration.py` — mocked simulator test of the training + save + eval cycle
- [ ] All tests pass with `pytest tests/ -v`
- [ ] No tests require a running simulator

### Non-Functional Requirements

#### NFR-001: Performance

- [ ] Each training trial completes in < 6 minutes for 10000 timesteps
- [ ] GP fitting on 300 data points completes in < 2 seconds
- [ ] System does not consume > 8GB RAM per trial

#### NFR-002: Robustness

- [ ] Zero hanging jobs across 100 consecutive trials
- [ ] All errors are caught, logged, and do not crash the autoresearch loop
- [ ] System correctly handles sim disconnection and logs the failure

#### NFR-003: Reproducibility

- [ ] All results are version-controlled in Gitea
- [ ] Every trial records the exact parameters used
- [ ] Results are deterministic given the same seed (seed support in runner)

#### NFR-004: Observability

- [ ] Real-time per-step reward printing during training and evaluation
- [ ] Per-trial summary logged to both console and file
- [ ] Running champion summary printed after every trial

---

## 4. Data Model

### Trial Result Record (JSONL)

```json
{
  "trial": 42,
  "timestamp": "2026-04-13T03:14:15.926535",
  "params": {
    "agent": "ppo",
    "n_steer": 7,
    "n_throttle": 3,
    "learning_rate": 0.0003,
    "timesteps": 10000,
    "eval_episodes": 5,
    "reward_shaping": false
  },
  "mean_reward": 127.45,
  "std_reward": 18.3,
  "model_path": "agent/models/trial-042/model.zip",
  "champion": true,
  "elapsed_sec": 187.4,
  "run_status": "ok"
}
```

### Champion Manifest (`agent/models/champion/manifest.json`)

```json
{
  "trial": 42,
  "timestamp": "2026-04-13T03:14:15.926535",
  "params": { "..." },
  "mean_reward": 127.45,
  "model_path": "agent/models/champion/model.zip"
}
```

### GP State (in-memory, rebuilt each iteration from JSONL)

```
X: [N, n_params] normalized parameter vectors
y: [N] normalized mean rewards
GP: TinyGP fitted to (X, y)
```
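For reference, the fit-then-maximize-UCB step over this state can be written in pure numpy. This is not the TinyGP source; it is a minimal RBF-kernel stand-in with illustrative hyperparameters (`length_scale`, `noise`, `kappa`) that follows the FR-004 acquisition rule (GP mean + kappa * GP std, maximized over random candidates):

```python
import numpy as np

def rbf_kernel(A: np.ndarray, B: np.ndarray, length_scale: float = 0.3) -> np.ndarray:
    """Squared-exponential kernel on normalized parameter vectors."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale**2)

def gp_posterior(X, y, Xs, noise=1e-3):
    """GP posterior mean/std at candidate points Xs, fitted to (X, y)."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)                 # shape (N, M)
    alpha = np.linalg.solve(K, y)
    mu = Ks.T @ alpha
    v = np.linalg.solve(K, Ks)
    var = 1.0 - np.sum(Ks * v, axis=0)     # RBF prior variance is 1
    return mu, np.sqrt(np.clip(var, 0.0, None))

def propose_next(X, y, n_candidates=2000, kappa=2.0, rng=None):
    """UCB acquisition: mean + kappa * std, maximized over random candidates."""
    rng = rng or np.random.default_rng(0)
    cands = rng.uniform(0.0, 1.0, size=(n_candidates, X.shape[1]))
    mu, sigma = gp_posterior(X, y, cands)
    ucb = mu + kappa * sigma
    return cands[np.argmax(ucb)], ucb.max()

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.uniform(size=(18, 3))           # e.g., encoded (n_steer, n_throttle, lr)
    y = -((X - 0.6) ** 2).sum(axis=1)       # toy objective peaking at 0.6
    x_next, score = propose_next(X, y, kappa=2.0, rng=rng)
    print("next candidate:", np.round(x_next, 3), "UCB:", round(float(score), 3))
```

Raising `kappa` (the `--explore` flag in the controller CLI) widens the search; lowering it exploits the current best region.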
---

## 5. Interface Design

### Runner CLI (`donkeycar_sb3_runner.py`)

```bash
python3 donkeycar_sb3_runner.py \
  --agent ppo|dqn \
  --env donkey-generated-roads-v0 \
  --timesteps 10000 \
  --eval-episodes 5 \
  --n-steer 7 \
  --n-throttle 3 \
  --learning-rate 0.0003 \
  --save-dir agent/models/trial-042 \
  --seed 42 \
  --reward-shaping
```

### Autoresearch Controller CLI

```bash
python3 autoresearch_controller.py \
  --trials 100 \
  --explore 2.0 \
  --agent ppo \
  --min-timesteps 5000 \
  --max-timesteps 20000 \
  --push-every 10
```

### Evaluation / Demo CLI (`evaluate_champion.py`)

```bash
python3 evaluate_champion.py \
  --model agent/models/champion/model.zip \
  --env donkey-mountain-track-v0 \
  --episodes 10
```

---

## 6. Architecture Decisions

### Constraints

- **MUST:** Always call `env.close()` before process exit
- **MUST:** Save every trained model — never discard
- **MUST:** Use `evaluate_policy()` from SB3 for evaluation — not a custom loop
- **MUST:** Append to JSONL results — never overwrite
- **MUST:** All tests run without a live simulator
- **MUST NOT:** Use `model.save()` before `model` is defined
- **MUST NOT:** Run random actions in the production inner loop (this was the original bug)
- **MUST NOT:** Remove the 2-second cooldown between jobs
- **PREFER:** PPO over DQN for continuous driving tasks (better suited)
- **PREFER:** Pure numpy GP over sklearn to avoid dependency issues
- **PREFER:** Reward shaping enabled by default for speed optimization
- **ESCALATE:** If DonkeyCar gym API changes break env.reset() or env.step() signatures
- **ESCALATE:** If simulator port 9091 is unavailable at test time
- **ESCALATE:** If the SB3 model save/load API changes between versions

### Known Challenges

1. **Simulator must be running:** All live training requires the DonkeyCar sim on port 9091. Tests must mock this.
2. **Episode length variance:** Episodes end at 100 steps or CTE > 8. Mean reward has high variance across episodes.
3. **Random seed handling:** The DonkeyCar gym reset() signature differs between Gym and Gymnasium versions (a compatibility shim is sketched at the end of this section).
4. **Model size:** PPO models with a CNN policy on 120x160x3 images can be large (>100MB). Consider git LFS or exclude them from git.

### Rejected Approaches

| Rejected option | Why rejected | Scope |
|-----------------|--------------|-------|
| Random action inner loop | Produces a meaningless reward signal — cannot optimize for trained driving | project |
| sklearn GP | Adds a sklearn dependency; compatibility issues found previously | project |
| DQN for continuous actions | DQN requires discretized actions; PPO handles continuous natively | project |
| Grid sweep as primary search | A fixed grid misses the best regions; GP+UCB found n_steer=8, n_throttle=5, which was not in the grid | project |
| 100/200 trial arbitrary batches | No principled stopping criterion; should use convergence detection instead | project |
| model.save() from legacy training function | `model` was undefined — caused a NameError crash on every run for the entire history | project |
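One plausible shim for challenge 3: old Gym's `reset()` returns only `obs` and takes no `seed` kwarg, while Gymnasium's returns `(obs, info)` and accepts `seed=`. This is an illustrative sketch, not the project's actual code:

```python
def reset_compat(env, seed=None):
    """Normalize env.reset() across old Gym and Gymnasium, returning (obs, info)."""
    try:
        result = env.reset(seed=seed)
    except TypeError:  # old Gym: reset() takes no seed kwarg
        if seed is not None and hasattr(env, "seed"):
            env.seed(seed)  # old Gym seeded via a separate method
        result = env.reset()
    if isinstance(result, tuple) and len(result) == 2:
        return result      # Gymnasium already returns (obs, info)
    return result, {}      # old Gym returned obs alone
```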
---

## 7. Phasing

### Phase 1: Real Training Foundation (CURRENT — implement first)

Core goal: make the inner loop actually train and save models.

- [ ] Rebuild `donkeycar_sb3_runner.py` with real PPO/DQN training + save
- [ ] Add speed-aware reward shaping wrapper
- [ ] Add proper `evaluate_policy()` evaluation
- [ ] Fix autoresearch controller to pass `learning_rate` to the runner
- [ ] Add champion model tracking
- [ ] Write tests for all core logic
- [ ] Re-run autoresearch with real training (50 trials minimum)

### Phase 2: Generalization (after Phase 1 champion exists)

Core goal: the champion model drives ANY track.

- [ ] Multi-track evaluation script
- [ ] Curriculum learning: train on 2+ tracks
- [ ] Domain randomization wrapper
- [ ] Convergence detection in autoresearch (stop when GP uncertainty collapses)
- [ ] Automatic Gitea push every N trials

### Phase 3: Racing (after Phase 2 — generalization proven)

Core goal: fastest possible lap times.

- [ ] Lap time measurement and logging
- [ ] Reward function tuned for pure speed (with safety constraints)
- [ ] Fine-tuning from champion checkpoint on new tracks
- [ ] Head-to-head comparison: autoresearch champion vs human-tuned config
- [ ] Research paper / writeup structure

---

## 8. Reference Materials

### External Docs

- DonkeyCar Gym: https://github.com/tawnkramer/gym-donkeycar
- Stable-Baselines3: https://stable-baselines3.readthedocs.io/
- Gymnasium migration: https://gymnasium.farama.org/introduction/migration_guide/

### Existing Code to Learn From

- `agent/discretize_action.py` — action space wrapper (working, tested in production)
- `agent/autoresearch_controller.py` — GP+UCB loop (working, needs the inner loop fix)
- `agent/outerloop-results/clean_sweep_results.jsonl` — 18 records of base data
- `agent/outerloop-results/autoresearch_results.jsonl` — 300 trial records (random policy — useful for discretization insights, NOT for learning_rate tuning)

### Anti-patterns (DO NOT REPEAT)

- Calling `model.save()` before `model` is defined — crashes with NameError
- Using `env.action_space.sample()` in the "training" loop — this is random, not RL
- Ignoring the `learning_rate` argument in the runner (it was passed but unused for 300 trials)
- Arbitrary trial count limits — use convergence detection instead
- Not calling `env.close()` — causes simulator zombies/hangs

---

## 9. Evaluation Design

### RL Eval Approach

Unlike software unit tests, RL reward is stochastic. Evaluation strategy:

- Run N_EVAL_EPISODES per trial (default 5)
- Record mean ± std reward
- Champion = highest mean reward across all trials
- Convergence = GP uncertainty (sigma) drops below a threshold across all candidates

### Test Cases (Simulator-Free)

#### TC-001: Action Space Encoding

**Input:** n_steer=5, n_throttle=3 → action index 7
**Expected:** Decoded to approximately (steer=0.0, throttle=0.5)
**Verification:** `pytest tests/test_discretize_action.py::test_decode_action`

#### TC-002: GP Fit and UCB Proposal

**Input:** 18 data points from clean_sweep_results.jsonl
**Expected:** GP proposes params with n_steer ∈ [6,9] and lr ∈ [0.001, 0.004] (the high-reward region identified in 300 trials)
**Verification:** `pytest tests/test_autoresearch_controller.py::test_ucb_proposal_in_high_reward_region`

#### TC-003: Param Encoding Round-Trip

**Input:** `{'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.002}`
**Expected:** encode → decode round-trip reproduces exact values (within int rounding)
**Verification:** `pytest tests/test_autoresearch_controller.py::test_param_roundtrip`

#### TC-004: Champion Tracking

**Input:** Trial sequence with rewards [50, 80, 60, 90, 70]
**Expected:** Champion is updated at trials 1, 2, 4 (rewards 50, 80, 90)
**Verification:** `pytest tests/test_autoresearch_controller.py::test_champion_tracking` (the update rule is sketched at the end of this document)

#### TC-005: Runner Exits Cleanly

**Input:** Mocked gym environment, 100 timesteps, PPO
**Expected:** Runner completes, calls env.close(), exits with code 0, model.zip exists
**Verification:** `pytest tests/test_runner_integration.py::test_runner_exits_cleanly`

### Regression Baselines

Saved after Phase 1 completion:

- `best_params_after_300_random_trials.json` — discretization insight baseline
- `champion_reward_phase1.txt` — first real training champion reward
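For illustration, the champion-update rule that TC-004 pins down fits in a few lines. `champion_updates` is a hypothetical helper written for this sketch; the real assertion belongs in `tests/test_autoresearch_controller.py`:

```python
def champion_updates(rewards):
    """Return the 1-based trial indices where a new champion is crowned (FR-005 rule)."""
    best = float("-inf")
    updates = []
    for trial, reward in enumerate(rewards, start=1):
        if reward > best:  # strictly better, matching FR-005's mean_reward > current_best
            best = reward
            updates.append(trial)
    return updates

def test_champion_tracking():
    # TC-004: rewards [50, 80, 60, 90, 70] -> champions at trials 1, 2, 4
    assert champion_updates([50, 80, 60, 90, 70]) == [1, 2, 4]

if __name__ == "__main__":
    test_champion_tracking()
    print("TC-004 rule holds", flush=True)
```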