Project Specification — DonkeyCar RL Autoresearch
Version: 1.0.0
Date: 2026-04-13
Owner: paulh
Status: Active
1. Project Overview
What are we building?
An end-to-end autonomous research and training system for DonkeyCar reinforcement learning agents. The system:
- Trains DQN/PPO RL agents in the DonkeyCar simulator using Stable-Baselines3
- Saves the best-performing models to disk after every training run
- Uses a Gaussian Process + UCB Bayesian autoresearch controller to intelligently propose and evaluate new hyperparameter configurations — learning from every run
- Produces a champion model capable of driving a DonkeyCar on any track at maximum speed with minimum cross-track error
The project replaces manual hyperparameter tuning and random grid sweeps with a self-directing autoresearch loop that gets smarter with each trial.
Why does it matter?
Manual hyperparameter search for RL is slow, expensive, and non-systematic. The DonkeyCar task (fast, stable lap driving generalizable across tracks) requires careful tuning of the action space, reward function, and learning parameters. A Bayesian autoresearch loop:
- Finds better configurations than grid search with fewer trials
- Discovers non-obvious parameter regions (e.g., n_steer=8, n_throttle=5 emerged from autoresearch, not from the grid)
- Creates a reproducible, logged, version-controlled research artifact
- Enables unattended overnight experimentation with full observability
Success Criteria
- Inner loop trains a real PPO/DQN model for a configurable number of timesteps and saves the best model to disk
- Autoresearch controller proposes hyperparameters using GP+UCB and evaluates trained models (not random policy)
- Champion model (highest eval reward across all trials) is saved separately and can be loaded for demonstration
- Champion model can complete at least one lap on the training track with mean_reward > 100
- Champion model generalizes to at least one unseen track (mean_reward > 50 on eval track)
- All results are logged, versioned, and pushed to Gitea automatically
- System can run unattended overnight with zero hangs or zombie processes
- Full documentation exists: PRD, architecture, decisions, implementation plan, evals
2. Technical Foundation
Tech stack
- Language: Python 3.10
- RL Framework: Stable-Baselines3 (SB3) — PPO and DQN
- Simulator: DonkeyCar Gym (gym_donkeycar) running locally on port 9091
- Gym Interface: Gymnasium (gymnasium)
- Surrogate Model: Pure numpy Gaussian Process (TinyGP — no sklearn required)
- Action Wrapper: Custom DiscretizedActionWrapper (discretize_action.py)
- Version Control: Git + Gitea (https://paje.ca/git/paulh/donkeycar-rl-autoresearch)
- Test Framework: pytest
- Logging: JSON Lines (JSONL) + human-readable log files
Project Structure
donkeycar-rl-autoresearch/
├── AGENT.md ← Agent instructions (this harness)
├── PROJECT-SPEC.md ← This file
├── DECISIONS.md ← Architecture Decision Records
├── IMPLEMENTATION_PLAN.md ← Master task backlog
├── README.md ← Project overview
├── .gitignore
├── .harness/
│ ├── EXECUTION_MASTER.md ← Wave/stream dashboard
│ ├── templates/ ← Harness templates
│ ├── regression-baselines/ ← Saved eval baselines
│ └── <stream-name>/
│ ├── execution-board.md
│ ├── process-eval.md
│ └── validation/
├── agent/
│ ├── autoresearch_controller.py ← GP+UCB autoresearch loop
│ ├── donkeycar_sb3_runner.py ← Inner loop: real training + model save
│ ├── donkeycar_outer_loop.py ← Grid sweep (legacy baseline)
│ ├── discretize_action.py ← Action space wrapper
│ ├── outerloop-results/
│ │ ├── clean_sweep_results.jsonl ← Base sweep data (18 records)
│ │ ├── autoresearch_results.jsonl ← Autoresearch trial results
│ │ └── autoresearch_log.txt ← Human-readable autoresearch log
│ └── models/
│ ├── champion/ ← Best model across all trials
│ └── trial-<N>/ ← Per-trial saved models
└── tests/
├── test_discretize_action.py
├── test_autoresearch_controller.py
└── test_runner_integration.py
Build & Test Commands
# Run all tests
cd /home/paulh/projects/donkeycar-rl-autoresearch
python3 -m pytest tests/ -v
# Run autoresearch controller (requires sim running on port 9091)
cd agent && python3 autoresearch_controller.py --trials 50
# Run single training trial manually
cd agent && python3 donkeycar_sb3_runner.py --agent ppo --timesteps 10000 --eval-episodes 5
# Check Gitea push
cd /home/paulh/projects/donkeycar-rl-autoresearch && git push
Coding Standards
- All output uses `flush=True` for real-time log visibility
- Every process must call `env.close()` and `time.sleep(2)` before exit (proven zombie prevention)
- All results are appended to JSONL files — never overwritten
- Model saves use `model.save(path)` from the standard SB3 API
- Champion model tracking: autoresearch writes `champion_model_path` to the results JSONL
- No `model.save()` calls on undefined variables — always check the model exists before saving
- Python only — no TypeScript, no Node
3. Requirements
Functional Requirements
FR-001: Real RL Training in Inner Loop
Description: The inner RL runner (donkeycar_sb3_runner.py) must actually train a PPO or DQN model using model.learn(total_timesteps=N), not run random actions.
Acceptance criteria:
- Given `--agent ppo --timesteps 10000`, the runner trains a PPO model for 10000 steps
- Training uses the `learning_rate` argument passed from the autoresearch controller
- Training uses the discretized action space (n_steer, n_throttle) when DQN is used
- PPO runs with continuous actions (no discretization needed)
- Training completes without hanging and exits with code 0
FR-002: Model Saving
Description: After each training run, the trained model is saved to disk.
Acceptance criteria:
- Model saved to `agent/models/trial-<N>/model.zip` after every successful run
- If eval reward is the best seen so far, the model is also copied to `agent/models/champion/model.zip`
- Save path is logged to the JSONL results file
- Model can be loaded with `PPO.load()` or `DQN.load()` for subsequent evaluation
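A minimal sketch of this cycle against the SB3 API (paths illustrative; `env` is assumed to be an already-constructed DonkeyCar environment):

```python
import os
from stable_baselines3 import PPO

model = PPO("CnnPolicy", env, learning_rate=3e-4)
model.learn(total_timesteps=10_000)

# Save the trained model (SB3 appends .zip automatically).
os.makedirs("agent/models/trial-042", exist_ok=True)
model.save("agent/models/trial-042/model")

# Later, reload for evaluation or demonstration without retraining:
model = PPO.load("agent/models/trial-042/model", env=env)
```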
FR-003: Real Policy Evaluation
Description: After training, the model is evaluated using the learned policy (not random actions).
Acceptance criteria:
- `evaluate_policy(model, env, n_eval_episodes=N)` is used for evaluation
- Mean reward and std reward are both recorded
- Evaluation uses the same action wrapper as training
- Per-episode rewards are printed for full observability
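A sketch of how this could look with SB3's `evaluate_policy`, using `return_episode_rewards=True` to get the per-episode values the last criterion asks for (`model` and `eval_env` assumed to exist):

```python
import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy

# return_episode_rewards=True returns per-episode lists instead of aggregates.
episode_rewards, episode_lengths = evaluate_policy(
    model, eval_env, n_eval_episodes=5, return_episode_rewards=True
)
for i, r in enumerate(episode_rewards):
    print(f"episode {i}: reward={r:.2f}", flush=True)

mean_reward = float(np.mean(episode_rewards))
std_reward = float(np.std(episode_rewards))
print(f"mean={mean_reward:.2f} std={std_reward:.2f}", flush=True)
```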
FR-004: Autoresearch GP+UCB Controller
Description: The autoresearch controller proposes hyperparameters using Gaussian Process + UCB acquisition, learning from prior results.
Acceptance criteria:
- Controller loads ALL prior results (base sweep + autoresearch history) at startup
- GP is fit on encoded (normalized) parameter vectors and corresponding eval rewards
- UCB acquisition = GP mean + kappa * GP std
- Next trial parameters maximize UCB over N_CANDIDATES random samples
- Controller logs top-5 UCB candidates before each trial
- Controller correctly handles first 2 trials (insufficient data for GP — uses random sampling)
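The project's TinyGP is not reproduced here, but a minimal pure-numpy sketch of the GP fit and UCB proposal step could look like the following (kernel choice, length scale, noise level, and candidate sampling in the unit cube are all assumptions):

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.2):
    # Squared-exponential kernel over normalized parameter vectors in [0, 1].
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X, y, X_new, noise=1e-4):
    # Standard GP regression posterior: mean and std at X_new.
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    K_s = rbf_kernel(X, X_new)                      # shape [N, M]
    mu = K_s.T @ np.linalg.solve(K, y)
    # diag(K_ss) = 1 for this kernel, so the predictive variance is:
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def propose_next(X, y, n_candidates=2000, kappa=2.0, seed=None):
    # UCB acquisition = GP mean + kappa * GP std, maximized over random samples.
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(0.0, 1.0, size=(n_candidates, X.shape[1]))
    mu, sigma = gp_posterior(X, y, candidates)
    return candidates[np.argmax(mu + kappa * sigma)]
```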
FR-005: Champion Model Tracking
Description: The system maintains a single "champion" model — the best-performing model across all trials.
Acceptance criteria:
- After each trial, if `mean_reward > current_best`, the model is saved as champion
- Champion metadata (params, reward, trial number, timestamp) saved to `champion_manifest.json`
- Champion model path is stable: `agent/models/champion/model.zip`
- Champion can be loaded and demonstrated without retraining
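A sketch of the promotion step, matching the manifest schema in Section 4 (function name hypothetical):

```python
import json, os, shutil
from datetime import datetime

def maybe_promote_champion(trial, params, mean_reward, model_path, best_reward):
    # Only promote when the new trial strictly beats the running best.
    if mean_reward <= best_reward:
        return best_reward
    os.makedirs("agent/models/champion", exist_ok=True)
    shutil.copy(model_path, "agent/models/champion/model.zip")
    manifest = {
        "trial": trial,
        "timestamp": datetime.now().isoformat(),
        "params": params,
        "mean_reward": mean_reward,
        "model_path": "agent/models/champion/model.zip",
    }
    with open("agent/models/champion/manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
    return mean_reward
```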
FR-006: Speed-Aware Reward Shaping
Description: The reward function incentivizes speed, not just staying on track.
Acceptance criteria:
- Custom reward wrapper computes: `reward = speed * (1 - abs(cte) / max_cte)`
- Speed and CTE values are accessible from the DonkeyCar info dict
- Reward wrapper is optional (enabled via the `--reward-shaping` flag)
- Without the flag, the default DonkeyCar reward is used unchanged
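A sketch of such a wrapper under the Gymnasium 5-tuple step API, assuming the info dict exposes `speed` and `cte` keys as stated above (class name hypothetical; `max_cte=8.0` taken from the episode-termination threshold in Known Challenges):

```python
import gymnasium as gym

class SpeedRewardWrapper(gym.Wrapper):
    def __init__(self, env, max_cte=8.0):
        super().__init__(env)
        self.max_cte = max_cte

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        speed = info.get("speed", 0.0)
        cte = info.get("cte", 0.0)
        # reward = speed * (1 - abs(cte) / max_cte), clipped at zero
        shaped = speed * max(0.0, 1.0 - abs(cte) / self.max_cte)
        return obs, shaped, terminated, truncated, info
```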
FR-007: Multi-Track Generalization Evaluation
Description: The champion model is evaluated on at least one track it was NOT trained on.
Acceptance criteria:
- Evaluation script accepts a `--track` argument to specify the evaluation track
- Champion model is loaded and evaluated for N episodes on the specified track
- Results (mean_reward, per-episode rewards) are logged
- Generalization gap (train_reward - eval_reward) is reported
FR-008: Autoresearch Results Logging
Description: Every trial produces a complete, structured result record.
Acceptance criteria:
- JSONL record includes: trial_id, timestamp, params, mean_reward, std_reward, model_path, champion_flag, elapsed_sec, run_status
- Autoresearch log (human-readable) is updated after every trial
- Results file is never truncated — only appended
- Results are pushed to Gitea after every N trials (configurable, default 10)
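The append-only requirement reduces to opening the results file in append mode; a minimal sketch:

```python
import json

def append_result(path, record):
    # Mode "a" appends one JSON object per line and never truncates the file.
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```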
FR-009: Unattended Overnight Operation
Description: The system runs for 100+ trials without hanging, zombie processes, or data loss.
Acceptance criteria:
- Every job calls `env.close()` before exit
- 2-second cooldown between jobs prevents race conditions
- Stale process kill (`pkill -9 -f donkeycar_sb3_runner.py`) before each new job
- 6-minute timeout per job — killed and logged if exceeded
- System auto-resumes from existing results if restarted mid-sweep
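A sketch of the per-job hygiene these criteria describe (helper name and command list illustrative):

```python
import subprocess
import time

def run_trial(cmd, timeout_sec=360):
    # Kill any stale runner left over from a previous job.
    subprocess.run(["pkill", "-9", "-f", "donkeycar_sb3_runner.py"])
    time.sleep(2)  # cooldown: lets the simulator release port 9091
    try:
        proc = subprocess.run(cmd, timeout=timeout_sec)
        return proc.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        print(f"trial exceeded {timeout_sec}s timeout, killed", flush=True)
        return -1
```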
FR-010: Test Suite
Description: Core logic is covered by automated tests that don't require the simulator.
Acceptance criteria:
- `test_discretize_action.py` — tests action space wrapping correctness
- `test_autoresearch_controller.py` — tests GP fitting, UCB computation, param encoding/decoding
- `test_runner_integration.py` — mocked simulator test of the training + save + eval cycle
- All tests pass with `pytest tests/ -v`
- No tests require a running simulator
Non-Functional Requirements
NFR-001: Performance
- Each training trial completes in < 6 minutes for 10000 timesteps
- GP fitting on 300 data points completes in < 2 seconds
- System does not consume > 8GB RAM per trial
NFR-002: Robustness
- Zero hanging jobs across 100 consecutive trials
- All errors are caught, logged, and do not crash the autoresearch loop
- System correctly handles sim disconnection and logs the failure
NFR-003: Reproducibility
- All results are version-controlled in Gitea
- Every trial records the exact parameters used
- Results are deterministic given the same seed (seed support in runner)
NFR-004: Observability
- Real-time per-step reward printing during training and evaluation
- Per-trial summary logged to both console and file
- Running champion summary printed after every trial
4. Data Model
Trial Result Record (JSONL)
{
"trial": 42,
"timestamp": "2026-04-13T03:14:15.926535",
"params": {
"agent": "ppo",
"n_steer": 7,
"n_throttle": 3,
"learning_rate": 0.0003,
"timesteps": 10000,
"eval_episodes": 5,
"reward_shaping": false
},
"mean_reward": 127.45,
"std_reward": 18.3,
"model_path": "agent/models/trial-042/model.zip",
"champion": true,
"elapsed_sec": 187.4,
"run_status": "ok"
}
Champion Manifest (agent/models/champion/manifest.json)
{
"trial": 42,
"timestamp": "2026-04-13T03:14:15.926535",
"params": { "..." },
"mean_reward": 127.45,
"model_path": "agent/models/champion/model.zip"
}
GP State (in-memory, rebuilt each iteration from JSONL)
X: [N, n_params] normalized parameter vectors
y: [N] normalized mean rewards
GP: TinyGP fitted to (X, y)
5. Interface Design
Runner CLI (donkeycar_sb3_runner.py)
python3 donkeycar_sb3_runner.py \
--agent ppo|dqn \
--env donkey-generated-roads-v0 \
--timesteps 10000 \
--eval-episodes 5 \
--n-steer 7 \
--n-throttle 3 \
--learning-rate 0.0003 \
--save-dir agent/models/trial-042 \
--seed 42 \
--reward-shaping
Autoresearch Controller CLI
python3 autoresearch_controller.py \
--trials 100 \
--explore 2.0 \
--agent ppo \
--min-timesteps 5000 \
--max-timesteps 20000 \
--push-every 10
Evaluation / Demo CLI (evaluate_champion.py)
python3 evaluate_champion.py \
--model agent/models/champion/model.zip \
--env donkey-mountain-track-v0 \
--episodes 10
6. Architecture Decisions
Constraints
- MUST: Always call `env.close()` before process exit
- MUST: Save every trained model — never discard
- MUST: Use `evaluate_policy()` from SB3 for evaluation — not a custom loop
- MUST: Append to JSONL results — never overwrite
- MUST: All tests run without a live simulator
- MUST NOT: Use `model.save()` before `model` is defined
- MUST NOT: Run random actions in the production inner loop (this was the original bug)
- MUST NOT: Remove the 2-second cooldown between jobs
- PREFER: PPO over DQN for continuous driving tasks (better suited)
- PREFER: Pure numpy GP over sklearn to avoid dependency issues
- PREFER: Reward shaping enabled by default for speed optimization
- ESCALATE: If DonkeyCar gym API changes break env.reset() or env.step() signatures
- ESCALATE: If simulator port 9091 is unavailable at test time
- ESCALATE: If SB3 model save/load API changes between versions
Known Challenges
- Simulator must be running: All live training requires the DonkeyCar sim on port 9091. Tests must mock this.
- Episode length variance: Episodes end at 100 steps or CTE > 8. Mean reward has high variance across episodes.
- Random seed handling: DonkeyCar gym.reset() signature differs between Gym and Gymnasium versions.
- Model size: PPO models with CNN policy on 120x160x3 images can be large (>100MB). Consider git LFS or exclude from git.
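For the reset-signature difference, one possible shim (helper name hypothetical): Gymnasium's `reset(seed=...)` returns `(obs, info)`, while older Gym's `reset()` returns the observation alone.

```python
def reset_compat(env, seed=None):
    try:
        result = env.reset(seed=seed)
    except TypeError:
        result = env.reset()          # legacy gym: no seed kwarg
    if isinstance(result, tuple):
        return result                 # gymnasium: (obs, info)
    return result, {}                 # legacy gym: wrap bare obs
```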
Rejected Approaches
| Rejected option | Why rejected | Scope |
|---|---|---|
| Random action inner loop | Produces meaningless reward signal — cannot optimize for trained driving | project |
| sklearn GP | Adds sklearn dependency, compatibility issues found previously | project |
| DQN for continuous actions | DQN requires discretized actions, PPO handles continuous natively | project |
| Grid sweep as primary search | Fixed grid misses best regions; GP+UCB finds n_steer=8, n_throttle=5 which was not in grid | project |
| 100/200 trial arbitrary batches | No principled stopping criterion; should use convergence detection instead | project |
| model.save() from legacy training function | `model` was undefined — caused a NameError crash on every run throughout the project's history | project |
7. Phasing
Phase 1: Real Training Foundation (CURRENT — implement first)
Core goal: make the inner loop actually train and save models.
- Rebuild `donkeycar_sb3_runner.py` with real PPO/DQN training + save
- Add speed-aware reward shaping wrapper
- Add proper `evaluate_policy()` evaluation
- Fix autoresearch controller to pass `learning_rate` to the runner
- Add champion model tracking
- Write tests for all core logic
- Re-run autoresearch with real training (50 trials minimum)
Phase 2: Generalization (after Phase 1 champion exists)
Core goal: the champion model drives ANY track.
- Multi-track evaluation script
- Curriculum learning: train on 2+ tracks
- Domain randomization wrapper
- Convergence detection in autoresearch (stop when GP uncertainty collapses)
- Automatic Gitea push every N trials
Phase 3: Racing (after Phase 2 — generalization proven)
Core goal: fastest possible lap times.
- Lap time measurement and logging
- Reward function tuned for pure speed (with safety constraints)
- Fine-tuning from champion checkpoint on new tracks
- Head-to-head comparison: autoresearch champion vs human-tuned config
- Research paper / writeup structure
8. Reference Materials
External Docs
- DonkeyCar Gym: https://github.com/tawnkramer/gym-donkeycar
- Stable-Baselines3: https://stable-baselines3.readthedocs.io/
- Gymnasium migration: https://gymnasium.farama.org/introduction/migration_guide/
Existing Code to Learn From
- `agent/discretize_action.py` — action space wrapper (working, tested in production)
- `agent/autoresearch_controller.py` — GP+UCB loop (working, needs inner loop fix)
- `agent/outerloop-results/clean_sweep_results.jsonl` — 18 records of base data
- `agent/outerloop-results/autoresearch_results.jsonl` — 300 trial records (random policy — useful for discretization insights, NOT for learning_rate tuning)
Anti-patterns (DO NOT REPEAT)
- Calling `model.save()` before `model` is defined — crashes with NameError
- Using `env.action_space.sample()` in the "training" loop — this is random, not RL
- Ignoring the `learning_rate` argument in the runner (was passed but unused for 300 trials)
- Arbitrary trial count limits — use convergence detection instead
- Not calling `env.close()` — causes simulator zombies/hangs
9. Evaluation Design
RL Eval Approach
Unlike software unit tests, RL reward is stochastic. Evaluation strategy:
- Run N_EVAL_EPISODES per trial (default 5)
- Record mean ± std reward
- Champion = highest mean reward across all trials
- Convergence = GP uncertainty (sigma) drops below threshold across all candidates
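A sketch of that convergence test, assuming access to the GP's posterior sigma over the candidate set (names and threshold hypothetical):

```python
import numpy as np

def has_converged(gp_sigma, threshold=0.05):
    # Stop when posterior uncertainty has collapsed across all candidates.
    return bool(np.all(np.asarray(gp_sigma) < threshold))
```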
Test Cases (Simulator-Free)
TC-001: Action Space Encoding
Input: n_steer=5, n_throttle=3 → action index 7
Expected: Decoded to approximately (steer=0.0, throttle=0.5)
Verification: pytest tests/test_discretize_action.py::test_decode_action
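For reference, a hypothetical decode consistent with TC-001, assuming a row-major index over `linspace` grids for steering in [-1, 1] and throttle in [0, 1] (the actual layout in discretize_action.py may differ):

```python
import numpy as np

def decode_action(index, n_steer=5, n_throttle=3):
    steer_vals = np.linspace(-1.0, 1.0, n_steer)       # [-1, -0.5, 0, 0.5, 1]
    throttle_vals = np.linspace(0.0, 1.0, n_throttle)  # [0, 0.5, 1]
    return steer_vals[index // n_throttle], throttle_vals[index % n_throttle]

assert decode_action(7) == (0.0, 0.5)  # index 7 -> (steer=0.0, throttle=0.5)
```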
TC-002: GP Fit and UCB Proposal
Input: 18 data points from clean_sweep_results.jsonl
Expected: GP proposes params with n_steer ∈ [6,9] and lr ∈ [0.001, 0.004] (the high-reward region identified in 300 trials)
Verification: pytest tests/test_autoresearch_controller.py::test_ucb_proposal_in_high_reward_region
TC-003: Param Encoding Round-Trip
Input: {'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.002}
Expected: encode → decode round-trip reproduces exact values (within int rounding)
Verification: pytest tests/test_autoresearch_controller.py::test_param_roundtrip
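A hypothetical encode/decode pair that would satisfy this round-trip (parameter bounds are illustrative, not the project's actual ranges):

```python
BOUNDS = {
    "n_steer": (3, 15),
    "n_throttle": (2, 10),
    "learning_rate": (1e-4, 1e-2),
}

def encode(params):
    # Map each parameter into [0, 1] by its bounds.
    return [(params[k] - lo) / (hi - lo) for k, (lo, hi) in BOUNDS.items()]

def decode(vec):
    # Invert the mapping; integer-valued parameters are rounded.
    out = {}
    for x, (k, (lo, hi)) in zip(vec, BOUNDS.items()):
        v = lo + x * (hi - lo)
        out[k] = round(v) if k.startswith("n_") else v
    return out

decoded = decode(encode({"n_steer": 7, "n_throttle": 3, "learning_rate": 0.002}))
assert decoded["n_steer"] == 7 and decoded["n_throttle"] == 3
assert abs(decoded["learning_rate"] - 0.002) < 1e-12  # exact up to float error
```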
TC-004: Champion Tracking
Input: Trial sequence with rewards [50, 80, 60, 90, 70]
Expected: Champion is updated at trials 1, 2, 4 (rewards 50, 80, 90)
Verification: pytest tests/test_autoresearch_controller.py::test_champion_tracking
TC-005: Runner Exits Cleanly
Input: Mocked gym environment, 100 timesteps, PPO
Expected: Runner completes, calls env.close(), exits with code 0, model.zip exists
Verification: pytest tests/test_runner_integration.py::test_runner_exits_cleanly
Regression Baselines
Saved after Phase 1 completion:
- `best_params_after_300_random_trials.json` — discretization insight baseline
- `champion_reward_phase1.txt` — first real-training champion reward