# Project Specification — DonkeyCar RL Autoresearch

**Version:** 1.0.0
**Date:** 2026-04-13
**Owner:** paulh
**Status:** Active

---

## 1. Project Overview

### What are we building?

An end-to-end autonomous research and training system for DonkeyCar reinforcement learning agents. The system:

1. Trains DQN/PPO RL agents in the DonkeyCar simulator using Stable-Baselines3
2. Saves the best-performing models to disk after every training run
3. Uses a Gaussian Process + UCB Bayesian autoresearch controller to propose and evaluate new hyperparameter configurations, learning from every run
4. Produces a champion model capable of driving a DonkeyCar on any track at maximum speed with minimum cross-track error

The project replaces manual hyperparameter tuning and random grid sweeps with a self-directing autoresearch loop that gets smarter with each trial.

### Why does it matter?

Manual hyperparameter search for RL is slow, expensive, and non-systematic. The DonkeyCar task (fast, stable lap driving that generalizes across tracks) requires careful tuning of the action space, reward function, and learning parameters. A Bayesian autoresearch loop:

- Finds better configurations than grid search with fewer trials
- Discovers non-obvious parameter regions (e.g., n_steer=8, n_throttle=5 emerged from autoresearch, not from the grid)
- Creates a reproducible, logged, version-controlled research artifact
- Enables unattended overnight experimentation with full observability

### Success Criteria

- [ ] Inner loop trains a real PPO/DQN model for a configurable number of timesteps and saves the best model to disk
- [ ] Autoresearch controller proposes hyperparameters using GP+UCB and evaluates trained models (not random policy)
- [ ] Champion model (highest eval reward across all trials) is saved separately and can be loaded for demonstration
- [ ] Champion model can complete at least one lap on the training track with mean_reward > 100
- [ ] Champion model generalizes to at least one unseen track (mean_reward > 50 on eval track)
- [ ] All results are logged, versioned, and pushed to Gitea automatically
- [ ] System can run unattended overnight with zero hangs or zombie processes
- [ ] Full documentation exists: PRD, architecture, decisions, implementation plan, evals

---

## 2. Technical Foundation

### Tech Stack

- **Language:** Python 3.10
- **RL Framework:** Stable-Baselines3 (SB3) — PPO and DQN
- **Simulator:** DonkeyCar Gym (gym_donkeycar) running locally on port 9091
- **Gym Interface:** Gymnasium (gymnasium)
- **Surrogate Model:** Pure numpy Gaussian Process (TinyGP — no sklearn required)
- **Action Wrapper:** Custom DiscretizedActionWrapper (discretize_action.py)
- **Version Control:** Git + Gitea (https://paje.ca/git/paulh/donkeycar-rl-autoresearch)
- **Test Framework:** pytest
- **Logging:** JSON Lines (JSONL) + human-readable log files

### Project Structure

```
donkeycar-rl-autoresearch/
├── AGENT.md                     ← Agent instructions (this harness)
├── PROJECT-SPEC.md              ← This file
├── DECISIONS.md                 ← Architecture Decision Records
├── IMPLEMENTATION_PLAN.md       ← Master task backlog
├── README.md                    ← Project overview
├── .gitignore
├── .harness/
│   ├── EXECUTION_MASTER.md      ← Wave/stream dashboard
│   ├── templates/               ← Harness templates
│   ├── regression-baselines/    ← Saved eval baselines
│   └── <stream-name>/
│       ├── execution-board.md
│       ├── process-eval.md
│       └── validation/
├── agent/
│   ├── autoresearch_controller.py   ← GP+UCB autoresearch loop
│   ├── donkeycar_sb3_runner.py      ← Inner loop: real training + model save
│   ├── donkeycar_outer_loop.py      ← Grid sweep (legacy baseline)
│   ├── discretize_action.py         ← Action space wrapper
│   ├── outerloop-results/
│   │   ├── clean_sweep_results.jsonl    ← Base sweep data (18 records)
│   │   ├── autoresearch_results.jsonl   ← Autoresearch trial results
│   │   └── autoresearch_log.txt         ← Human-readable autoresearch log
│   └── models/
│       ├── champion/            ← Best model across all trials
│       └── trial-<N>/           ← Per-trial saved models
└── tests/
    ├── test_discretize_action.py
    ├── test_autoresearch_controller.py
    └── test_runner_integration.py
```

### Build & Test Commands

```bash
# Run all tests
cd /home/paulh/projects/donkeycar-rl-autoresearch
python3 -m pytest tests/ -v

# Run the autoresearch controller (requires sim running on port 9091)
cd agent && python3 autoresearch_controller.py --trials 50

# Run a single training trial manually
cd agent && python3 donkeycar_sb3_runner.py --agent ppo --timesteps 10000 --eval-episodes 5

# Push results to Gitea
cd /home/paulh/projects/donkeycar-rl-autoresearch && git push
```

### Coding Standards

- All output uses `flush=True` for real-time log visibility
- Every process must call `env.close()` and `time.sleep(2)` before exit (proven zombie prevention)
- All results are appended to JSONL files — never overwritten
- Model saves use `model.save(path)` from the standard SB3 API
- Champion model tracking: autoresearch writes `champion_model_path` to the results JSONL
- No `model.save()` calls on undefined variables — always check the model exists before saving
- Python only — no TypeScript, no Node

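A minimal sketch of the exit pattern these standards mandate (the `try/finally` shape and the function names are illustrative assumptions; the `env.close()` + `time.sleep(2)` sequence and the defined-model check come from the standards above):

```python
import time

def run_trial(make_env, train_fn, save_path="model.zip"):
    """Run one training job with the mandated shutdown sequence."""
    env = make_env()
    model = None
    try:
        model = train_fn(env)  # may raise; model stays None on failure
    finally:
        # Proven zombie prevention: always close the sim connection,
        # then give the simulator a 2-second cooldown before exiting.
        env.close()
        time.sleep(2)
    # Never call model.save() on an undefined/None model
    if model is not None:
        model.save(save_path)
        print(f"model saved to {save_path}", flush=True)  # flush for live logs
    return 0 if model is not None else 1
```
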
---

## 3. Requirements

### Functional Requirements

#### FR-001: Real RL Training in Inner Loop

**Description:** The inner RL runner (`donkeycar_sb3_runner.py`) must actually train a PPO or DQN model using `model.learn(total_timesteps=N)`, not run random actions.

**Acceptance criteria:**

- [ ] Given `--agent ppo --timesteps 10000`, the runner trains a PPO model for 10000 steps
- [ ] Training uses the `learning_rate` argument passed from the autoresearch controller
- [ ] Training uses the discretized action space (n_steer, n_throttle) when DQN is used
- [ ] PPO runs with continuous actions (no discretization needed)
- [ ] Training completes without hanging and exits with code 0

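A sketch of the training core FR-001 demands (`PPO`, `model.learn`, and the `learning_rate` kwarg are standard SB3 API; the env id follows Section 5, and exactly how gym_donkeycar registers its env ids with Gymnasium is an assumption here):

```python
import gym_donkeycar  # noqa: F401 (import registers the donkey-* env ids)
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("donkey-generated-roads-v0")  # sim must already be on port 9091
model = PPO("CnnPolicy", env, learning_rate=3e-4, seed=42, verbose=1)
model.learn(total_timesteps=10_000)  # real training, not action_space.sample()
```
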
#### FR-002: Model Saving

**Description:** After each training run, the trained model is saved to disk.

**Acceptance criteria:**

- [ ] Model saved to `agent/models/trial-<N>/model.zip` after every successful run
- [ ] If eval reward is the best seen so far, the model is also copied to `agent/models/champion/model.zip`
- [ ] Save path is logged to the JSONL results file
- [ ] Model can be loaded with `PPO.load()` or `DQN.load()` for subsequent evaluation

#### FR-003: Real Policy Evaluation

**Description:** After training, the model is evaluated using the learned policy (not random actions).

**Acceptance criteria:**

- [ ] `evaluate_policy(model, env, n_eval_episodes=N)` is used for evaluation
- [ ] Mean reward and std reward are both recorded
- [ ] Evaluation uses the same action wrapper as training
- [ ] Per-episode rewards are printed for full observability

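This evaluation step maps directly onto SB3's `evaluate_policy`; with `return_episode_rewards=True` it returns per-episode lists, which covers the observability criterion. A short sketch, continuing from the `model` and `env` of the FR-001 sketch:

```python
import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy

# Per-episode rewards instead of the default (mean, std) pair
episode_rewards, episode_lengths = evaluate_policy(
    model, env, n_eval_episodes=5, return_episode_rewards=True
)
for i, r in enumerate(episode_rewards):
    print(f"eval episode {i}: reward={r:.1f}", flush=True)

mean_reward = float(np.mean(episode_rewards))
std_reward = float(np.std(episode_rewards))
```
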
#### FR-004: Autoresearch GP+UCB Controller

**Description:** The autoresearch controller proposes hyperparameters using Gaussian Process + UCB acquisition, learning from prior results.

**Acceptance criteria:**

- [ ] Controller loads ALL prior results (base sweep + autoresearch history) at startup
- [ ] GP is fit on encoded (normalized) parameter vectors and corresponding eval rewards
- [ ] UCB acquisition = GP mean + kappa * GP std
- [ ] Next trial parameters maximize UCB over N_CANDIDATES random samples
- [ ] Controller logs top-5 UCB candidates before each trial
- [ ] Controller correctly handles first 2 trials (insufficient data for GP — uses random sampling)

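A pure-numpy sketch of the propose step, standing in for TinyGP (the kernel choice, lengthscale, noise level, and function names are assumptions; the UCB formula and random-candidate maximization match the criteria above):

```python
import numpy as np

def rbf(A, B, lengthscale=0.3):
    """Squared-exponential kernel on normalized parameter vectors."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def propose_next(X, y, n_params, kappa=2.0, n_candidates=2000, noise=1e-4, rng=None):
    """Return the candidate in [0, 1]^n_params maximizing UCB = mu + kappa * sigma."""
    rng = rng or np.random.default_rng()
    cands = rng.random((n_candidates, n_params))
    K = rbf(X, X) + noise * np.eye(len(X))    # training covariance
    alpha = np.linalg.solve(K, y - y.mean())  # K^-1 (y - mean)
    Ks = rbf(cands, X)                        # cross covariance
    mu = y.mean() + Ks @ alpha                # GP posterior mean
    # Posterior variance: k(x, x) - k(x, X) K^-1 k(X, x), with k(x, x) = 1
    v = np.linalg.solve(K, Ks.T)
    sigma = np.sqrt(np.clip(1.0 - np.einsum("ij,ji->i", Ks, v), 1e-12, None))
    ucb = mu + kappa * sigma                  # UCB acquisition
    return cands[np.argmax(ucb)]
```

Per the last criterion, the controller would bypass this path and sample randomly until enough trials exist to fit the GP.
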
#### FR-005: Champion Model Tracking

**Description:** The system maintains a single "champion" model — the best-performing model across all trials.

**Acceptance criteria:**

- [ ] After each trial, if `mean_reward > current_best`, the model is saved as champion
- [ ] Champion metadata (params, reward, trial number, timestamp) saved to `champion_manifest.json`
- [ ] Champion model path is stable: `agent/models/champion/model.zip`
- [ ] Champion can be loaded and demonstrated without retraining

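A sketch of the champion update under the layout in Sections 2 and 4 (the helper name `maybe_update_champion` is hypothetical; the paths and manifest fields come from the spec):

```python
import json
import shutil
from datetime import datetime
from pathlib import Path

CHAMPION_DIR = Path("agent/models/champion")  # stable champion location

def maybe_update_champion(trial, params, mean_reward, trial_model_path, best_so_far):
    """Copy the trial model into the champion slot if it beats the best reward."""
    if mean_reward <= best_so_far:
        return best_so_far
    CHAMPION_DIR.mkdir(parents=True, exist_ok=True)
    shutil.copy2(trial_model_path, CHAMPION_DIR / "model.zip")
    manifest = {
        "trial": trial,
        "timestamp": datetime.now().isoformat(),
        "params": params,
        "mean_reward": mean_reward,
        "model_path": str(CHAMPION_DIR / "model.zip"),
    }
    (CHAMPION_DIR / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return mean_reward
```
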
#### FR-006: Speed-Aware Reward Shaping

**Description:** The reward function incentivizes speed, not just staying on track.

**Acceptance criteria:**

- [ ] Custom reward wrapper computes: `reward = speed * (1 - abs(cte) / max_cte)`
- [ ] Speed and CTE values are accessible from the DonkeyCar info dict
- [ ] Reward wrapper is optional (enabled via `--reward-shaping` flag)
- [ ] Without flag, default DonkeyCar reward is used unchanged

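One way the optional wrapper could look, assuming the Gymnasium five-tuple `step` API and that the info dict exposes `speed` and `cte` as stated above (the class name and the `max_cte` default are assumptions; `max_cte=8.0` mirrors the CTE > 8 episode cutoff noted in Section 6):

```python
import gymnasium as gym

class SpeedAwareReward(gym.Wrapper):
    """Optional shaping: reward = speed * (1 - |cte| / max_cte)."""

    def __init__(self, env, max_cte=8.0):
        super().__init__(env)
        self.max_cte = max_cte

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        speed = info.get("speed", 0.0)
        cte = info.get("cte", 0.0)
        shaped = speed * (1.0 - min(abs(cte) / self.max_cte, 1.0))
        return obs, shaped, terminated, truncated, info
```

Without the `--reward-shaping` flag, the runner would simply skip wrapping the env, leaving the default DonkeyCar reward untouched.
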
#### FR-007: Multi-Track Generalization Evaluation

**Description:** The champion model is evaluated on at least one track it was NOT trained on.

**Acceptance criteria:**

- [ ] Evaluation script accepts `--track` argument to specify the evaluation track
- [ ] Champion model is loaded and evaluated for N episodes on the specified track
- [ ] Results (mean_reward, per-episode rewards) are logged
- [ ] Generalization gap (train_reward - eval_reward) is reported

#### FR-008: Autoresearch Results Logging

**Description:** Every trial produces a complete, structured result record.

**Acceptance criteria:**

- [ ] JSONL record includes: trial_id, timestamp, params, mean_reward, std_reward, model_path, champion_flag, elapsed_sec, run_status
- [ ] Autoresearch log (human-readable) is updated after every trial
- [ ] Results file is never truncated — only appended
- [ ] Results are pushed to Gitea after every N trials (configurable, default 10)

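A sketch of the append-only record write (field names follow this FR's criteria; note the Section 4 example record uses slightly different keys, and the values here are illustrative):

```python
import json
from datetime import datetime

record = {
    "trial_id": 42,
    "timestamp": datetime.now().isoformat(),
    "params": {"agent": "ppo", "learning_rate": 3e-4},
    "mean_reward": 127.45,
    "std_reward": 18.3,
    "model_path": "agent/models/trial-042/model.zip",
    "champion_flag": True,
    "elapsed_sec": 187.4,
    "run_status": "ok",
}

# Append-only: mode 'a' guarantees the results file is never truncated
with open("agent/outerloop-results/autoresearch_results.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```
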
#### FR-009: Unattended Overnight Operation

**Description:** The system runs for 100+ trials without hanging, zombie processes, or data loss.

**Acceptance criteria:**

- [ ] Every job calls `env.close()` before exit
- [ ] 2-second cooldown between jobs prevents race conditions
- [ ] Stale process kill (`pkill -9 -f donkeycar_sb3_runner.py`) before each new job
- [ ] 6-minute timeout per job — killed and logged if exceeded
- [ ] System auto-resumes from existing results if restarted mid-sweep

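These protections compose into a per-job harness along these lines (a sketch; `run_job` is a hypothetical name, while the stale-kill command, cooldown, and 6-minute timeout come from the criteria above):

```python
import subprocess
import time

def run_job(cmd, timeout_sec=360):
    """Launch one training job with hang and zombie protections."""
    # Kill any stale runner left over from a previous job
    subprocess.run(["pkill", "-9", "-f", "donkeycar_sb3_runner.py"], check=False)
    time.sleep(2)  # cooldown between jobs prevents race conditions
    try:
        result = subprocess.run(cmd, timeout=timeout_sec)  # 6-minute hard limit
        return "ok" if result.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        subprocess.run(["pkill", "-9", "-f", "donkeycar_sb3_runner.py"], check=False)
        return "timeout"
```
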
#### FR-010: Test Suite

**Description:** Core logic is covered by automated tests that don't require the simulator.

**Acceptance criteria:**

- [ ] `test_discretize_action.py` — tests action space wrapping correctness
- [ ] `test_autoresearch_controller.py` — tests GP fitting, UCB computation, param encoding/decoding
- [ ] `test_runner_integration.py` — mocked simulator test of the training + save + eval cycle
- [ ] All tests pass with `pytest tests/ -v`
- [ ] No tests require a running simulator

### Non-Functional Requirements

#### NFR-001: Performance

- [ ] Each training trial completes in < 6 minutes for 10000 timesteps
- [ ] GP fitting on 300 data points completes in < 2 seconds
- [ ] System does not consume > 8 GB RAM per trial

#### NFR-002: Robustness

- [ ] Zero hanging jobs across 100 consecutive trials
- [ ] All errors are caught, logged, and do not crash the autoresearch loop
- [ ] System correctly handles sim disconnection and logs the failure

#### NFR-003: Reproducibility

- [ ] All results are version-controlled in Gitea
- [ ] Every trial records the exact parameters used
- [ ] Results are deterministic given the same seed (seed support in the runner)

#### NFR-004: Observability

- [ ] Real-time per-step reward printing during training and evaluation
- [ ] Per-trial summary logged to both console and file
- [ ] Running champion summary printed after every trial

---

## 4. Data Model

### Trial Result Record (JSONL)

```json
{
  "trial": 42,
  "timestamp": "2026-04-13T03:14:15.926535",
  "params": {
    "agent": "ppo",
    "n_steer": 7,
    "n_throttle": 3,
    "learning_rate": 0.0003,
    "timesteps": 10000,
    "eval_episodes": 5,
    "reward_shaping": false
  },
  "mean_reward": 127.45,
  "std_reward": 18.3,
  "model_path": "agent/models/trial-042/model.zip",
  "champion": true,
  "elapsed_sec": 187.4,
  "run_status": "ok"
}
```

### Champion Manifest (`agent/models/champion/manifest.json`)

```json
{
  "trial": 42,
  "timestamp": "2026-04-13T03:14:15.926535",
  "params": { "..." },
  "mean_reward": 127.45,
  "model_path": "agent/models/champion/model.zip"
}
```

### GP State (in-memory, rebuilt each iteration from JSONL)

```
X:  [N, n_params] normalized parameter vectors
y:  [N] normalized mean rewards
GP: TinyGP fitted to (X, y)
```

---

## 5. Interface Design

### Runner CLI (`donkeycar_sb3_runner.py`)

```bash
python3 donkeycar_sb3_runner.py \
  --agent ppo|dqn \
  --env donkey-generated-roads-v0 \
  --timesteps 10000 \
  --eval-episodes 5 \
  --n-steer 7 \
  --n-throttle 3 \
  --learning-rate 0.0003 \
  --save-dir agent/models/trial-042 \
  --seed 42 \
  --reward-shaping
```

### Autoresearch Controller CLI

```bash
python3 autoresearch_controller.py \
  --trials 100 \
  --explore 2.0 \
  --agent ppo \
  --min-timesteps 5000 \
  --max-timesteps 20000 \
  --push-every 10
```

### Evaluation / Demo CLI (`evaluate_champion.py`)

```bash
python3 evaluate_champion.py \
  --model agent/models/champion/model.zip \
  --env donkey-mountain-track-v0 \
  --episodes 10
```

---

## 6. Architecture Decisions

### Constraints

- **MUST:** Always call `env.close()` before process exit
- **MUST:** Save every trained model — never discard
- **MUST:** Use `evaluate_policy()` from SB3 for evaluation — not a custom loop
- **MUST:** Append to JSONL results — never overwrite
- **MUST:** All tests run without a live simulator
- **MUST NOT:** Use `model.save()` before `model` is defined
- **MUST NOT:** Run random actions in the production inner loop (this was the original bug)
- **MUST NOT:** Remove the 2-second cooldown between jobs
- **PREFER:** PPO over DQN for continuous driving tasks (better suited)
- **PREFER:** Pure numpy GP over sklearn to avoid dependency issues
- **PREFER:** Reward shaping enabled by default for speed optimization
- **ESCALATE:** If DonkeyCar gym API changes break `env.reset()` or `env.step()` signatures
- **ESCALATE:** If simulator port 9091 is unavailable at test time
- **ESCALATE:** If the SB3 model save/load API changes between versions

### Known Challenges

1. **Simulator must be running:** All live training requires the DonkeyCar sim on port 9091. Tests must mock this.
2. **Episode length variance:** Episodes end at 100 steps or CTE > 8. Mean reward has high variance across episodes.
3. **Random seed handling:** The DonkeyCar `gym.reset()` signature differs between Gym and Gymnasium versions.
4. **Model size:** PPO models with a CNN policy on 120x160x3 images can be large (>100 MB). Consider git LFS or exclude them from git.

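For challenge 3, a compat shim along these lines keeps the runner working on either API (the helper name is hypothetical; the two signatures are the documented Gym vs. Gymnasium difference):

```python
def reset_env(env, seed=None):
    """Reset across API versions: Gymnasium returns (obs, info) and accepts
    a seed kwarg; classic Gym returns obs alone and takes no seed."""
    try:
        obs, info = env.reset(seed=seed)
    except TypeError:
        obs, info = env.reset(), {}
    return obs, info
```
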
### Rejected Approaches

| Rejected option | Why rejected | Scope |
|-----------------|--------------|-------|
| Random action inner loop | Produces a meaningless reward signal — cannot optimize for trained driving | project |
| sklearn GP | Adds a sklearn dependency; compatibility issues found previously | project |
| DQN for continuous actions | DQN requires discretized actions; PPO handles continuous natively | project |
| Grid sweep as primary search | Fixed grid misses the best regions; GP+UCB found n_steer=8, n_throttle=5, which was not in the grid | project |
| 100/200-trial arbitrary batches | No principled stopping criterion; use convergence detection instead | project |
| `model.save()` from legacy training function | `model` was undefined — caused a NameError crash on every run for the project's entire history | project |

---

## 7. Phasing

### Phase 1: Real Training Foundation (CURRENT — implement first)

Core goal: make the inner loop actually train and save models.

- [ ] Rebuild `donkeycar_sb3_runner.py` with real PPO/DQN training + save
- [ ] Add speed-aware reward shaping wrapper
- [ ] Add proper `evaluate_policy()` evaluation
- [ ] Fix autoresearch controller to pass `learning_rate` to the runner
- [ ] Add champion model tracking
- [ ] Write tests for all core logic
- [ ] Re-run autoresearch with real training (50 trials minimum)

### Phase 2: Generalization (after Phase 1 champion exists)

Core goal: the champion model drives ANY track.

- [ ] Multi-track evaluation script
- [ ] Curriculum learning: train on 2+ tracks
- [ ] Domain randomization wrapper
- [ ] Convergence detection in autoresearch (stop when GP uncertainty collapses)
- [ ] Automatic Gitea push every N trials

### Phase 3: Racing (after Phase 2 — generalization proven)

Core goal: fastest possible lap times.

- [ ] Lap time measurement and logging
- [ ] Reward function tuned for pure speed (with safety constraints)
- [ ] Fine-tuning from the champion checkpoint on new tracks
- [ ] Head-to-head comparison: autoresearch champion vs. human-tuned config
- [ ] Research paper / writeup structure

---

## 8. Reference Materials

### External Docs

- DonkeyCar Gym: https://github.com/tawnkramer/gym-donkeycar
- Stable-Baselines3: https://stable-baselines3.readthedocs.io/
- Gymnasium migration: https://gymnasium.farama.org/introduction/migration_guide/

### Existing Code to Learn From

- `agent/discretize_action.py` — action space wrapper (working, tested in production)
- `agent/autoresearch_controller.py` — GP+UCB loop (working, needs the inner loop fix)
- `agent/outerloop-results/clean_sweep_results.jsonl` — 18 records of base data
- `agent/outerloop-results/autoresearch_results.jsonl` — 300 trial records (random policy — useful for discretization insights, NOT for learning_rate tuning)

### Anti-patterns (DO NOT REPEAT)

- Calling `model.save()` before `model` is defined — crashes with NameError
- Using `env.action_space.sample()` in the "training" loop — this is random, not RL
- Ignoring the `learning_rate` argument in the runner (it was passed but unused for 300 trials)
- Arbitrary trial count limits — use convergence detection instead
- Not calling `env.close()` — causes simulator zombies/hangs

---

## 9. Evaluation Design

### RL Eval Approach

Unlike software unit tests, RL reward is stochastic. Evaluation strategy:

- Run N_EVAL_EPISODES per trial (default 5)
- Record mean ± std reward
- Champion = highest mean reward across all trials
- Convergence = GP uncertainty (sigma) drops below a threshold across all candidates

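The convergence criterion could be checked like this (a sketch; the threshold value is illustrative, and `sigma_candidates` would be the GP predictive std over the current candidate pool, e.g. the `sigma` array from the FR-004 sketch):

```python
import numpy as np

def has_converged(sigma_candidates, threshold=0.05):
    """True once GP predictive uncertainty has collapsed everywhere."""
    return bool(np.max(sigma_candidates) < threshold)
```
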
### Test Cases (Simulator-Free)

#### TC-001: Action Space Encoding

**Input:** n_steer=5, n_throttle=3 → action index 7
**Expected:** Decoded to approximately (steer=0.0, throttle=0.5)
**Verification:** `pytest tests/test_discretize_action.py::test_decode_action`

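The expected values imply a steer-major flat index over evenly spaced grids, as in this sketch (the layout and value ranges are assumptions consistent with TC-001; the real logic lives in `discretize_action.py`):

```python
import numpy as np

def decode_action(index, n_steer=5, n_throttle=3,
                  steer_range=(-1.0, 1.0), throttle_range=(0.0, 1.0)):
    """Map a flat discrete index to a (steer, throttle) pair."""
    steer_idx, throttle_idx = divmod(index, n_throttle)
    steer = np.linspace(*steer_range, n_steer)[steer_idx]
    throttle = np.linspace(*throttle_range, n_throttle)[throttle_idx]
    return float(steer), float(throttle)

assert decode_action(7) == (0.0, 0.5)  # TC-001: index 7 decodes to (0.0, 0.5)
```
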
#### TC-002: GP Fit and UCB Proposal

**Input:** 18 data points from clean_sweep_results.jsonl
**Expected:** GP proposes params with n_steer ∈ [6,9] and lr ∈ [0.001, 0.004] (the high-reward region identified in the 300 trials)
**Verification:** `pytest tests/test_autoresearch_controller.py::test_ucb_proposal_in_high_reward_region`

#### TC-003: Param Encoding Round-Trip

**Input:** `{'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.002}`
**Expected:** encode → decode round-trip reproduces exact values (within int rounding)
**Verification:** `pytest tests/test_autoresearch_controller.py::test_param_roundtrip`

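TC-003's round-trip can be pictured with a min-max normalization like the sketch below (the bounds and the log scale for `learning_rate` are assumptions; the controller's actual ranges live in `autoresearch_controller.py`):

```python
import math

# Illustrative bounds only
BOUNDS = {"n_steer": (3, 12), "n_throttle": (2, 8), "learning_rate": (1e-4, 1e-2)}

def encode(params):
    """Normalize each parameter to [0, 1]; learning_rate on a log10 scale."""
    x = []
    for key, (lo, hi) in BOUNDS.items():
        v = params[key]
        if key == "learning_rate":
            lo, hi, v = math.log10(lo), math.log10(hi), math.log10(v)
        x.append((v - lo) / (hi - lo))
    return x

def decode(x):
    """Inverse of encode; integer parameters are rounded."""
    params = {}
    for (key, (lo, hi)), xi in zip(BOUNDS.items(), x):
        if key == "learning_rate":
            params[key] = 10 ** (math.log10(lo) + xi * (math.log10(hi) - math.log10(lo)))
        else:
            params[key] = round(lo + xi * (hi - lo))
    return params

p = {"n_steer": 7, "n_throttle": 3, "learning_rate": 0.002}
assert decode(encode(p))["n_steer"] == 7  # round-trip within int rounding
```
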
#### TC-004: Champion Tracking

**Input:** Trial sequence with rewards [50, 80, 60, 90, 70]
**Expected:** Champion is updated at trials 1, 2, 4 (rewards 50, 80, 90)
**Verification:** `pytest tests/test_autoresearch_controller.py::test_champion_tracking`

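The test itself could be as small as this (a sketch; `ChampionTracker` is a hypothetical stand-in for however the controller exposes its champion-update rule):

```python
class ChampionTracker:
    """Minimal stand-in for the controller's champion-update logic."""

    def __init__(self):
        self.best = float("-inf")

    def update(self, reward):
        if reward > self.best:
            self.best = reward
            return True
        return False

def test_champion_tracking():
    tracker = ChampionTracker()
    updates = [tracker.update(r) for r in [50, 80, 60, 90, 70]]
    assert updates == [True, True, False, True, False]  # trials 1, 2, 4
    assert tracker.best == 90
```
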
#### TC-005: Runner Exits Cleanly

**Input:** Mocked gym environment, 100 timesteps, PPO
**Expected:** Runner completes, calls `env.close()`, exits with code 0, model.zip exists
**Verification:** `pytest tests/test_runner_integration.py::test_runner_exits_cleanly`

### Regression Baselines

Saved after Phase 1 completion:

- `best_params_after_300_random_trials.json` — discretization insight baseline
- `champion_reward_phase1.txt` — first real training champion reward