
Project Specification — DonkeyCar RL Autoresearch

Version: 1.0.0
Date: 2026-04-13
Owner: paulh
Status: Active


1. Project Overview

What are we building?

An end-to-end autonomous research and training system for DonkeyCar reinforcement learning agents. The system:

  1. Trains DQN/PPO RL agents in the DonkeyCar simulator using Stable-Baselines3
  2. Saves the best-performing models to disk after every training run
  3. Uses a Gaussian Process + UCB Bayesian autoresearch controller to intelligently propose and evaluate new hyperparameter configurations — learning from every run
  4. Produces a champion model capable of driving a DonkeyCar on any track at maximum speed with minimum cross-track error

The project replaces manual hyperparameter tuning and random grid sweeps with a self-directing autoresearch loop that gets smarter with each trial.

Why does it matter?

Manual hyperparameter search for RL is slow, expensive, and non-systematic. The DonkeyCar task (fast, stable lap driving generalizable across tracks) requires careful tuning of the action space, reward function, and learning parameters. A Bayesian autoresearch loop:

  • Finds better configurations than grid search with fewer trials
  • Discovers non-obvious parameter regions (e.g., n_steer=8, n_throttle=5 emerged from autoresearch, not from the grid)
  • Creates a reproducible, logged, version-controlled research artifact
  • Enables unattended overnight experimentation with full observability

Success Criteria

  • Inner loop trains a real PPO/DQN model for a configurable number of timesteps and saves the best model to disk
  • Autoresearch controller proposes hyperparameters using GP+UCB and evaluates trained models (not random policy)
  • Champion model (highest eval reward across all trials) is saved separately and can be loaded for demonstration
  • Champion model can complete at least one lap on the training track with mean_reward > 100
  • Champion model generalizes to at least one unseen track (mean_reward > 50 on eval track)
  • All results are logged, versioned, and pushed to Gitea automatically
  • System can run unattended overnight with zero hangs or zombie processes
  • Full documentation exists: PRD, architecture, decisions, implementation plan, evals

2. Technical Foundation

Tech stack

  • Language: Python 3.10
  • RL Framework: Stable-Baselines3 (SB3) — PPO and DQN
  • Simulator: DonkeyCar Gym (gym_donkeycar) running locally on port 9091
  • Gym Interface: Gymnasium (gymnasium)
  • Surrogate Model: Pure numpy Gaussian Process (TinyGP — no sklearn required)
  • Action Wrapper: Custom DiscretizedActionWrapper (discretize_action.py)
  • Version Control: Git + Gitea (https://paje.ca/git/paulh/donkeycar-rl-autoresearch)
  • Test Framework: pytest
  • Logging: JSON Lines (JSONL) + human-readable log files

Project Structure

donkeycar-rl-autoresearch/
├── AGENT.md                            ← Agent instructions (this harness)
├── PROJECT-SPEC.md                     ← This file
├── DECISIONS.md                        ← Architecture Decision Records
├── IMPLEMENTATION_PLAN.md              ← Master task backlog
├── README.md                           ← Project overview
├── .gitignore
├── .harness/
│   ├── EXECUTION_MASTER.md             ← Wave/stream dashboard
│   ├── templates/                      ← Harness templates
│   ├── regression-baselines/           ← Saved eval baselines
│   └── <stream-name>/
│       ├── execution-board.md
│       ├── process-eval.md
│       └── validation/
├── agent/
│   ├── autoresearch_controller.py      ← GP+UCB autoresearch loop
│   ├── donkeycar_sb3_runner.py         ← Inner loop: real training + model save
│   ├── donkeycar_outer_loop.py         ← Grid sweep (legacy baseline)
│   ├── discretize_action.py            ← Action space wrapper
│   ├── outerloop-results/
│   │   ├── clean_sweep_results.jsonl   ← Base sweep data (18 records)
│   │   ├── autoresearch_results.jsonl  ← Autoresearch trial results
│   │   └── autoresearch_log.txt        ← Human-readable autoresearch log
│   └── models/
│       ├── champion/                   ← Best model across all trials
│       └── trial-<N>/                  ← Per-trial saved models
└── tests/
    ├── test_discretize_action.py
    ├── test_autoresearch_controller.py
    └── test_runner_integration.py

Build & Test Commands

# Run all tests
cd /home/paulh/projects/donkeycar-rl-autoresearch
python3 -m pytest tests/ -v

# Run autoresearch controller (requires sim running on port 9091)
cd agent && python3 autoresearch_controller.py --trials 50

# Run single training trial manually
cd agent && python3 donkeycar_sb3_runner.py --agent ppo --timesteps 10000 --eval-episodes 5

# Check Gitea push
cd /home/paulh/projects/donkeycar-rl-autoresearch && git push

Coding Standards

  • All output uses flush=True for real-time log visibility
  • Every process must call env.close() and time.sleep(2) before exit (proven zombie prevention)
  • All results are appended to JSONL files — never overwritten (a short sketch of this convention follows this list)
  • Model saves use model.save(path) from SB3 standard API
  • Champion model tracking: autoresearch writes champion_model_path to results JSONL
  • No model.save() calls on undefined variables — always check model exists before saving
  • Python only — no TypeScript, no Node
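
A minimal sketch of the append-only JSONL convention above; the `append_result` helper name is hypothetical (the real logging code lives in `autoresearch_controller.py`):

```python
import json

def append_result(path: str, record: dict) -> None:
    """Append one trial record as a single JSON line; never truncate the file."""
    with open(path, "a") as f:              # "a" mode appends, never overwrites
        f.write(json.dumps(record) + "\n")
        f.flush()
    print(f"logged trial {record.get('trial')}", flush=True)  # real-time visibility
```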

3. Requirements

Functional Requirements

FR-001: Real RL Training in Inner Loop

Description: The inner RL runner (donkeycar_sb3_runner.py) must actually train a PPO or DQN model using model.learn(total_timesteps=N), not run random actions.
Acceptance criteria:

  • Given --agent ppo --timesteps 10000, the runner trains a PPO model for 10000 steps
  • Training uses the learning_rate argument passed from the autoresearch controller
  • Training uses the discretized action space (n_steer, n_throttle) when DQN is used
  • PPO runs with continuous actions (no discretization needed)
  • Training completes without hanging and exits with code 0
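
A minimal sketch of the FR-001 training step, assuming SB3's standard API and that gym_donkeycar registers its environments with Gymnasium as the tech stack section implies; the hyperparameter values and save path are illustrative:

```python
import time
import gymnasium as gym
import gym_donkeycar  # noqa: F401 -- assumed to register the donkey-* envs
from stable_baselines3 import PPO

env = gym.make("donkey-generated-roads-v0")
model = PPO("CnnPolicy", env, learning_rate=3e-4, seed=42, verbose=1)
model.learn(total_timesteps=10_000)         # real training, not random actions
model.save("agent/models/trial-042/model")  # SB3 appends .zip automatically
env.close()                                 # MUST: close the env before exit
time.sleep(2)                               # cooldown per the coding standards
```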

FR-002: Model Saving

Description: After each training run, the trained model is saved to disk.
Acceptance criteria:

  • Model saved to agent/models/trial-<N>/model.zip after every successful run
  • If eval reward is the best seen so far, model is also copied to agent/models/champion/model.zip
  • Save path is logged to the JSONL results file
  • Model can be loaded with PPO.load() or DQN.load() for subsequent evaluation

FR-003: Real Policy Evaluation

Description: After training, the model is evaluated using the learned policy (not random actions).
Acceptance criteria:

  • evaluate_policy(model, env, n_eval_episodes=N) is used for evaluation
  • Mean reward and std reward are both recorded
  • Evaluation uses the same action wrapper as training
  • Per-episode rewards are printed for full observability
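
A minimal sketch of FR-003 evaluation using SB3's `evaluate_policy`, continuing from a trained `model` and a wrapped `env` like those in the FR-001 sketch; `return_episode_rewards=True` exposes the per-episode values the criteria call for:

```python
import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy

# env must carry the same action wrapper that was used during training
episode_rewards, _lengths = evaluate_policy(
    model, env, n_eval_episodes=5, return_episode_rewards=True
)
for i, r in enumerate(episode_rewards):
    print(f"eval episode {i}: reward={r:.1f}", flush=True)
mean_reward = float(np.mean(episode_rewards))
std_reward = float(np.std(episode_rewards))
```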

FR-004: Autoresearch GP+UCB Controller

Description: The autoresearch controller proposes hyperparameters using Gaussian Process + UCB acquisition, learning from prior results.
Acceptance criteria:

  • Controller loads ALL prior results (base sweep + autoresearch history) at startup
  • GP is fit on encoded (normalized) parameter vectors and corresponding eval rewards
  • UCB acquisition = GP mean + kappa * GP std (sketched below this list)
  • Next trial parameters maximize UCB over N_CANDIDATES random samples
  • Controller logs top-5 UCB candidates before each trial
  • Controller correctly handles first 2 trials (insufficient data for GP — uses random sampling)
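
A minimal sketch of the GP+UCB proposal step; the `gp.predict(X) -> (mu, sigma)` interface on TinyGP and the `encode`/`sample_params` helpers are assumptions, not the controller's actual signatures:

```python
import numpy as np

def propose_next(gp, encode, sample_params, n_candidates=2000, kappa=2.0):
    """Return the candidate that maximizes UCB = GP mean + kappa * GP std."""
    candidates = [sample_params() for _ in range(n_candidates)]
    X = np.array([encode(p) for p in candidates])  # normalized parameter vectors
    mu, sigma = gp.predict(X)                      # assumed TinyGP interface
    ucb = mu + kappa * sigma
    for i in np.argsort(ucb)[-5:][::-1]:           # log top-5 UCB candidates
        print(f"UCB={ucb[i]:.3f} mu={mu[i]:.3f} sigma={sigma[i]:.3f} "
              f"params={candidates[i]}", flush=True)
    return candidates[int(np.argmax(ucb))]
```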

FR-005: Champion Model Tracking

Description: The system maintains a single "champion" model — the best-performing model across all trials.
Acceptance criteria:

  • After each trial, if mean_reward > current_best, the model is saved as champion
  • Champion metadata (params, reward, trial number, timestamp) saved to agent/models/champion/manifest.json
  • Champion model path is stable: agent/models/champion/model.zip
  • Champion can be loaded and demonstrated without retraining
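
A minimal sketch of FR-005 champion promotion; the `maybe_update_champion` helper is hypothetical, and only the paths and manifest fields come from this spec:

```python
import json
import shutil
from datetime import datetime
from pathlib import Path

def maybe_update_champion(trial, params, mean_reward, model_path, best_reward,
                          champ_dir="agent/models/champion"):
    """Copy the trial model to the stable champion path when it beats the best."""
    if mean_reward <= best_reward:
        return best_reward
    Path(champ_dir).mkdir(parents=True, exist_ok=True)
    shutil.copy2(model_path, f"{champ_dir}/model.zip")
    manifest = {
        "trial": trial,
        "timestamp": datetime.now().isoformat(),
        "params": params,
        "mean_reward": mean_reward,
        "model_path": f"{champ_dir}/model.zip",
    }
    with open(f"{champ_dir}/manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
    return mean_reward
```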

FR-006: Speed-Aware Reward Shaping

Description: The reward function incentivizes speed, not just staying on track.
Acceptance criteria:

  • Custom reward wrapper computes: reward = speed * (1 - abs(cte) / max_cte), as in the wrapper sketch after this list
  • Speed and CTE values are accessible from the DonkeyCar info dict
  • Reward wrapper is optional (enabled via --reward-shaping flag)
  • Without flag, default DonkeyCar reward is used unchanged
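
A minimal sketch of the FR-006 shaping wrapper; the class name is hypothetical, the info-dict keys ("speed", "cte") are assumed from the acceptance criteria above, and max_cte=8.0 mirrors the episode-termination threshold noted under Known Challenges:

```python
import gymnasium as gym

class SpeedCteRewardWrapper(gym.Wrapper):
    """Optional shaping: reward = speed * (1 - abs(cte) / max_cte)."""

    def __init__(self, env, max_cte: float = 8.0):
        super().__init__(env)
        self.max_cte = max_cte

    def step(self, action):
        obs, _reward, terminated, truncated, info = self.env.step(action)
        speed = info.get("speed", 0.0)   # from the DonkeyCar info dict
        cte = info.get("cte", 0.0)
        shaped = speed * (1.0 - min(abs(cte) / self.max_cte, 1.0))
        return obs, shaped, terminated, truncated, info
```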

FR-007: Multi-Track Generalization Evaluation

Description: The champion model is evaluated on at least one track it was NOT trained on.
Acceptance criteria:

  • Evaluation script accepts --track argument to specify evaluation track
  • Champion model is loaded and evaluated for N episodes on the specified track
  • Results (mean_reward, per-episode rewards) are logged
  • Generalization gap (train_reward - eval_reward) is reported

FR-008: Autoresearch Results Logging

Description: Every trial produces a complete, structured result record.
Acceptance criteria:

  • JSONL record includes: trial_id, timestamp, params, mean_reward, std_reward, model_path, champion_flag, elapsed_sec, run_status
  • Autoresearch log (human-readable) is updated after every trial
  • Results file is never truncated — only appended
  • Results are pushed to Gitea after every N trials (configurable, default 10)

FR-009: Unattended Overnight Operation

Description: The system runs for 100+ trials without hanging, zombie processes, or data loss.
Acceptance criteria:

  • Every job calls env.close() before exit
  • 2-second cooldown between jobs prevents race conditions
  • Stale process kill (pkill -9 -f donkeycar_sb3_runner.py) before each new job
  • 6-minute timeout per job — killed and logged if exceeded
  • System auto-resumes from existing results if restarted mid-sweep
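
A minimal sketch of the FR-009 harness around one job; the pkill target, 2-second cooldown, and 6-minute (360 s) timeout come from this spec, while the helper name is illustrative:

```python
import subprocess
import time

def run_trial_process(cmd: list[str], timeout_sec: int = 360) -> str:
    """Run one training job with stale-process kill, hard timeout, and cooldown."""
    # Kill any stale runner left over from a previous job
    subprocess.run(["pkill", "-9", "-f", "donkeycar_sb3_runner.py"], check=False)
    try:
        result = subprocess.run(cmd, timeout=timeout_sec)
        status = "ok" if result.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        print(f"job exceeded {timeout_sec}s timeout, killed", flush=True)
        status = "timeout"
    time.sleep(2)   # cooldown between jobs prevents simulator race conditions
    return status
```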

FR-010: Test Suite

Description: Core logic is covered by automated tests that don't require the simulator.
Acceptance criteria:

  • test_discretize_action.py — tests action space wrapping correctness
  • test_autoresearch_controller.py — tests GP fitting, UCB computation, param encoding/decoding
  • test_runner_integration.py — mocked simulator test of training + save + eval cycle
  • All tests pass with pytest tests/ -v
  • No tests require a running simulator

Non-Functional Requirements

NFR-001: Performance

  • Each training trial completes in < 6 minutes for 10000 timesteps
  • GP fitting on 300 data points completes in < 2 seconds
  • System does not consume > 8GB RAM per trial

NFR-002: Robustness

  • Zero hanging jobs across 100 consecutive trials
  • All errors are caught, logged, and do not crash the autoresearch loop
  • System correctly handles sim disconnection and logs the failure

NFR-003: Reproducibility

  • All results are version-controlled in Gitea
  • Every trial records the exact parameters used
  • Results are deterministic given the same seed (via the runner's --seed argument)

NFR-004: Observability

  • Real-time per-step reward printing during training and evaluation
  • Per-trial summary logged to both console and file
  • Running champion summary printed after every trial

4. Data Model

Trial Result Record (JSONL)

{
  "trial": 42,
  "timestamp": "2026-04-13T03:14:15.926535",
  "params": {
    "agent": "ppo",
    "n_steer": 7,
    "n_throttle": 3,
    "learning_rate": 0.0003,
    "timesteps": 10000,
    "eval_episodes": 5,
    "reward_shaping": false
  },
  "mean_reward": 127.45,
  "std_reward": 18.3,
  "model_path": "agent/models/trial-042/model.zip",
  "champion": true,
  "elapsed_sec": 187.4,
  "run_status": "ok"
}

Champion Manifest (agent/models/champion/manifest.json)

{
  "trial": 42,
  "timestamp": "2026-04-13T03:14:15.926535",
  "params": { "..." },
  "mean_reward": 127.45,
  "model_path": "agent/models/champion/model.zip"
}

GP State (in-memory, rebuilt each iteration from JSONL)

X: [N, n_params]  normalized parameter vectors
y: [N]            normalized mean rewards
GP: TinyGP fitted to (X, y)
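
A minimal sketch of the parameter normalization that produces X; the bounds below are illustrative assumptions, and the real ranges live in autoresearch_controller.py (decode inverts this mapping, as exercised by TC-003):

```python
import numpy as np

# Illustrative bounds (assumptions); the controller defines the real ranges
BOUNDS = {"n_steer": (3, 15), "n_throttle": (2, 8), "learning_rate": (1e-5, 1e-2)}

def encode(params: dict) -> np.ndarray:
    """Normalize each parameter to [0, 1]; log-scale the learning rate."""
    vec = []
    for name, (lo, hi) in BOUNDS.items():
        v = params[name]
        if name == "learning_rate":
            v, lo, hi = np.log10(v), np.log10(lo), np.log10(hi)
        vec.append((v - lo) / (hi - lo))
    return np.array(vec)
```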

5. Interface Design

Runner CLI (donkeycar_sb3_runner.py)

python3 donkeycar_sb3_runner.py \
  --agent ppo|dqn \
  --env donkey-generated-roads-v0 \
  --timesteps 10000 \
  --eval-episodes 5 \
  --n-steer 7 \
  --n-throttle 3 \
  --learning-rate 0.0003 \
  --save-dir agent/models/trial-042 \
  --seed 42 \
  --reward-shaping

Autoresearch Controller CLI

python3 autoresearch_controller.py \
  --trials 100 \
  --explore 2.0 \
  --agent ppo \
  --min-timesteps 5000 \
  --max-timesteps 20000 \
  --push-every 10

Evaluation / Demo CLI (evaluate_champion.py)

python3 evaluate_champion.py \
  --model agent/models/champion/model.zip \
  --env donkey-mountain-track-v0 \
  --episodes 10

6. Architecture Decisions

Constraints

  • MUST: Always call env.close() before process exit
  • MUST: Save every trained model — never discard
  • MUST: Use evaluate_policy() from SB3 for evaluation — not a custom loop
  • MUST: Append to JSONL results — never overwrite
  • MUST: All tests run without a live simulator
  • MUST NOT: Use model.save() before model is defined
  • MUST NOT: Run random actions in production inner loop (this was the original bug)
  • MUST NOT: Remove the 2-second cooldown between jobs
  • PREFER: PPO over DQN for continuous driving tasks (better suited)
  • PREFER: Pure numpy GP over sklearn to avoid dependency issues
  • PREFER: Run with --reward-shaping enabled when optimizing for speed
  • ESCALATE: If DonkeyCar gym API changes break env.reset() or env.step() signatures
  • ESCALATE: If simulator port 9091 is unavailable at test time
  • ESCALATE: If SB3 model save/load API changes between versions

Known Challenges

  1. Simulator must be running: All live training requires the DonkeyCar sim on port 9091. Tests must mock this.
  2. Episode length variance: Episodes end at 100 steps or CTE > 8. Mean reward has high variance across episodes.
  3. Random seed handling: DonkeyCar gym.reset() signature differs between Gym and Gymnasium versions.
  4. Model size: PPO models with CNN policy on 120x160x3 images can be large (>100MB). Consider git LFS or exclude from git.

Rejected Approaches

| Rejected option | Why rejected | Scope |
| --- | --- | --- |
| Random action inner loop | Produces meaningless reward signal — cannot optimize for trained driving | project |
| sklearn GP | Adds sklearn dependency; compatibility issues found previously | project |
| DQN for continuous actions | DQN requires discretized actions; PPO handles continuous natively | project |
| Grid sweep as primary search | Fixed grid misses best regions; GP+UCB finds n_steer=8, n_throttle=5, which was not in the grid | project |
| 100/200-trial arbitrary batches | No principled stopping criterion; should use convergence detection instead | project |
| model.save() from legacy training function | model was undefined — caused NameError crash on every run for entire history | project |

7. Phasing

Phase 1: Real Training Foundation (CURRENT — implement first)

Core goal: make the inner loop actually train and save models.

  • Rebuild donkeycar_sb3_runner.py with real PPO/DQN training + save
  • Add speed-aware reward shaping wrapper
  • Add proper evaluate_policy() evaluation
  • Fix autoresearch controller to pass learning_rate to runner
  • Add champion model tracking
  • Write tests for all core logic
  • Re-run autoresearch with real training (50 trials minimum)

Phase 2: Generalization (after Phase 1 champion exists)

Core goal: the champion model drives ANY track.

  • Multi-track evaluation script
  • Curriculum learning: train on 2+ tracks
  • Domain randomization wrapper
  • Convergence detection in autoresearch (stop when GP uncertainty collapses)
  • Automatic Gitea push every N trials

Phase 3: Racing (after Phase 2 — generalization proven)

Core goal: fastest possible lap times.

  • Lap time measurement and logging
  • Reward function tuned for pure speed (with safety constraints)
  • Fine-tuning from champion checkpoint on new tracks
  • Head-to-head comparison: autoresearch champion vs human-tuned config
  • Research paper / writeup structure

8. Reference Materials

External Docs

Existing Code to Learn From

  • agent/discretize_action.py — action space wrapper (working, tested in production)
  • agent/autoresearch_controller.py — GP+UCB loop (working, needs inner loop fix)
  • agent/outerloop-results/clean_sweep_results.jsonl — 18 records of base data
  • agent/outerloop-results/autoresearch_results.jsonl — 300 trial records (random policy — useful for discretization insights, NOT for learning_rate tuning)

Anti-patterns (DO NOT REPEAT)

  • Calling model.save() before model is defined — crashes with NameError
  • Using env.action_space.sample() in the "training" loop — this is random, not RL
  • Ignoring the learning_rate argument in the runner (was passed but unused for 300 trials)
  • Arbitrary trial count limits — use convergence detection instead
  • Not calling env.close() — causes simulator zombie/hang

9. Evaluation Design

RL Eval Approach

Unlike deterministic software unit tests, RL evaluation rewards are stochastic. The evaluation strategy:

  • Run N_EVAL_EPISODES per trial (default 5)
  • Record mean ± std reward
  • Champion = highest mean reward across all trials
  • Convergence = GP uncertainty (sigma) drops below threshold across all candidates
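
A minimal sketch of that convergence test; the 0.05 threshold is an illustrative assumption:

```python
import numpy as np

def converged(sigma: np.ndarray, threshold: float = 0.05) -> bool:
    """Stop the sweep once GP uncertainty has collapsed across all candidates."""
    return bool(np.max(sigma) < threshold)
```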

Test Cases (Simulator-Free)

TC-001: Action Space Encoding

Input: n_steer=5, n_throttle=3 → action index 7
Expected: Decoded to approximately (steer=0.0, throttle=0.5)
Verification: pytest tests/test_discretize_action.py::test_decode_action
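
A minimal sketch of the decoding TC-001 exercises; the steer-major index layout and linspace ranges are assumptions consistent with the expected values above, not necessarily discretize_action.py's exact implementation:

```python
import numpy as np

def decode_action(index: int, n_steer: int = 5, n_throttle: int = 3):
    """Map a flat action index to (steer, throttle) on evenly spaced grids."""
    steer_idx, throttle_idx = divmod(index, n_throttle)
    steer = float(np.linspace(-1.0, 1.0, n_steer)[steer_idx])
    throttle = float(np.linspace(0.0, 1.0, n_throttle)[throttle_idx])
    return steer, throttle

assert decode_action(7) == (0.0, 0.5)   # index 7 -> steer_idx 2, throttle_idx 1
```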

TC-002: GP Fit and UCB Proposal

Input: 18 data points from clean_sweep_results.jsonl
Expected: GP proposes params with n_steer ∈ [6,9] and lr ∈ [0.001, 0.004] (the high-reward region identified in 300 trials)
Verification: pytest tests/test_autoresearch_controller.py::test_ucb_proposal_in_high_reward_region

TC-003: Param Encoding Round-Trip

Input: {'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.002}
Expected: encode → decode round-trip reproduces exact values (within int rounding)
Verification: pytest tests/test_autoresearch_controller.py::test_param_roundtrip

TC-004: Champion Tracking

Input: Trial sequence with rewards [50, 80, 60, 90, 70]
Expected: Champion is updated at trials 1, 2, 4 (rewards 50, 80, 90)
Verification: pytest tests/test_autoresearch_controller.py::test_champion_tracking

TC-005: Runner Exits Cleanly

Input: Mocked gym environment, 100 timesteps, PPO
Expected: Runner completes, calls env.close(), exits with code 0, model.zip exists
Verification: pytest tests/test_runner_integration.py::test_runner_exits_cleanly

Regression Baselines

Saved after Phase 1 completion:

  • best_params_after_300_random_trials.json — discretization insight baseline
  • champion_reward_phase1.txt — first real training champion reward