
Project Specification — DonkeyCar RL Autoresearch

Version: 1.0.0
Date: 2026-04-13
Owner: paulh
Status: Active


1. Project Overview

What are we building?

An end-to-end autonomous research and training system for DonkeyCar reinforcement learning agents. The system:

  1. Trains DQN/PPO RL agents in the DonkeyCar simulator using Stable-Baselines3
  2. Saves the best-performing models to disk after every training run
  3. Uses a Gaussian Process + UCB Bayesian autoresearch controller to intelligently propose and evaluate new hyperparameter configurations — learning from every run
  4. Produces a champion model capable of driving a DonkeyCar on any track at maximum speed with minimum cross-track error

The project replaces manual hyperparameter tuning and random grid sweeps with a self-directing autoresearch loop that gets smarter with each trial.

Why does it matter?

Manual hyperparameter search for RL is slow, expensive, and non-systematic. The DonkeyCar task (fast, stable lap driving generalizable across tracks) requires careful tuning of the action space, reward function, and learning parameters. A Bayesian autoresearch loop:

  • Finds better configurations than grid search with fewer trials
  • Discovers non-obvious parameter regions (e.g., n_steer=8, n_throttle=5 emerged from autoresearch, not from the grid)
  • Creates a reproducible, logged, version-controlled research artifact
  • Enables unattended overnight experimentation with full observability

Success Criteria

  • Inner loop trains a real PPO/DQN model for a configurable number of timesteps and saves the best model to disk
  • Autoresearch controller proposes hyperparameters using GP+UCB and evaluates trained models (not random policy)
  • Champion model (highest eval reward across all trials) is saved separately and can be loaded for demonstration
  • Champion model can complete at least one lap on the training track with mean_reward > 100
  • Champion model generalizes to at least one unseen track (mean_reward > 50 on eval track)
  • All results are logged, versioned, and pushed to Gitea automatically
  • System can run unattended overnight with zero hangs or zombie processes
  • Full documentation exists: PRD, architecture, decisions, implementation plan, evals

2. Technical Foundation

Tech stack

  • Language: Python 3.10
  • RL Framework: Stable-Baselines3 (SB3) — PPO and DQN
  • Simulator: DonkeyCar Gym (gym_donkeycar) running locally on port 9091
  • Gym Interface: Gymnasium (gymnasium)
  • Surrogate Model: Pure numpy Gaussian Process (TinyGP — no sklearn required)
  • Action Wrapper: Custom DiscretizedActionWrapper (discretize_action.py)
  • Version Control: Git + Gitea (https://paje.ca/git/paulh/donkeycar-rl-autoresearch)
  • Test Framework: pytest
  • Logging: JSON Lines (JSONL) + human-readable log files

Project Structure

donkeycar-rl-autoresearch/
├── AGENT.md                            ← Agent instructions (this harness)
├── PROJECT-SPEC.md                     ← This file
├── DECISIONS.md                        ← Architecture Decision Records
├── IMPLEMENTATION_PLAN.md              ← Master task backlog
├── README.md                           ← Project overview
├── .gitignore
├── .harness/
│   ├── EXECUTION_MASTER.md             ← Wave/stream dashboard
│   ├── templates/                      ← Harness templates
│   ├── regression-baselines/           ← Saved eval baselines
│   └── <stream-name>/
│       ├── execution-board.md
│       ├── process-eval.md
│       └── validation/
├── agent/
│   ├── autoresearch_controller.py      ← GP+UCB autoresearch loop
│   ├── donkeycar_sb3_runner.py         ← Inner loop: real training + model save
│   ├── donkeycar_outer_loop.py         ← Grid sweep (legacy baseline)
│   ├── discretize_action.py            ← Action space wrapper
│   ├── outerloop-results/
│   │   ├── clean_sweep_results.jsonl   ← Base sweep data (18 records)
│   │   ├── autoresearch_results.jsonl  ← Autoresearch trial results
│   │   └── autoresearch_log.txt        ← Human-readable autoresearch log
│   └── models/
│       ├── champion/                   ← Best model across all trials
│       └── trial-<N>/                  ← Per-trial saved models
└── tests/
    ├── test_discretize_action.py
    ├── test_autoresearch_controller.py
    └── test_runner_integration.py

Build & Test Commands

# Run all tests
cd /home/paulh/projects/donkeycar-rl-autoresearch
python3 -m pytest tests/ -v

# Run autoresearch controller (requires sim running on port 9091)
cd agent && python3 autoresearch_controller.py --trials 50

# Run single training trial manually
cd agent && python3 donkeycar_sb3_runner.py --agent ppo --timesteps 10000 --eval-episodes 5

# Check Gitea push
cd /home/paulh/projects/donkeycar-rl-autoresearch && git push

Coding Standards

  • All output uses flush=True for real-time log visibility
  • Every process must call env.close() and time.sleep(2) before exit (proven zombie prevention)
  • All results are appended to JSONL files — never overwritten (a short sketch of this convention follows this list)
  • Model saves use model.save(path) from SB3 standard API
  • Champion model tracking: autoresearch writes champion_model_path to results JSONL
  • No model.save() calls on undefined variables — always check model exists before saving
  • Python only — no TypeScript, no Node
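
A minimal sketch of the append-only JSONL convention above; the `append_result` helper name is hypothetical (the real logging code lives in `autoresearch_controller.py`):

```python
import json

def append_result(path: str, record: dict) -> None:
    """Append one trial record as a single JSON line; never truncate the file."""
    with open(path, "a") as f:              # "a" mode appends, never overwrites
        f.write(json.dumps(record) + "\n")
        f.flush()
    print(f"logged trial {record.get('trial')}", flush=True)  # real-time visibility
```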

3. Requirements

Functional Requirements

FR-001: Real RL Training in Inner Loop

Description: The inner RL runner (donkeycar_sb3_runner.py) must actually train a PPO or DQN model using model.learn(total_timesteps=N), not run random actions.
Acceptance criteria:

  • Given --agent ppo --timesteps 10000, the runner trains a PPO model for 10000 steps
  • Training uses the learning_rate argument passed from the autoresearch controller
  • Training uses the discretized action space (n_steer, n_throttle) when DQN is used
  • PPO runs with continuous actions (no discretization needed)
  • Training completes without hanging and exits with code 0
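
A minimal sketch of the FR-001 training step, assuming SB3's standard API and that gym_donkeycar registers its environments with Gymnasium as the tech stack section implies; the hyperparameter values and save path are illustrative:

```python
import time
import gymnasium as gym
import gym_donkeycar  # noqa: F401 -- assumed to register the donkey-* envs
from stable_baselines3 import PPO

env = gym.make("donkey-generated-roads-v0")
model = PPO("CnnPolicy", env, learning_rate=3e-4, seed=42, verbose=1)
model.learn(total_timesteps=10_000)         # real training, not random actions
model.save("agent/models/trial-042/model")  # SB3 appends .zip automatically
env.close()                                 # MUST: close the env before exit
time.sleep(2)                               # cooldown per the coding standards
```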

FR-002: Model Saving

Description: After each training run, the trained model is saved to disk.
Acceptance criteria:

  • Model saved to agent/models/trial-<N>/model.zip after every successful run
  • If eval reward is the best seen so far, model is also copied to agent/models/champion/model.zip
  • Save path is logged to the JSONL results file
  • Model can be loaded with PPO.load() or DQN.load() for subsequent evaluation

FR-003: Real Policy Evaluation

Description: After training, the model is evaluated using the learned policy (not random actions).
Acceptance criteria:

  • evaluate_policy(model, env, n_eval_episodes=N) is used for evaluation
  • Mean reward and std reward are both recorded
  • Evaluation uses the same action wrapper as training
  • Per-episode rewards are printed for full observability
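
A minimal sketch of FR-003 evaluation using SB3's `evaluate_policy`, continuing from a trained `model` and a wrapped `env` like those in the FR-001 sketch; `return_episode_rewards=True` exposes the per-episode values the criteria call for:

```python
import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy

# env must carry the same action wrapper that was used during training
episode_rewards, _lengths = evaluate_policy(
    model, env, n_eval_episodes=5, return_episode_rewards=True
)
for i, r in enumerate(episode_rewards):
    print(f"eval episode {i}: reward={r:.1f}", flush=True)
mean_reward = float(np.mean(episode_rewards))
std_reward = float(np.std(episode_rewards))
```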

FR-004: Autoresearch GP+UCB Controller

Description: The autoresearch controller proposes hyperparameters using Gaussian Process + UCB acquisition, learning from prior results.
Acceptance criteria:

  • Controller loads ALL prior results (base sweep + autoresearch history) at startup
  • GP is fit on encoded (normalized) parameter vectors and corresponding eval rewards
  • UCB acquisition = GP mean + kappa * GP std (sketched below this list)
  • Next trial parameters maximize UCB over N_CANDIDATES random samples
  • Controller logs top-5 UCB candidates before each trial
  • Controller correctly handles first 2 trials (insufficient data for GP — uses random sampling)
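
A minimal sketch of the GP+UCB proposal step; the `gp.predict(X) -> (mu, sigma)` interface on TinyGP and the `encode`/`sample_params` helpers are assumptions, not the controller's actual signatures:

```python
import numpy as np

def propose_next(gp, encode, sample_params, n_candidates=2000, kappa=2.0):
    """Return the candidate that maximizes UCB = GP mean + kappa * GP std."""
    candidates = [sample_params() for _ in range(n_candidates)]
    X = np.array([encode(p) for p in candidates])  # normalized parameter vectors
    mu, sigma = gp.predict(X)                      # assumed TinyGP interface
    ucb = mu + kappa * sigma
    for i in np.argsort(ucb)[-5:][::-1]:           # log top-5 UCB candidates
        print(f"UCB={ucb[i]:.3f} mu={mu[i]:.3f} sigma={sigma[i]:.3f} "
              f"params={candidates[i]}", flush=True)
    return candidates[int(np.argmax(ucb))]
```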

FR-005: Champion Model Tracking

Description: The system maintains a single "champion" model — the best-performing model across all trials.
Acceptance criteria:

  • After each trial, if mean_reward > current_best, the model is saved as champion
  • Champion metadata (params, reward, trial number, timestamp) saved to agent/models/champion/manifest.json
  • Champion model path is stable: agent/models/champion/model.zip
  • Champion can be loaded and demonstrated without retraining
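
A minimal sketch of FR-005 champion promotion; the `maybe_update_champion` helper is hypothetical, and only the paths and manifest fields come from this spec:

```python
import json
import shutil
from datetime import datetime
from pathlib import Path

def maybe_update_champion(trial, params, mean_reward, model_path, best_reward,
                          champ_dir="agent/models/champion"):
    """Copy the trial model to the stable champion path when it beats the best."""
    if mean_reward <= best_reward:
        return best_reward
    Path(champ_dir).mkdir(parents=True, exist_ok=True)
    shutil.copy2(model_path, f"{champ_dir}/model.zip")
    manifest = {
        "trial": trial,
        "timestamp": datetime.now().isoformat(),
        "params": params,
        "mean_reward": mean_reward,
        "model_path": f"{champ_dir}/model.zip",
    }
    with open(f"{champ_dir}/manifest.json", "w") as f:
        json.dump(manifest, f, indent=2)
    return mean_reward
```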

FR-006: Speed-Aware Reward Shaping

Description: The reward function incentivizes speed, not just staying on track.
Acceptance criteria:

  • Custom reward wrapper computes: reward = speed * (1 - abs(cte) / max_cte), as in the wrapper sketch after this list
  • Speed and CTE values are accessible from the DonkeyCar info dict
  • Reward wrapper is optional (enabled via --reward-shaping flag)
  • Without flag, default DonkeyCar reward is used unchanged
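
A minimal sketch of the FR-006 shaping wrapper; the class name is hypothetical, the info-dict keys ("speed", "cte") are assumed from the acceptance criteria above, and max_cte=8.0 mirrors the episode-termination threshold noted under Known Challenges:

```python
import gymnasium as gym

class SpeedCteRewardWrapper(gym.Wrapper):
    """Optional shaping: reward = speed * (1 - abs(cte) / max_cte)."""

    def __init__(self, env, max_cte: float = 8.0):
        super().__init__(env)
        self.max_cte = max_cte

    def step(self, action):
        obs, _reward, terminated, truncated, info = self.env.step(action)
        speed = info.get("speed", 0.0)   # from the DonkeyCar info dict
        cte = info.get("cte", 0.0)
        shaped = speed * (1.0 - min(abs(cte) / self.max_cte, 1.0))
        return obs, shaped, terminated, truncated, info
```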

FR-007: Multi-Track Generalization Evaluation

Description: The champion model is evaluated on at least one track it was NOT trained on.
Acceptance criteria:

  • Evaluation script accepts --track argument to specify evaluation track
  • Champion model is loaded and evaluated for N episodes on the specified track
  • Results (mean_reward, per-episode rewards) are logged
  • Generalization gap (train_reward - eval_reward) is reported

FR-008: Autoresearch Results Logging

Description: Every trial produces a complete, structured result record.
Acceptance criteria:

  • JSONL record includes: trial_id, timestamp, params, mean_reward, std_reward, model_path, champion_flag, elapsed_sec, run_status
  • Autoresearch log (human-readable) is updated after every trial
  • Results file is never truncated — only appended
  • Results are pushed to Gitea after every N trials (configurable, default 10)

FR-009: Unattended Overnight Operation

Description: The system runs for 100+ trials without hanging, zombie processes, or data loss.
Acceptance criteria:

  • Every job calls env.close() before exit
  • 2-second cooldown between jobs prevents race conditions
  • Stale process kill (pkill -9 -f donkeycar_sb3_runner.py) before each new job
  • 6-minute timeout per job — killed and logged if exceeded
  • System auto-resumes from existing results if restarted mid-sweep
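
A minimal sketch of the FR-009 harness around one job; the pkill target, 2-second cooldown, and 6-minute (360 s) timeout come from this spec, while the helper name is illustrative:

```python
import subprocess
import time

def run_trial_process(cmd: list[str], timeout_sec: int = 360) -> str:
    """Run one training job with stale-process kill, hard timeout, and cooldown."""
    # Kill any stale runner left over from a previous job
    subprocess.run(["pkill", "-9", "-f", "donkeycar_sb3_runner.py"], check=False)
    try:
        result = subprocess.run(cmd, timeout=timeout_sec)
        status = "ok" if result.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        print(f"job exceeded {timeout_sec}s timeout, killed", flush=True)
        status = "timeout"
    time.sleep(2)   # cooldown between jobs prevents simulator race conditions
    return status
```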

FR-010: Test Suite

Description: Core logic is covered by automated tests that don't require the simulator.
Acceptance criteria:

  • test_discretize_action.py — tests action space wrapping correctness
  • test_autoresearch_controller.py — tests GP fitting, UCB computation, param encoding/decoding
  • test_runner_integration.py — mocked simulator test of training + save + eval cycle
  • All tests pass with pytest tests/ -v
  • No tests require a running simulator

Non-Functional Requirements

NFR-001: Performance

  • Each training trial completes in < 6 minutes for 10000 timesteps
  • GP fitting on 300 data points completes in < 2 seconds
  • System does not consume > 8GB RAM per trial

NFR-002: Robustness

  • Zero hanging jobs across 100 consecutive trials
  • All errors are caught, logged, and do not crash the autoresearch loop
  • System correctly handles sim disconnection and logs the failure

NFR-003: Reproducibility

  • All results are version-controlled in Gitea
  • Every trial records the exact parameters used
  • Results are deterministic given the same seed (via the runner's --seed argument)

NFR-004: Observability

  • Real-time per-step reward printing during training and evaluation
  • Per-trial summary logged to both console and file
  • Running champion summary printed after every trial

4. Data Model

Trial Result Record (JSONL)

{
  "trial": 42,
  "timestamp": "2026-04-13T03:14:15.926535",
  "params": {
    "agent": "ppo",
    "n_steer": 7,
    "n_throttle": 3,
    "learning_rate": 0.0003,
    "timesteps": 10000,
    "eval_episodes": 5,
    "reward_shaping": false
  },
  "mean_reward": 127.45,
  "std_reward": 18.3,
  "model_path": "agent/models/trial-042/model.zip",
  "champion": true,
  "elapsed_sec": 187.4,
  "run_status": "ok"
}

Champion Manifest (agent/models/champion/manifest.json)

{
  "trial": 42,
  "timestamp": "2026-04-13T03:14:15.926535",
  "params": { "..." },
  "mean_reward": 127.45,
  "model_path": "agent/models/champion/model.zip"
}

GP State (in-memory, rebuilt each iteration from JSONL)

X: [N, n_params]  normalized parameter vectors
y: [N]            normalized mean rewards
GP: TinyGP fitted to (X, y)
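
A minimal sketch of the parameter normalization that produces X; the bounds below are illustrative assumptions, and the real ranges live in autoresearch_controller.py (decode inverts this mapping, as exercised by TC-003):

```python
import numpy as np

# Illustrative bounds (assumptions); the controller defines the real ranges
BOUNDS = {"n_steer": (3, 15), "n_throttle": (2, 8), "learning_rate": (1e-5, 1e-2)}

def encode(params: dict) -> np.ndarray:
    """Normalize each parameter to [0, 1]; log-scale the learning rate."""
    vec = []
    for name, (lo, hi) in BOUNDS.items():
        v = params[name]
        if name == "learning_rate":
            v, lo, hi = np.log10(v), np.log10(lo), np.log10(hi)
        vec.append((v - lo) / (hi - lo))
    return np.array(vec)
```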

5. Interface Design

Runner CLI (donkeycar_sb3_runner.py)

python3 donkeycar_sb3_runner.py \
  --agent ppo|dqn \
  --env donkey-generated-roads-v0 \
  --timesteps 10000 \
  --eval-episodes 5 \
  --n-steer 7 \
  --n-throttle 3 \
  --learning-rate 0.0003 \
  --save-dir agent/models/trial-042 \
  --seed 42 \
  --reward-shaping

Autoresearch Controller CLI

python3 autoresearch_controller.py \
  --trials 100 \
  --explore 2.0 \
  --agent ppo \
  --min-timesteps 5000 \
  --max-timesteps 20000 \
  --push-every 10

Evaluation / Demo CLI (evaluate_champion.py)

python3 evaluate_champion.py \
  --model agent/models/champion/model.zip \
  --env donkey-mountain-track-v0 \
  --episodes 10

6. Architecture Decisions

Constraints

  • MUST: Always call env.close() before process exit
  • MUST: Save every trained model — never discard
  • MUST: Use evaluate_policy() from SB3 for evaluation — not a custom loop
  • MUST: Append to JSONL results — never overwrite
  • MUST: All tests run without a live simulator
  • MUST NOT: Use model.save() before model is defined
  • MUST NOT: Run random actions in production inner loop (this was the original bug)
  • MUST NOT: Remove the 2-second cooldown between jobs
  • PREFER: PPO over DQN for continuous driving tasks (better suited)
  • PREFER: Pure numpy GP over sklearn to avoid dependency issues
  • PREFER: Run with --reward-shaping enabled when optimizing for speed
  • ESCALATE: If DonkeyCar gym API changes break env.reset() or env.step() signatures
  • ESCALATE: If simulator port 9091 is unavailable at test time
  • ESCALATE: If SB3 model save/load API changes between versions

Known Challenges

  1. Simulator must be running: All live training requires the DonkeyCar sim on port 9091. Tests must mock this.
  2. Episode length variance: Episodes end at 100 steps or CTE > 8. Mean reward has high variance across episodes.
  3. Random seed handling: DonkeyCar gym.reset() signature differs between Gym and Gymnasium versions.
  4. Model size: PPO models with CNN policy on 120x160x3 images can be large (>100MB). Consider git LFS or exclude from git.

Rejected Approaches

| Rejected option | Why rejected | Scope |
| --- | --- | --- |
| Random action inner loop | Produces meaningless reward signal — cannot optimize for trained driving | project |
| sklearn GP | Adds sklearn dependency; compatibility issues found previously | project |
| DQN for continuous actions | DQN requires discretized actions; PPO handles continuous natively | project |
| Grid sweep as primary search | Fixed grid misses best regions; GP+UCB finds n_steer=8, n_throttle=5, which was not in the grid | project |
| 100/200-trial arbitrary batches | No principled stopping criterion; should use convergence detection instead | project |
| model.save() from legacy training function | model was undefined — caused NameError crash on every run for entire history | project |

7. Phasing

Phase 1: Real Training Foundation (CURRENT — implement first)

Core goal: make the inner loop actually train and save models.

  • Rebuild donkeycar_sb3_runner.py with real PPO/DQN training + save
  • Add speed-aware reward shaping wrapper
  • Add proper evaluate_policy() evaluation
  • Fix autoresearch controller to pass learning_rate to runner
  • Add champion model tracking
  • Write tests for all core logic
  • Re-run autoresearch with real training (50 trials minimum)

Phase 2: Generalization (after Phase 1 champion exists)

Core goal: the champion model drives ANY track.

  • Multi-track evaluation script
  • Curriculum learning: train on 2+ tracks
  • Domain randomization wrapper
  • Convergence detection in autoresearch (stop when GP uncertainty collapses)
  • Automatic Gitea push every N trials

Phase 3: Racing (after Phase 2 — generalization proven)

Core goal: fastest possible lap times.

  • Lap time measurement and logging
  • Reward function tuned for pure speed (with safety constraints)
  • Fine-tuning from champion checkpoint on new tracks
  • Head-to-head comparison: autoresearch champion vs human-tuned config
  • Research paper / writeup structure

8. Reference Materials

External Docs

Existing Code to Learn From

  • agent/discretize_action.py — action space wrapper (working, tested in production)
  • agent/autoresearch_controller.py — GP+UCB loop (working, needs inner loop fix)
  • agent/outerloop-results/clean_sweep_results.jsonl — 18 records of base data
  • agent/outerloop-results/autoresearch_results.jsonl — 300 trial records (random policy — useful for discretization insights, NOT for learning_rate tuning)

Anti-patterns (DO NOT REPEAT)

  • Calling model.save() before model is defined — crashes with NameError
  • Using env.action_space.sample() in the "training" loop — this is random, not RL
  • Ignoring the learning_rate argument in the runner (was passed but unused for 300 trials)
  • Arbitrary trial count limits — use convergence detection instead
  • Not calling env.close() — causes simulator zombie/hang

9. Evaluation Design

RL Eval Approach

Unlike deterministic software unit tests, RL evaluation rewards are stochastic. The evaluation strategy:

  • Run N_EVAL_EPISODES per trial (default 5)
  • Record mean ± std reward
  • Champion = highest mean reward across all trials
  • Convergence = GP uncertainty (sigma) drops below threshold across all candidates
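
A minimal sketch of that convergence test; the 0.05 threshold is an illustrative assumption:

```python
import numpy as np

def converged(sigma: np.ndarray, threshold: float = 0.05) -> bool:
    """Stop the sweep once GP uncertainty has collapsed across all candidates."""
    return bool(np.max(sigma) < threshold)
```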

Test Cases (Simulator-Free)

TC-001: Action Space Encoding

Input: n_steer=5, n_throttle=3 → action index 7
Expected: Decoded to approximately (steer=0.0, throttle=0.5)
Verification: pytest tests/test_discretize_action.py::test_decode_action
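
A minimal sketch of the decoding TC-001 exercises; the steer-major index layout and linspace ranges are assumptions consistent with the expected values above, not necessarily discretize_action.py's exact implementation:

```python
import numpy as np

def decode_action(index: int, n_steer: int = 5, n_throttle: int = 3):
    """Map a flat action index to (steer, throttle) on evenly spaced grids."""
    steer_idx, throttle_idx = divmod(index, n_throttle)
    steer = float(np.linspace(-1.0, 1.0, n_steer)[steer_idx])
    throttle = float(np.linspace(0.0, 1.0, n_throttle)[throttle_idx])
    return steer, throttle

assert decode_action(7) == (0.0, 0.5)   # index 7 -> steer_idx 2, throttle_idx 1
```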

TC-002: GP Fit and UCB Proposal

Input: 18 data points from clean_sweep_results.jsonl
Expected: GP proposes params with n_steer ∈ [6,9] and lr ∈ [0.001, 0.004] (the high-reward region identified in 300 trials)
Verification: pytest tests/test_autoresearch_controller.py::test_ucb_proposal_in_high_reward_region

TC-003: Param Encoding Round-Trip

Input: {'n_steer': 7, 'n_throttle': 3, 'learning_rate': 0.002}
Expected: encode → decode round-trip reproduces exact values (within int rounding)
Verification: pytest tests/test_autoresearch_controller.py::test_param_roundtrip

TC-004: Champion Tracking

Input: Trial sequence with rewards [50, 80, 60, 90, 70]
Expected: Champion is updated at trials 1, 2, 4 (rewards 50, 80, 90)
Verification: pytest tests/test_autoresearch_controller.py::test_champion_tracking

TC-005: Runner Exits Cleanly

Input: Mocked gym environment, 100 timesteps, PPO
Expected: Runner completes, calls env.close(), exits with code 0, model.zip exists
Verification: pytest tests/test_runner_integration.py::test_runner_exits_cleanly

Regression Baselines

Saved after Phase 1 completion:

  • best_params_after_300_random_trials.json — discretization insight baseline
  • champion_reward_phase1.txt — first real training champion reward