# Architecture Decision Records — DonkeyCar RL Autoresearch

> One ADR per major non-obvious technical choice.
> Agents read this to avoid re-opening settled decisions.

---
## ADR-001: PPO over DQN as Primary Agent

**Date:** 2026-04-13
**Status:** Accepted

**Context:** DonkeyCar driving is a continuous control problem (steer ∈ [-1, 1], throttle ∈ [0, 1]). DQN requires a discrete action space; we worked around this with DiscretizedActionWrapper. PPO supports continuous action spaces natively.

**Decision:** Use PPO as the primary agent. Keep DQN support for discrete-action experiments.

**Consequences:**

- PPO trains faster on continuous driving tasks (no discretization artifacts)
- No need for DiscretizedActionWrapper with PPO (but keep it for DQN experiments; see the sketch below)
- PPO with CnnPolicy handles raw image observations natively

**Rejected alternatives:**

- DQN only — requires discretization; loses steering resolution
- SAC — a valid alternative, but PPO is simpler and well-tested on DonkeyCar
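
For reference, a hedged sketch of what DiscretizedActionWrapper does. The real wrapper lives in this repo; the grid sizes and class internals shown here are illustrative only.

```python
import numpy as np
import gymnasium as gym


class DiscretizedActionWrapper(gym.ActionWrapper):
    """Map a Discrete index onto the continuous (steer, throttle) box (sketch)."""

    def __init__(self, env, n_steer=8, n_throttle=3):
        super().__init__(env)
        # Evenly spaced grids over steer in [-1, 1] and throttle in [0, 1]
        self.steer_vals = np.linspace(-1.0, 1.0, n_steer)
        self.throttle_vals = np.linspace(0.0, 1.0, n_throttle)
        self.action_space = gym.spaces.Discrete(n_steer * n_throttle)

    def action(self, act):
        # Decode the flat index into a (steer, throttle) pair
        steer = self.steer_vals[act % len(self.steer_vals)]
        throttle = self.throttle_vals[act // len(self.steer_vals)]
        return np.array([steer, throttle], dtype=np.float32)
```

This also illustrates why DQN "loses steering resolution": steering is restricted to n_steer fixed values, while PPO can output any value in the box.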
---
## ADR-002: Pure Numpy GP (TinyGP) over sklearn

**Date:** 2026-04-13
**Status:** Accepted

**Context:** We need a Gaussian Process surrogate model for the autoresearch controller. sklearn.gaussian_process exists but has had compatibility issues with our numpy version.

**Decision:** Use TinyGP — a pure numpy RBF-kernel GP implemented in autoresearch_controller.py (core math sketched below).

**Consequences:**

- No sklearn dependency
- Full control over kernel and noise parameters
- Slightly less optimized than sklearn, but sufficient for < 1000 data points

**Rejected alternatives:**

- sklearn GaussianProcessRegressor — dependency issues
- GPyTorch — overkill; adds a PyTorch dependency
- BoTorch — same
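
The actual TinyGP implementation lives in autoresearch_controller.py. As a reference, here is a minimal sketch of the same RBF-kernel posterior math in pure numpy; function and parameter names are assumptions, not the controller's API.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential kernel matrix between the row vectors of A and B
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_posterior(X_train, y_train, X_test, length_scale=1.0, noise=1e-3):
    """Posterior mean and std at X_test. O(n^3), fine for n < 1000 points."""
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test, length_scale)
    K_ss = rbf_kernel(X_test, X_test, length_scale)
    L = np.linalg.cholesky(K)  # stable inversion via Cholesky factorization
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - (v**2).sum(axis=0)
    return mu, np.sqrt(np.maximum(var, 0.0))
```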
---
## ADR-003: JSONL Append-Only Results

**Date:** 2026-04-13
**Status:** Accepted

**Context:** Results from 300+ trials must be persistent, recoverable, and never lost.

**Decision:** All results are appended to JSONL files. Results files are never truncated or overwritten.

**Consequences:**

- The system can be interrupted and resumed at any point
- Historical data is preserved even if a later trial fails
- Easy to parse with `json.loads(line)` per line (see the sketch below)

**Rejected alternatives:**

- SQLite — adds a dependency; overkill for this volume
- CSV — loses type information, harder to extend
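
A minimal sketch of the read/write discipline this implies; the helper names are illustrative.

```python
import json

def append_result(path, record):
    # Append-only: one JSON object per line, never truncate or rewrite
    with open(path, 'a') as f:
        f.write(json.dumps(record) + '\n')

def load_results(path):
    # Recover all completed trials; tolerate a partially written last line
    results = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                results.append(json.loads(line))
            except json.JSONDecodeError:
                break  # truncated tail from an interrupted write
    return results
```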
---
## ADR-004: GP+UCB Bayesian Optimization for Hyperparameter Search

**Date:** 2026-04-13
**Status:** Accepted

**Context:** We need an intelligent hyperparameter search strategy. Grid search was the starting point, but it misses optimal regions that do not lie on the grid (proven: n_steer=8 was NOT in the original grid of [3, 5, 7]).

**Decision:** Gaussian Process + Upper Confidence Bound (UCB) acquisition. The GP models the reward landscape; UCB balances exploration against exploitation.

**kappa=2.0** default: a reasonable balance; increase it for more exploration (see the acquisition sketch below).

**Consequences:**

- Finds optimal regions with fewer trials than grid search
- Naturally handles continuous parameter spaces (learning_rate ∈ [0.00005, 0.005])
- Requires at least 2 data points before the GP can be fit (random sampling for the first 2 trials)

**Rejected alternatives:**

- Random search — better than grid search, but it does not learn from past trials
- Tree Parzen Estimator (TPE/Optuna) — a valid alternative; adds a dependency
- CMA-ES — better for high-dimensional spaces; our space is 3D, so a GP is sufficient
- Population-Based Training (PBT) — requires parallel simulator instances (we only have 1)
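
A minimal sketch of the acquisition step, assuming a GP posterior like the one sketched in ADR-002; the names are illustrative, not the controller's API.

```python
import numpy as np

def ucb_propose(candidates, mu, std, kappa=2.0):
    """Pick the candidate hyperparameter point with the highest UCB score.

    candidates: (n, d) array of points in the search space; mu/std: GP
    posterior mean and standard deviation at those points. kappa=2.0
    matches the default above; a larger kappa weights exploration more.
    """
    ucb = mu + kappa * std  # optimistic estimate of each candidate's reward
    return candidates[int(np.argmax(ucb))]
```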
---
## ADR-005: No Model Saving Before Model is Defined

**Date:** 2026-04-13
**Status:** Accepted (bug fix — never repeat)

**Context:** The original donkeycar_sb3_runner.py called `model.save(save_path)` after the model training code had been removed. This caused `NameError: name 'model' is not defined` on every single run for 300 trials.

**Decision:** Never call `model.save()` without first verifying that `model` is defined. Training and saving must be atomic — if training fails, no save attempt is made.

**Pattern:**

```python
try:
    # model exists only if construction succeeds; learn and save in the
    # same block so a failure anywhere skips the save entirely
    model = PPO('CnnPolicy', env, ...)
    model.learn(total_timesteps=timesteps)
    model.save(save_path)
except Exception as e:
    log(f'Training failed: {e}')
    sys.exit(102)  # non-zero exit signals the failure
```

**Rejected alternatives:**

- Checking `if 'model' in locals()` before saving — fragile, hides bugs
---
## ADR-006: env.close() + 2-Second Cooldown is Non-Negotiable

**Date:** 2026-04-13
**Status:** Accepted

**Context:** Early in the project, failing to call env.close() between runs left zombie simulator processes that locked up the entire system. With this pattern in place, 20+ consecutive runs work reliably.

**Decision:** Every runner process MUST:

1. Call `env.close()` in a try/except before exit
2. Sleep 2 seconds after close
3. Then exit

This applies even if training or evaluation fails; a minimal sketch follows.

**Rejected alternatives:**

- Relying on Python garbage collection for env cleanup — proven to cause hangs
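
A minimal sketch of the mandated shutdown sequence; the helper name and exit-code plumbing are illustrative.

```python
import sys
import time

def shutdown(env, exit_code=0):
    # Rule 1: close the env inside try/except so a dead sim cannot block exit
    try:
        env.close()
    except Exception as e:
        print(f'env.close() failed: {e}')
    # Rule 2: 2-second cooldown so the simulator finishes tearing down
    time.sleep(2)
    # Rule 3: then, and only then, exit
    sys.exit(exit_code)
```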
---
## ADR-007: PPO with CnnPolicy for Image Observations

**Date:** 2026-04-13
**Status:** Accepted

**Context:** DonkeyCar provides 120x160x3 RGB camera images as observations. The policy must process images.

**Decision:** Use `PPO('CnnPolicy', env, ...)` from SB3. CnnPolicy automatically handles image preprocessing with a CNN feature extractor (see the sketch below).

**Consequences:**

- Larger model than MlpPolicy (image-processing overhead)
- Requires the VecTransposeImage wrapper (SB3 applies this internally)
- Training is slower per step but produces better driving behavior

**Rejected alternatives:**

- MlpPolicy — cannot handle raw image inputs
- Custom CNN — unnecessary complexity given SB3's built-in CnnPolicy
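
A minimal instantiation sketch; the learning rate shown is a placeholder, not a tuned value.

```python
from stable_baselines3 import PPO

def make_model(env, learning_rate=3e-4):
    # CnnPolicy attaches SB3's default NatureCNN feature extractor to the
    # actor-critic heads; SB3 transposes HxWxC image observations internally.
    return PPO('CnnPolicy', env, learning_rate=learning_rate, verbose=1)
```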
---
## ADR-008: All Phases Planned, Phase 1 Executed First

**Date:** 2026-04-13
**Status:** Accepted

**Context:** The user asked whether to implement Phase 1 only or all phases. Three phases were identified:

1. Real Training Foundation
2. Multi-Track Generalization
3. Racing / Speed Optimization

**Decision:** Plan all phases in full documentation; execute Phase 1 first. Do not start Phase 2 until Phase 1 produces a genuine champion model (mean_reward > 100 on the training track). This creates a wave gate between Phase 1 and Phase 2.

**Rationale:** Phases 2 and 3 depend on having a real trained model. Without Phase 1 complete, there is nothing to generalize or optimize for speed.

---
## ADR-009: Tests Must Not Require Live Simulator

**Date:** 2026-04-13
**Status:** Accepted

**Context:** The DonkeyCar simulator must be running on port 9091 for live training. Tests cannot depend on this.

**Decision:** All pytest tests mock the gym environment. Integration tests use a MagicMock gym env that returns fake observations, rewards, and done signals. Only manual/acceptance tests require the live simulator.

**Pattern:**

```python
from unittest.mock import MagicMock, patch

import gymnasium as gym
import numpy as np


@patch('gymnasium.make')
def test_runner_exits_cleanly(mock_make):
    mock_env = MagicMock()
    # Fake camera frames, using the gymnasium reset/step signatures
    mock_env.reset.return_value = (np.zeros((120, 160, 3)), {})
    mock_env.step.return_value = (np.zeros((120, 160, 3)), 1.0, True, False, {})
    mock_env.action_space = gym.spaces.Box(...)
    mock_make.return_value = mock_env
    # ... test runner
```
---
## ADR-010: Warren is an Outdoor/Road Track — Include in Generalization Benchmark

**Date:** 2026-04-12
**Status:** Accepted

**Context:** Warren (UCSD Warren Track v1.0) is under a tent but has proper road geometry: white lane lines, yellow centre dashes, orange traffic cones. Unlike the purely indoor tracks (Robo Racing League, Waveshare, Circuit Launch, Warehouse), which use a carpet or hard floor with painted lines as the road surface, Warren has an actual grass-plus-painted-road layout with genuine road markings.

**Decision:** Warren is classified as a "pseudo-outdoor" track — visually similar to outdoor road tracks despite being sheltered. It is included in the zero-shot test set (alongside mini_monaco) rather than the indoor-skip category.

**Consequence:** The Wave 3 generalization benchmark = 2 held-out tracks: mini_monaco (outdoor trees + fence) + warren (pseudo-outdoor tent + road markings).

---
## ADR-011: Wave 3 Zero-Shot Generalization — Test Tracks Never Used in Training

**Date:** 2026-04-12
**Status:** Accepted

**Context:** Visual overfitting is confirmed — the Phase 2 champion drives only the track it was trained on (generated_road). The CNN learned background-specific features (desert horizon, sky colour) rather than road-invariant features (lane markings, road edges).

**Decision:** Wave 3 uses a strict train/test split:

- **Training tracks:** generated_road, generated_track, mountain_track
- **Test tracks (zero-shot only):** mini_monaco, warren
- **Optimisation target:** `combined_test_score = mini_monaco_mean_reward + warren_mean_reward` (the GP ONLY sees test-track performance — training performance is not the objective)

**Rationale:** This mirrors established domain-generalisation practice. If we trained the GP on training reward, it could find hyperparameters that overfit the training tracks while still failing the test tracks. Only test performance correctly measures generalisation.

**Consequence:** Zero-shot evaluation happens at the end of every trial. If a trial crashes on both test tracks, score = 0, and the GP learns that those hyperparameters don't generalise. A minimal sketch of the objective follows.
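
A minimal sketch of the objective, assuming an `evaluate(model, track)` helper that returns mean episode reward; the helper is an assumption, not the runner's actual API.

```python
def combined_test_score(evaluate, model):
    # Zero-shot objective: sum of mean rewards on the two held-out tracks
    score = 0.0
    for track in ('mini_monaco', 'warren'):
        try:
            score += evaluate(model, track)
        except Exception:
            pass  # a crashed test-track evaluation contributes 0
    return score
```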
---
## ADR-012: Warm-Start from Phase 2 Champion for Wave 3

**Date:** 2026-04-12
**Status:** Accepted

**Context:** Training PPO from scratch across 3 tracks would require ~500k+ timesteps to reach a competent policy. The Phase 2 champion (Trial 20) already drives generated_road well.

**Decision:** All Wave 3 trials warm-start from `models/champion/model.zip` (the Phase 2 champion). `PPO.load(path, env=new_env)` loads the weights; `model.learning_rate` is then overridden with the GP-proposed learning rate. Fall back to a fresh PPO if the load fails (see the sketch below).

**Rationale:** The champion already knows how to follow a road. Warm-starting means Wave 3 only needs to teach *generalisation* — applying the same skill to new visual inputs. This is far more efficient than teaching driving from scratch.

**Risk:** If the champion's policy is over-specialised (e.g., it relies on very specific pixel features of the desert background), warm-starting could hinder generalisation. This is why the GP tunes learning_rate — a higher LR will more aggressively overwrite specialised features.
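
A minimal sketch of this warm-start logic. The ADR overrides `model.learning_rate` after load; the sketch instead passes SB3's `custom_objects` argument, which replaces the stored value at load time and expresses the same intent.

```python
from stable_baselines3 import PPO

def load_warm_start(env, lr, path='models/champion/model.zip'):
    # Prefer the Phase 2 champion weights; fall back to a fresh policy
    try:
        model = PPO.load(path, env=env, custom_objects={'learning_rate': lr})
    except Exception as e:
        print(f'Warm-start failed ({e}); training from scratch')
        model = PPO('CnnPolicy', env, learning_rate=lr, verbose=1)
    return model
```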
|