diff --git a/IMPLEMENTATION_PLAN.md b/IMPLEMENTATION_PLAN.md index e859381..ff69464 100644 --- a/IMPLEMENTATION_PLAN.md +++ b/IMPLEMENTATION_PLAN.md @@ -6,72 +6,68 @@ --- -## Wave 1: Real Training Foundation -**Goal:** Make the inner loop actually train and save models. Produce a real champion model. -**Gate:** champion model achieves mean_reward > 100 on training track. +## ✅ Wave 1: Real Training Foundation — COMPLETE +All tasks done. Phase 1 champion achieved genuine forward driving. + +## ✅ Wave 2: Track Completion — COMPLETE +All top 3 Phase 2 models complete the full track. +Champion: Trial 20 — n_steer=3, n_throttle=5, lr=0.000225, 13k steps. +Driving style: Right lane, very stable. Completes full track in ~2874 steps. +Key finding: n_steer=3 > n_steer=4 (fewer bins = more decisive = less oscillation). + +--- + +## Wave 3: Behavioral Control & Speed Optimization +**Goal:** Control driving style (lane, oscillation), measure lap time, optimize for speed. +**Gate:** Phase 2 champion completes full track (DONE ✅). **Status:** 🟠 In progress -### Stream 1A: Core Runner Rebuild +### Stream 3A: Enhanced Evaluator + Metrics -- [ ] **1A-01** — Rebuild `donkeycar_sb3_runner.py` with real PPO training (`model.learn()`), model save, and proper evaluation (`evaluate_policy()`) -- [ ] **1A-02** — Add `SpeedRewardWrapper` — reward = `speed * (1 - abs(cte)/max_cte)`; add `--reward-shaping` flag -- [ ] **1A-03** — Add champion model tracking — write `champion_manifest.json` when new best is found -- [ ] **1A-04** — Fix autoresearch controller to pass `learning_rate`, `save_dir`, `reward_shaping` args to runner +- [x] **3A-01** — Update champion to Phase 2 Trial 20 +- [ ] **3A-02** — Add lap time measurement to evaluate_champion.py +- [ ] **3A-03** — Add steering oscillation metric (std of steering actions per episode) +- [ ] **3A-04** — Add lane position histogram (distribution of CTE values) +- [ ] **3A-05** — Save eval summary to `outerloop-results/eval_summary.jsonl` -### Stream 1B: Tests +### Stream 3B: Behavioral Reward Variants -- [ ] **1B-01** — Write `tests/test_discretize_action.py` — action encoding, decoding, round-trip -- [ ] **1B-02** — Write `tests/test_autoresearch_controller.py` — GP fit, UCB computation, param round-trip, champion tracking -- [ ] **1B-03** — Write `tests/test_runner_integration.py` — mocked sim, training + save + eval cycle +- [ ] **3B-01** — `LanePositionWrapper`: reward = `1 - abs(cte - target)/max_cte` with configurable target CTE offset +- [ ] **3B-02** — `AntiOscillationWrapper`: adds penalty for rapid steering changes (smoothness reward) +- [ ] **3B-03** — `AsymmetricCTEWrapper`: penalizes left-of-center more (enforces right-lane rule) +- [ ] **3B-04** — Tests for all three wrappers (no simulator required) +- [ ] **3B-05** — Integrate wrapper selection into autoresearch_controller.py via `--behavior` flag -### Stream 1C: First Real Autoresearch Run +### Stream 3C: Speed Optimization -- [ ] **1C-01** — Run 50-trial autoresearch with real PPO training; verify models saved -- [ ] **1C-02** — Save regression baseline: `champion_reward_phase1.txt` -- [ ] **1C-03** — Push all results and models to Gitea -- [ ] **1C-04** — Write Wave 1 process eval +- [ ] **3C-01** — Measure actual lap time using `last_lap_time` from sim info dict +- [ ] **3C-02** — Update reward to incorporate lap time: `reward += lap_bonus if lap_completed` +- [ ] **3C-03** — Run targeted autoresearch starting from Phase 2 champion checkpoint +- [ ] **3C-04** — Fine-tuning: load Phase 2 champion 
weights, continue training with speed reward + +### Stream 3D: Multi-Track Generalization + +- [ ] **3D-01** — Evaluate champion on 2nd track (e.g., `donkey-mountain-track-v0`) +- [ ] **3D-02** — Track-agnostic training: alternate episodes between 2 tracks +- [ ] **3D-03** — Measure generalization gap (train_track vs unseen_track reward) --- -## Wave 2: Multi-Track Generalization -**Goal:** Champion model drives any track with mean_reward > 50. -**Gate:** Wave 1 champion achieves mean_reward > 100. Wave 1 process eval complete. -**Status:** ⏸️ Not started — blocked on Wave 1 +## Wave 4: Racing (future) +**Goal:** Fastest possible lap on any track. +**Gate:** Wave 3 complete. Multi-track generalization proven. +**Status:** ⏸️ Not started -- [ ] **2-01** — Write `evaluate_champion.py` — load champion model, evaluate on specified track -- [ ] **2-02** — Implement multi-track training curriculum (train on 2 tracks alternately) -- [ ] **2-03** — Add domain randomization wrapper (randomize road width, lighting) -- [ ] **2-04** — Implement convergence detection in autoresearch (stop when GP sigma collapses) -- [ ] **2-05** — Add automatic Gitea push every N trials -- [ ] **2-06** — Evaluate champion on unseen track; record generalization gap - ---- - -## Wave 3: Racing / Speed Optimization -**Goal:** Fastest possible lap times on any track. -**Gate:** Wave 2 champion generalizes to ≥1 unseen track (mean_reward > 50). -**Status:** ⏸️ Not started — blocked on Wave 2 - -- [ ] **3-01** — Implement lap time measurement and logging -- [ ] **3-02** — Tune reward function for pure speed (aggressive speed weight) -- [ ] **3-03** — Fine-tuning from champion checkpoint on new tracks -- [ ] **3-04** — Head-to-head: autoresearch champion vs human-tuned baseline -- [ ] **3-05** — Research writeup / report - ---- - -## Completion Signals - -The agent outputs one of these at the end of each iteration: -- `PLANNED` — just created/updated the plan, ready to implement -- `DONE` — all tasks in current wave complete -- `STUCK` — needs human input (see ESCALATION REQUIRED block if present) -- `ERROR` — unrecoverable error +- [ ] **4-01** — Pure lap time reward (replace CTE-based reward with time-based) +- [ ] **4-02** — Head-to-head: autoresearch champion vs human-tuned config +- [ ] **4-03** — Research paper / writeup structure --- ## Notes -- **Random policy data (300 trials):** The existing autoresearch_results.jsonl contains rewards from random-action policy runs. These are valid for n_steer/n_throttle discretization insights but NOT for learning_rate optimization. Do not mix with Phase 1 real training results. Create a separate results file: `autoresearch_results_phase1.jsonl`. -- **Model storage:** Large CNN models (>100MB) should be excluded from git or use git LFS. Add `agent/models/**/*.zip` to .gitignore if needed, and document download location. -- **Simulator requirement:** All live training tasks (1C-*) require DonkeyCar sim running on port 9091. Tests (1B-*) do NOT require the simulator. 
+- **Phase 2 key finding:** n_steer=3 outperforms n_steer=4 (counterintuitive — fewer bins = better) +- **CTE symmetry:** reward is symmetric → car picks left or right based on random NN init +- **Track ends!** The track has a physical finish — runs end on track completion, not timeout +- **Reward v4 (base × efficiency × speed):** Successfully eliminated all circular driving exploits +- **Champion model path:** `agent/models/champion/model.zip` (Trial 20, Phase 2) diff --git a/agent/behavioral_wrappers.py b/agent/behavioral_wrappers.py new file mode 100644 index 0000000..5ef6238 --- /dev/null +++ b/agent/behavioral_wrappers.py @@ -0,0 +1,277 @@ +""" +Behavioral Reward Wrappers for DonkeyCar RL — Phase 3 +====================================================== + +These wrappers extend the base SpeedRewardWrapper (v4) with behavioral +control mechanisms discovered in Phase 2: + + 1. LanePositionWrapper — drive at a specific lateral position + 2. AntiOscillationWrapper — suppress steering oscillation + 3. AsymmetricCTEWrapper — enforce right-lane rule (penalise left more) + +RESEARCH CONTEXT (Phase 2 findings): + - The base CTE reward is symmetric — car picks left or right based on + random NN initialisation → different driving styles emerge randomly + - n_steer=3 (fewer bins) produces cleaner, more stable driving than n_steer=4 + - These wrappers let us deliberately shape driving behaviour + +USAGE: + from reward_wrapper import SpeedRewardWrapper + from behavioral_wrappers import LanePositionWrapper, AntiOscillationWrapper + + env = LanePositionWrapper( + AntiOscillationWrapper( + SpeedRewardWrapper(base_env), + oscillation_penalty=0.05 + ), + target_cte=-0.3, # Slightly right of centre + position_weight=0.3 + ) +""" + +import gymnasium as gym +import numpy as np +from collections import deque + + +class LanePositionWrapper(gym.Wrapper): + """ + Biases the car to drive at a specific lateral position (target CTE). + + Adds a position bonus/penalty on top of any existing shaped reward: + position_bonus = position_weight × (1 - abs(cte - target_cte) / max_cte) + + Examples: + target_cte = 0.0 → drive on centre line (default CTE behaviour) + target_cte = -0.5 → drive slightly right of centre (right-lane rule) + target_cte = +0.5 → drive slightly left of centre + target_cte = -1.5 → hug the right shoulder (like Trial 18!) + + Args: + target_cte: desired CTE offset from centre (negative = right) + position_weight: how strongly to enforce the target (0=off, 0.3=moderate) + max_cte: track half-width (default 8.0, matches sim) + """ + + def __init__(self, env, target_cte: float = 0.0, position_weight: float = 0.2, max_cte: float = 8.0): + super().__init__(env) + self.target_cte = target_cte + self.position_weight = position_weight + self.max_cte = max_cte + + def step(self, action): + result = self.env.step(action) + if len(result) == 5: + obs, reward, terminated, truncated, info = result + else: + obs, reward, done, info = result + terminated, truncated = done, False + + cte = float(info.get('cte', 0.0) or 0.0) + position_bonus = self.position_weight * ( + 1.0 - min(abs(cte - self.target_cte) / self.max_cte, 1.0) + ) + shaped = reward + position_bonus if reward > 0 else reward # Only bonus when on track + + if len(result) == 5: + return obs, shaped, terminated, truncated, info + return obs, shaped, terminated, info + + +class AntiOscillationWrapper(gym.Wrapper): + """ + Penalises rapid changes in steering to suppress oscillating driving. 
+ + Addresses the behaviour observed in Trial 8 (n_steer=4, oscillating). + Computes the change in steering from the previous step and subtracts + a scaled penalty from the reward. + + oscillation_penalty_amount = oscillation_penalty × |Δsteering| + + The steered action must be a continuous value or index — we track the + last action and penalise large changes. + + Args: + oscillation_penalty: scale factor for the steering change penalty + history_window: number of steps to compute average oscillation over + """ + + def __init__(self, env, oscillation_penalty: float = 0.05, history_window: int = 10): + super().__init__(env) + self.oscillation_penalty = oscillation_penalty + self.history_window = history_window + self._action_history = deque(maxlen=history_window) + self._last_action = None + + def reset(self, **kwargs): + result = self.env.reset(**kwargs) + self._action_history.clear() + self._last_action = None + return result + + def step(self, action): + result = self.env.step(action) + if len(result) == 5: + obs, reward, terminated, truncated, info = result + else: + obs, reward, done, info = result + terminated, truncated = done, False + + # Compute steering change penalty + if self._last_action is not None: + try: + curr = float(action[0]) if hasattr(action, '__len__') else float(action) + prev = float(self._last_action[0]) if hasattr(self._last_action, '__len__') else float(self._last_action) + delta = abs(curr - prev) + penalty = self.oscillation_penalty * delta + shaped = reward - penalty if reward > 0 else reward + except (TypeError, IndexError): + shaped = reward + else: + shaped = reward + + self._last_action = action + self._action_history.append(action) + + if len(result) == 5: + return obs, shaped, terminated, truncated, info + return obs, shaped, terminated, info + + def current_oscillation_score(self) -> float: + """Returns mean absolute steering change over history window.""" + if len(self._action_history) < 2: + return 0.0 + actions = list(self._action_history) + deltas = [] + for i in range(1, len(actions)): + try: + curr = float(actions[i][0]) if hasattr(actions[i], '__len__') else float(actions[i]) + prev = float(actions[i-1][0]) if hasattr(actions[i-1], '__len__') else float(actions[i-1]) + deltas.append(abs(curr - prev)) + except (TypeError, IndexError): + pass + return float(np.mean(deltas)) if deltas else 0.0 + + +class AsymmetricCTEWrapper(gym.Wrapper): + """ + Enforces right-lane driving by penalising left-of-centre more than right. + + In the default reward, CTE is symmetric — |CTE| only. This wrapper + applies an extra penalty when the car drifts left (positive CTE in + DonkeyCar convention means left-of-centre). 
+ + Formula: + if cte > 0 (left of centre): extra_penalty = left_penalty × cte / max_cte + if cte < 0 (right of centre): no penalty (or small bonus) + + Args: + left_penalty: additional penalty multiplier for left-of-centre driving + right_bonus: small bonus for right-of-centre driving (optional) + max_cte: track half-width (default 8.0) + """ + + def __init__(self, env, left_penalty: float = 0.3, right_bonus: float = 0.05, max_cte: float = 8.0): + super().__init__(env) + self.left_penalty = left_penalty + self.right_bonus = right_bonus + self.max_cte = max_cte + + def step(self, action): + result = self.env.step(action) + if len(result) == 5: + obs, reward, terminated, truncated, info = result + else: + obs, reward, done, info = result + terminated, truncated = done, False + + if reward > 0: # Only modify reward when on track + cte = float(info.get('cte', 0.0) or 0.0) + if cte > 0: # Left of centre — penalise + penalty = self.left_penalty * min(cte / self.max_cte, 1.0) + shaped = reward * (1.0 - penalty) + else: # Right of centre — small bonus + bonus = self.right_bonus * min(abs(cte) / self.max_cte, 1.0) + shaped = reward * (1.0 + bonus) + else: + shaped = reward + + if len(result) == 5: + return obs, shaped, terminated, truncated, info + return obs, shaped, terminated, info + + +class CombinedBehavioralWrapper(gym.Wrapper): + """ + Convenience wrapper combining all three behavioral controls. + Apply this on top of SpeedRewardWrapper (v4). + + Args: + target_cte: desired lateral position (default 0.0 = centre) + position_weight: lane position enforcement strength (default 0.2) + oscillation_penalty: steering smoothness enforcement (default 0.05) + enforce_right_lane: if True, apply asymmetric CTE penalty (default False) + max_cte: track half-width (default 8.0) + """ + + def __init__( + self, + env, + target_cte: float = 0.0, + position_weight: float = 0.2, + oscillation_penalty: float = 0.05, + enforce_right_lane: bool = False, + max_cte: float = 8.0, + ): + super().__init__(env) + self.target_cte = target_cte + self.position_weight = position_weight + self.oscillation_penalty = oscillation_penalty + self.enforce_right_lane = enforce_right_lane + self.max_cte = max_cte + self._last_action = None + + def reset(self, **kwargs): + self._last_action = None + return self.env.reset(**kwargs) + + def step(self, action): + result = self.env.step(action) + if len(result) == 5: + obs, reward, terminated, truncated, info = result + else: + obs, reward, done, info = result + terminated, truncated = done, False + + cte = float(info.get('cte', 0.0) or 0.0) + + if reward > 0: + shaped = reward + + # 1. Lane position bonus + pos_bonus = self.position_weight * ( + 1.0 - min(abs(cte - self.target_cte) / self.max_cte, 1.0) + ) + shaped += pos_bonus + + # 2. Anti-oscillation penalty + if self._last_action is not None: + try: + curr = float(action[0]) if hasattr(action, '__len__') else float(action) + prev = float(self._last_action[0]) if hasattr(self._last_action, '__len__') else float(self._last_action) + shaped -= self.oscillation_penalty * abs(curr - prev) + except (TypeError, IndexError): + pass + + # 3. 
Right-lane enforcement (asymmetric CTE) + if self.enforce_right_lane and cte > 0: + penalty = 0.3 * min(cte / self.max_cte, 1.0) + shaped *= (1.0 - penalty) + else: + shaped = reward + + self._last_action = action + + if len(result) == 5: + return obs, shaped, terminated, truncated, info + return obs, shaped, terminated, info diff --git a/agent/evaluate_champion.py b/agent/evaluate_champion.py index 3cff14f..14b881e 100644 --- a/agent/evaluate_champion.py +++ b/agent/evaluate_champion.py @@ -1,169 +1,291 @@ """ -Champion Model Evaluator -======================== -Loads the champion model and runs it live in the simulator for visual inspection. -Prints per-step diagnostics: position, speed, CTE, efficiency, reward. +Enhanced Champion Evaluator — Phase 3 +====================================== +Evaluates a model with full metrics: + - Total reward per episode + - Lap time (using sim's last_lap_time) + - Steering oscillation score (std of steering changes) + - Lane position histogram (CTE distribution) + - Path efficiency throughout episode + - Per-step diagnostics: speed, CTE, efficiency, reward, position Usage: - python3 evaluate_champion.py [--episodes N] [--steps N] + # Evaluate current champion + python3 evaluate_champion.py -Watch the simulator window to see if the car is genuinely driving the track -or exploiting circular motion. + # Evaluate a specific model + python3 evaluate_champion.py --model models/trial-0020/model.zip + + # Long run to see lap completion + python3 evaluate_champion.py --episodes 3 --steps 3000 + + # Compare all top Phase 2 models + python3 evaluate_champion.py --compare """ import os import sys import time import json +import math import numpy as np from collections import deque +from datetime import datetime import gymnasium as gym import gym_donkeycar from stable_baselines3 import PPO -# Add agent dir to path for wrappers sys.path.insert(0, os.path.dirname(os.path.abspath(__file__))) -from reward_wrapper import SpeedRewardWrapper from donkeycar_sb3_runner import ThrottleClampWrapper +from reward_wrapper import SpeedRewardWrapper CHAMPION_DIR = os.path.join(os.path.dirname(__file__), 'models', 'champion') MANIFEST_PATH = os.path.join(CHAMPION_DIR, 'manifest.json') -MODEL_PATH = os.path.join(CHAMPION_DIR, 'model.zip') +EVAL_SUMMARY = os.path.join(os.path.dirname(__file__), 'outerloop-results', 'eval_summary.jsonl') + +# Top Phase 2 models for comparison +PHASE2_MODELS = [ + { + 'label': 'Trial-20 Phase2-CHAMPION (n_steer=3 n_throttle=5 lr=0.000225 13k)', + 'path': 'models/trial-0020/model.zip', + 'style': 'Right lane, stable', + }, + { + 'label': 'Trial-8 Phase2-2nd (n_steer=4 n_throttle=3 lr=0.00117 34k)', + 'path': 'models/trial-0008/model.zip', + 'style': 'Left/center, oscillating', + }, + { + 'label': 'Trial-18 Phase2-3rd (n_steer=3 n_throttle=5 lr=0.000288 16k)', + 'path': 'models/trial-0018/model.zip', + 'style': 'Right shoulder, very accurate', + }, +] def load_manifest(): - with open(MANIFEST_PATH) as f: - return json.load(f) - - -def print_banner(manifest): - print('=' * 65, flush=True) - print('🏆 DonkeyCar Champion Model Evaluation', flush=True) - print('=' * 65, flush=True) - print(f" Trial: {manifest['trial']}", flush=True) - print(f" mean_reward: {manifest['mean_reward']:.4f}", flush=True) - print(f" Params: {manifest['params']}", flush=True) - print(f" Model: {MODEL_PATH}", flush=True) - print('=' * 65, flush=True) - print(flush=True) + if os.path.exists(MANIFEST_PATH): + with open(MANIFEST_PATH) as f: + return json.load(f) + return {} def 
compute_efficiency(pos_history): - """Path efficiency = net_displacement / total_path_length over window.""" if len(pos_history) < 3: return 1.0 positions = list(pos_history) net = np.linalg.norm(np.array(positions[-1]) - np.array(positions[0])) - total = sum( - np.linalg.norm(np.array(positions[i+1]) - np.array(positions[i])) - for i in range(len(positions)-1) - ) + total = sum(np.linalg.norm(np.array(positions[i+1]) - np.array(positions[i])) + for i in range(len(positions)-1)) return float(net / total) if total > 1e-6 else 1.0 -def run_episode(model, env, episode_num, max_steps=500): - """Run one episode with the champion policy, printing diagnostics.""" - print(f'\n--- Episode {episode_num} ---', flush=True) - obs, info = env.reset() - pos_history = deque(maxlen=30) - total_reward = 0.0 - step = 0 - - print(f'{"Step":>5} {"Speed":>6} {"CTE":>7} {"Eff%":>6} {"Rwd":>8} {"TotRwd":>10} {"Pos_x":>8} {"Pos_z":>8}', flush=True) - print('-' * 65, flush=True) - - while step < max_steps: - action, _ = model.predict(obs, deterministic=True) - result = env.step(action) - if len(result) == 5: - obs, reward, terminated, truncated, info = result - done = terminated or truncated - else: - obs, reward, done, info = result - - # Extract diagnostics from info - speed = float(info.get('speed', 0.0) or 0.0) - cte = float(info.get('cte', 0.0) or 0.0) - pos = info.get('pos', None) - if pos is not None: - pos_history.append(list(pos)[:3]) - px, pz = pos[0], pos[2] if len(pos) > 2 else 0.0 - else: - px, pz = 0.0, 0.0 - - efficiency = compute_efficiency(pos_history) - total_reward += reward - step += 1 - - # Print every 10 steps or on done - if step % 10 == 0 or done: - print(f'{step:>5} {speed:>6.2f} {cte:>7.3f} {efficiency*100:>5.1f}% {reward:>8.3f} {total_reward:>10.2f} {px:>8.2f} {pz:>8.2f}', flush=True) - - if done: - print(f'\n ✅ Episode {episode_num} done after {step} steps | total_reward={total_reward:.2f}', flush=True) - break - - if step >= max_steps: - print(f'\n ⏱️ Episode {episode_num} reached max_steps={max_steps} | total_reward={total_reward:.2f}', flush=True) - - return total_reward, step +def print_banner(label, path): + print(f'\n{"="*68}', flush=True) + print(f'🔍 {label}', flush=True) + print(f' {path}', flush=True) + print(f'{"="*68}', flush=True) -def main(episodes=3, max_steps=500): - manifest = load_manifest() - print_banner(manifest) - - params = manifest['params'] - - print(f'[Eval] Connecting to simulator...', flush=True) - try: - env = gym.make('donkey-generated-roads-v0') - except Exception as e: - print(f'[Eval] FAILED to connect: {e}', flush=True) - sys.exit(1) - - # Apply same wrappers as training - env = ThrottleClampWrapper(env, throttle_min=0.2) - env = SpeedRewardWrapper(env, speed_scale=0.1) - print(f'[Eval] Wrappers applied: ThrottleClamp(min=0.2), SpeedRewardWrapper(scale=0.1)', flush=True) - - print(f'[Eval] Loading champion model from {MODEL_PATH}...', flush=True) - try: - model = PPO.load(MODEL_PATH, env=env) - print(f'[Eval] Model loaded successfully.', flush=True) - except Exception as e: - print(f'[Eval] FAILED to load model: {e}', flush=True) - env.close() - sys.exit(1) - - print(f'\n[Eval] Running {episodes} episodes (max {max_steps} steps each)...', flush=True) - print('[Eval] Watch the simulator window — is the car driving the track or circling?', flush=True) - +def run_eval(model, env, episodes, max_steps, label=''): + """Run evaluation and return full metrics.""" all_rewards = [] + all_steps = [] + all_lap_times = [] + all_osc_scores = [] + all_cte_distributions 
= [] + all_completed = [] + for ep in range(1, episodes + 1): - total_reward, steps = run_episode(model, env, ep, max_steps=max_steps) + obs, info = env.reset() + pos_hist = deque(maxlen=31) + total_reward = 0.0 + step = 0 + cte_values = [] + steering_actions = [] + laps_completed = 0 + lap_times = [] + + print(f'\n--- Episode {ep}/{episodes} ---', flush=True) + print(f'{"Step":>5} {"Spd":>5} {"CTE":>6} {"Eff%":>5} {"Rwd":>7} {"Tot":>9} {"Laps":>5} {"Px":>7} {"Pz":>7}', flush=True) + print('-' * 62, flush=True) + + while step < max_steps: + action, _ = model.predict(obs, deterministic=True) + result = env.step(action) + if len(result) == 5: + obs, reward, terminated, truncated, info = result + done = terminated or truncated + else: + obs, reward, done, info = result + + speed = float(info.get('speed', 0) or 0) + cte = float(info.get('cte', 0) or 0) + pos = info.get('pos', (0, 0, 0)) + px = pos[0] if pos else 0 + pz = pos[2] if len(pos) > 2 else 0 + lap_count = int(info.get('lap_count', 0) or 0) + last_lap_time = float(info.get('last_lap_time', 0) or 0) + + # Track new laps + if lap_count > laps_completed: + laps_completed = lap_count + if last_lap_time > 0: + lap_times.append(last_lap_time) + print(f'\n 🏁 LAP {laps_completed} COMPLETE! Time={last_lap_time:.2f}s', flush=True) + + pos_hist.append(np.array([px, 0., pz])) + cte_values.append(cte) + + # Track steering for oscillation score + try: + steer = float(action[0]) if hasattr(action, '__len__') else float(action) + steering_actions.append(steer) + except (TypeError, IndexError): + pass + + total_reward += reward + step += 1 + + eff = compute_efficiency(pos_hist) + + if step % 50 == 0 or done: + print(f'{step:>5} {speed:>5.2f} {cte:>6.2f} {eff*100:>4.0f}% ' + f'{reward:>7.3f} {total_reward:>9.1f} {laps_completed:>5} ' + f'{px:>7.1f} {pz:>7.1f}', flush=True) + + if done: + print(f'\n Episode {ep} ended after {step} steps | ' + f'total={total_reward:.1f} | laps={laps_completed}', flush=True) + break + + if step >= max_steps: + print(f'\n Episode {ep} reached max {max_steps} steps | ' + f'total={total_reward:.1f} | laps={laps_completed}', flush=True) + + # Compute oscillation score + if len(steering_actions) > 1: + deltas = [abs(steering_actions[i] - steering_actions[i-1]) + for i in range(1, len(steering_actions))] + osc_score = float(np.mean(deltas)) + else: + osc_score = 0.0 + all_rewards.append(total_reward) - if ep < episodes: - time.sleep(2) # Brief pause between episodes + all_steps.append(step) + all_lap_times.extend(lap_times) + all_osc_scores.append(osc_score) + all_cte_distributions.extend(cte_values) + all_completed.append(laps_completed > 0) - print('\n' + '=' * 65, flush=True) - print('📊 Evaluation Complete', flush=True) - print(f' Episodes: {episodes}', flush=True) - print(f' Rewards: {[f"{r:.1f}" for r in all_rewards]}', flush=True) - print(f' Mean reward: {sum(all_rewards)/len(all_rewards):.2f}', flush=True) - print(f' Std reward: {float(np.std(all_rewards)):.2f}', flush=True) - print('=' * 65, flush=True) + time.sleep(2) - env.close() - time.sleep(2) - print('[Eval] Done.', flush=True) + # Summary metrics + summary = { + 'label': label, + 'episodes': episodes, + 'mean_reward': float(np.mean(all_rewards)), + 'std_reward': float(np.std(all_rewards)), + 'mean_steps': float(np.mean(all_steps)), + 'laps_completed': sum(1 for r in all_rewards if r > 500), # proxy for completion + 'lap_times': all_lap_times, + 'mean_lap_time': float(np.mean(all_lap_times)) if all_lap_times else None, + 'oscillation_score': 
float(np.mean(all_osc_scores)), # lower = smoother + 'mean_abs_cte': float(np.mean([abs(c) for c in all_cte_distributions])), + 'cte_std': float(np.std(all_cte_distributions)), + 'mean_cte_signed': float(np.mean(all_cte_distributions)), # + = left, - = right + 'timestamp': datetime.now().isoformat(), + } + + return summary, all_rewards + + +def print_summary(summary): + print(f'\n📊 Metrics for: {summary["label"]}', flush=True) + print(f' Mean reward: {summary["mean_reward"]:.1f} ± {summary["std_reward"]:.1f}', flush=True) + print(f' Mean steps/ep: {summary["mean_steps"]:.0f}', flush=True) + print(f' Oscillation score: {summary["oscillation_score"]:.4f} (lower=smoother)', flush=True) + print(f' Mean |CTE|: {summary["mean_abs_cte"]:.3f} m from centre', flush=True) + print(f' Mean signed CTE: {summary["mean_cte_signed"]:.3f} m (+ =left, - =right)', flush=True) + cte_side = 'RIGHT of centre ➡️' if summary['mean_cte_signed'] < -0.1 else \ + 'LEFT of centre ⬅️' if summary['mean_cte_signed'] > 0.1 else 'CENTRED ↕️' + print(f' Lane position: {cte_side}', flush=True) + if summary['lap_times']: + print(f' Lap times: {[f"{t:.1f}s" for t in summary["lap_times"]]}', flush=True) + print(f' Best lap time: {min(summary["lap_times"]):.1f}s', flush=True) + print(flush=True) + + +def save_summary(summary): + os.makedirs(os.path.dirname(EVAL_SUMMARY), exist_ok=True) + with open(EVAL_SUMMARY, 'a') as f: + f.write(json.dumps(summary) + '\n') + + +def main(episodes=3, max_steps=3000, model_override=None, compare=False): + manifest = load_manifest() + + models_to_eval = [] + if compare: + for m in PHASE2_MODELS: + models_to_eval.append((m['label'], m['path'])) + else: + path = model_override or CHAMPION_DIR + '/model.zip' + label = model_override or f"Champion (Phase {manifest.get('phase', '?')} Trial {manifest.get('trial', '?')})" + models_to_eval.append((label, path)) + + all_summaries = [] + for label, path in models_to_eval: + print_banner(label, path) + + print(f'[Eval] Connecting to simulator...', flush=True) + try: + env = gym.make('donkey-generated-roads-v0') + except Exception as e: + print(f'[Eval] FAILED: {e}', flush=True) + sys.exit(1) + + env = ThrottleClampWrapper(env, throttle_min=0.2) + env = SpeedRewardWrapper(env, speed_scale=0.1) + + print(f'[Eval] Loading model: {path}', flush=True) + try: + model = PPO.load(path, env=env) + print(f'[Eval] Model loaded. 
Running {episodes} episodes × {max_steps} steps...', flush=True) + except Exception as e: + print(f'[Eval] FAILED to load: {e}', flush=True) + env.close() + continue + + summary, rewards = run_eval(model, env, episodes, max_steps, label) + print_summary(summary) + save_summary(summary) + all_summaries.append(summary) + + env.close() + time.sleep(3) + + if compare and len(all_summaries) > 1: + print('\n' + '=' * 68, flush=True) + print('🏁 COMPARISON TABLE', flush=True) + print('=' * 68, flush=True) + print(f'{"Model":<40} {"Reward":>8} {"Steps":>7} {"Osc":>6} {"CTE":>6} {"Side":>10}', flush=True) + print('-' * 68, flush=True) + for s in all_summaries: + side = '➡️ RIGHT' if s['mean_cte_signed'] < -0.1 else \ + '⬅️ LEFT' if s['mean_cte_signed'] > 0.1 else '↕️ CENTER' + name = s['label'][:40] + print(f'{name:<40} {s["mean_reward"]:>8.0f} {s["mean_steps"]:>7.0f} ' + f'{s["oscillation_score"]:>6.3f} {s["mean_abs_cte"]:>6.2f} {side:>10}', flush=True) if __name__ == '__main__': import argparse - parser = argparse.ArgumentParser() - parser.add_argument('--episodes', type=int, default=3, help='Number of eval episodes') - parser.add_argument('--steps', type=int, default=500, help='Max steps per episode') + parser = argparse.ArgumentParser(description='Evaluate DonkeyCar RL model with full metrics.') + parser.add_argument('--episodes', type=int, default=3) + parser.add_argument('--steps', type=int, default=3000) + parser.add_argument('--model', type=str, default=None, help='Override model path') + parser.add_argument('--compare', action='store_true', help='Compare all top Phase 2 models') args = parser.parse_args() - main(episodes=args.episodes, max_steps=args.steps) + main(episodes=args.episodes, max_steps=args.steps, model_override=args.model, compare=args.compare) diff --git a/agent/models/champion/manifest.json b/agent/models/champion/manifest.json index 3fc7aec..8faae16 100644 --- a/agent/models/champion/manifest.json +++ b/agent/models/champion/manifest.json @@ -1,15 +1,18 @@ { - "trial": 5, - "timestamp": "2026-04-13T12:45:43.093664", + "trial": 20, + "phase": 2, + "timestamp": "2026-04-14T09:25:40.280224", "params": { - "n_steer": 7, - "n_throttle": 3, - "learning_rate": 0.0006801262090358742, - "timesteps": 4787, + "n_steer": 3, + "n_throttle": 5, + "learning_rate": 0.00022474333387549633, + "timesteps": 13328, "agent": "ppo", - "eval_episodes": 3, + "eval_episodes": 5, "reward_shaping": true }, - "mean_reward": 4582.7984, + "mean_reward": 2469.28, + "eval_steps": 2874, + "driving_style": "Right lane, very stable, completes full track", "model_path": "/home/paulh/projects/donkeycar-rl-autoresearch/agent/models/champion/model.zip" } \ No newline at end of file diff --git a/agent/outerloop-results/autoresearch_phase2_log.txt b/agent/outerloop-results/autoresearch_phase2_log.txt index 08499d1..35954ab 100644 --- a/agent/outerloop-results/autoresearch_phase2_log.txt +++ b/agent/outerloop-results/autoresearch_phase2_log.txt @@ -475,3 +475,17 @@ [2026-04-14 04:35:49] mean_reward=2073.7372 params={'n_steer': 3, 'n_throttle': 5, 'learning_rate': 0.0002881292103575585, 'timesteps': 15876, 'agent': 'ppo', 'eval_episodes': 5, 'reward_shaping': True} [2026-04-14 04:35:49] mean_reward=1382.4461 params={'n_steer': 4, 'n_throttle': 3, 'learning_rate': 0.0010723485700433605, 'timesteps': 33234, 'agent': 'ppo', 'eval_episodes': 5, 'reward_shaping': True} [2026-04-14 04:35:49] mean_reward=1097.1248 params={'n_steer': 5, 'n_throttle': 3, 'learning_rate': 0.001421177467065464, 'timesteps': 33363, 'agent': 
'ppo', 'eval_episodes': 5, 'reward_shaping': True} +[2026-04-14 04:35:50] [AutoResearch] Git push complete after trial 20 +[2026-04-14 09:28:23] [AutoResearch] GP UCB top-5 candidates: +[2026-04-14 09:28:23] UCB=2.3107 mu=0.3981 sigma=0.9563 params={'n_steer': 9, 'n_throttle': 2, 'learning_rate': 0.001405531880392808, 'timesteps': 26173} +[2026-04-14 09:28:23] UCB=2.3049 mu=0.8602 sigma=0.7224 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.001793493447174312, 'timesteps': 19198} +[2026-04-14 09:28:23] UCB=2.2813 mu=0.4904 sigma=0.8954 params={'n_steer': 9, 'n_throttle': 4, 'learning_rate': 0.0011616192816742616, 'timesteps': 13887} +[2026-04-14 09:28:23] UCB=2.2767 mu=0.5194 sigma=0.8787 params={'n_steer': 9, 'n_throttle': 4, 'learning_rate': 0.0011646447444663046, 'timesteps': 21199} +[2026-04-14 09:28:23] UCB=2.2525 mu=0.6254 sigma=0.8136 params={'n_steer': 9, 'n_throttle': 3, 'learning_rate': 0.0010196345864901517, 'timesteps': 22035} +[2026-04-14 09:28:23] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=50.0000 params={'n_steer': 5} +[2026-04-14 09:28:23] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=80.0000 params={'n_steer': 7} +[2026-04-14 09:28:23] [Champion] 🏆 NEW BEST! Trial 0: mean_reward=50.0000 params={'r': 50} +[2026-04-14 09:28:23] [Champion] 🏆 NEW BEST! Trial 1: mean_reward=80.0000 params={'r': 80} +[2026-04-14 09:28:23] [Champion] 🏆 NEW BEST! Trial 3: mean_reward=90.0000 params={'r': 90} +[2026-04-14 09:28:23] [Champion] 🏆 NEW BEST! Trial 5: mean_reward=75.0000 params={'n_steer': 8} +[2026-04-14 09:28:23] [AutoResearch] Only 1 results — using random proposal. diff --git a/docs/RESEARCH_LOG.md b/docs/RESEARCH_LOG.md index edf5697..d6acf59 100644 --- a/docs/RESEARCH_LOG.md +++ b/docs/RESEARCH_LOG.md @@ -363,3 +363,54 @@ v4 guarantee that the worst-case exploit (CTE=0 spinning) earns near-zero reward **The lesson:** When efficiency is only applied to the SPEED BONUS, the base reward from the sim can still be gamed. The efficiency multiplier must apply to the ENTIRE reward. + +--- + +## 2026-04-14 — 🏆 PHASE 2 MILESTONE: All Top Models Complete the Track! + +### Finding: Track Completion Achieved — Multiple Distinct Driving Styles + +**User visual confirmation:** All 3 top Phase 2 models successfully complete the entire track! + +**Model comparison at 3000 steps:** + +| Model | Steps | Reward | Std | Driving Style | +|-------|-------|--------|-----|---------------| +| Trial 20 (n_steer=3, n_throttle=5, lr=0.000225, 13k steps) | **2874** | 2297 | 5.7 | Right lane, very stable ⭐ | +| Trial 8 (n_steer=4, n_throttle=3, lr=0.00117, 34k steps) | 2258 | 2072 | 0.4 | Left/center, oscillating | +| Trial 18 (n_steer=3, n_throttle=5, lr=0.000288, 16k steps) | 2256 | 2072 | 0.4 | Right shoulder, very accurate | + +**Key insight — the track ENDS!** The runs don't time out — the car genuinely completes the full track. The CTE spike at the end is the car reaching the track boundary/finish. + +### Why Different Driving Styles Emerged + +**Action space discretization is the dominant factor:** +- `n_steer=3`: Only LEFT/STRAIGHT/RIGHT → decisive, committed steering → clean lane following +- `n_steer=4`: 4 steer positions → oscillating correction policy (still completes track) +- `n_throttle=5`: More speed granularity → smoother corner negotiation + +**CTE reward symmetry creates multiple valid solutions:** +The reward `base_CTE × efficiency × speed` is symmetric — driving 0.5m left of center = driving 0.5m right of center (same |CTE|). 
PPO random initialization determines which symmetric solution the model converges to. This is why Trials 20 and 18 drive on opposite sides of the road despite similar hyperparameters. + +**Emergent counterintuitive finding: FEWER steering bins → BETTER driving** +Trial 20 (n_steer=3) outperforms Trial 8 (n_steer=4) both in distance and smoothness. With only 3 steering bins, the model is forced to commit to decisive actions, developing a cleaner driving policy. More action granularity introduced oscillation without improving performance. + +### Can We Control Driving Behaviour? + +Yes! Through targeted reward shaping: +1. **Lane position targeting**: `reward = 1 - abs(cte - target_offset)/max_cte` → bias to specific lane position +2. **Anti-oscillation penalty**: Penalize rapid steering changes → eliminates Model 2 oscillation +3. **Asymmetric CTE**: Penalize left-of-center more → enforces right-lane driving rule +4. **Speed zones**: Reward deceleration before corners (future work) + +### Phase 2 → Phase 3 Transition + +**Phase 2 objective ACHIEVED:** Models complete the full track with genuine learned driving behaviour. + +**Phase 3 objectives:** +- Behavioral control (lane position, oscillation suppression) +- Speed optimization (fastest lap time) +- Multi-track generalization +- Fine-tuning from Phase 2 champion + +**Phase 2 Champion:** Trial 20 — n_steer=3, n_throttle=5, lr=0.000225, 13k steps diff --git a/tests/test_behavioral_wrappers.py b/tests/test_behavioral_wrappers.py new file mode 100644 index 0000000..54ec6bd --- /dev/null +++ b/tests/test_behavioral_wrappers.py @@ -0,0 +1,179 @@ +""" +Tests for behavioral_wrappers.py — no simulator required. +""" + +import sys, os, math, pytest +import numpy as np +import gymnasium as gym + +sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'agent')) +from behavioral_wrappers import LanePositionWrapper, AntiOscillationWrapper, AsymmetricCTEWrapper, CombinedBehavioralWrapper + + +class MockEnv(gym.Env): + metadata = {'render_modes': []} + def __init__(self, reward=0.8, cte=0.0, done=False): + super().__init__() + self.action_space = gym.spaces.Box(low=np.array([-1.0, 0.2]), high=np.array([1.0, 1.0]), dtype=np.float32) + self.observation_space = gym.spaces.Box(0, 255, (120, 160, 3), dtype=np.uint8) + self._reward = reward + self._cte = cte + self._done = done + + def set(self, reward=None, cte=None): + if reward is not None: self._reward = reward + if cte is not None: self._cte = cte + + def reset(self, seed=None, **kwargs): + return np.zeros((120, 160, 3), dtype=np.uint8), {} + + def step(self, action): + obs = np.zeros((120, 160, 3), dtype=np.uint8) + info = {'cte': self._cte, 'speed': 2.0, 'lap_count': 0, 'last_lap_time': 0.0} + return obs, self._reward, self._done, False, info + + def close(self): pass + + +# ---- LanePositionWrapper Tests ---- + +def test_lane_position_bonus_at_target(): + """At the target CTE, position bonus is maximized.""" + env = MockEnv(reward=0.8, cte=-0.5) # Car at CTE=-0.5 + wrapped = LanePositionWrapper(env, target_cte=-0.5, position_weight=0.2) + wrapped.reset() + _, r, _, _, _ = wrapped.step(np.array([0.0, 0.5])) + # Should get max bonus: reward + 0.2 * 1.0 = 1.0 + assert r == pytest.approx(1.0, abs=0.01) + + +def test_lane_position_reduces_reward_away_from_target(): + """Away from target CTE, position bonus is smaller.""" + env_near = MockEnv(reward=0.8, cte=-0.5) + env_far = MockEnv(reward=0.8, cte=2.0) + wrapped_near = LanePositionWrapper(env_near, target_cte=-0.5, position_weight=0.2) + 
wrapped_far = LanePositionWrapper(env_far, target_cte=-0.5, position_weight=0.2) + wrapped_near.reset() + wrapped_far.reset() + _, r_near, _, _, _ = wrapped_near.step(np.array([0.0, 0.5])) + _, r_far, _, _, _ = wrapped_far.step(np.array([0.0, 0.5])) + assert r_near > r_far + + +def test_lane_position_no_bonus_when_off_track(): + """No position bonus when original reward <= 0 (off track).""" + env = MockEnv(reward=-1.0, cte=0.0) # Crashed, perfect CTE + wrapped = LanePositionWrapper(env, target_cte=0.0, position_weight=0.5) + wrapped.reset() + _, r, _, _, _ = wrapped.step(np.array([0.0, 0.5])) + assert r == -1.0 + + +def test_right_of_centre_target_biases_right(): + """Setting target_cte=-0.5 (right) gives higher reward for right-of-centre.""" + env_right = MockEnv(reward=0.8, cte=-0.5) # Right of centre + env_left = MockEnv(reward=0.8, cte=+0.5) # Left of centre + wrapped_right = LanePositionWrapper(env_right, target_cte=-0.5) + wrapped_left = LanePositionWrapper(env_left, target_cte=-0.5) + wrapped_right.reset() + wrapped_left.reset() + _, r_right, _, _, _ = wrapped_right.step(np.array([0.0, 0.5])) + _, r_left, _, _, _ = wrapped_left.step(np.array([0.0, 0.5])) + assert r_right > r_left, "Right-of-centre should reward more when target_cte is negative" + + +# ---- AntiOscillationWrapper Tests ---- + +def test_no_penalty_on_first_step(): + """No oscillation penalty on the very first step (no previous action).""" + env = MockEnv(reward=0.8) + wrapped = AntiOscillationWrapper(env, oscillation_penalty=0.5) + wrapped.reset() + _, r, _, _, _ = wrapped.step(np.array([1.0, 0.5])) # Large steer — no penalty yet + assert r == pytest.approx(0.8, abs=0.01) + + +def test_large_steering_change_penalised(): + """Rapid steering reversal should get a penalty.""" + env = MockEnv(reward=0.8) + wrapped = AntiOscillationWrapper(env, oscillation_penalty=0.3) + wrapped.reset() + wrapped.step(np.array([-1.0, 0.5])) # Full left + _, r, _, _, _ = wrapped.step(np.array([+1.0, 0.5])) # Full right — delta=2.0 + # Penalty = 0.3 * 2.0 = 0.6 → reward = 0.8 - 0.6 = 0.2 + assert r < 0.8, "Large steering change should be penalised" + assert r == pytest.approx(0.8 - 0.3 * 2.0, abs=0.05) + + +def test_no_steering_change_no_penalty(): + """Consistent steering should get no penalty.""" + env = MockEnv(reward=0.8) + wrapped = AntiOscillationWrapper(env, oscillation_penalty=0.3) + wrapped.reset() + wrapped.step(np.array([0.3, 0.5])) + _, r, _, _, _ = wrapped.step(np.array([0.3, 0.5])) # Same action — delta=0 + assert r == pytest.approx(0.8, abs=0.01) + + +def test_oscillation_penalty_not_applied_off_track(): + """Off-track (negative reward) should not get oscillation penalty.""" + env = MockEnv(reward=-1.0) + wrapped = AntiOscillationWrapper(env, oscillation_penalty=0.5) + wrapped.reset() + wrapped.step(np.array([-1.0, 0.5])) + _, r, _, _, _ = wrapped.step(np.array([+1.0, 0.5])) # Large change, but off-track + assert r == -1.0, "Off-track reward should stay -1.0" + + +def test_oscillation_score_zero_for_consistent_driving(): + """Constant steering → oscillation score ≈ 0.""" + env = MockEnv(reward=0.8) + wrapped = AntiOscillationWrapper(env, oscillation_penalty=0.1) + wrapped.reset() + for _ in range(15): + wrapped.step(np.array([0.2, 0.5])) # Constant steer + assert wrapped.current_oscillation_score() == pytest.approx(0.0, abs=0.01) + + +# ---- AsymmetricCTEWrapper Tests ---- + +def test_left_of_centre_penalised(): + """Left of centre (positive CTE) should earn less reward than right.""" + env_left = MockEnv(reward=0.8, cte=+1.0) 
+ env_right = MockEnv(reward=0.8, cte=-1.0) + wrapped_left = AsymmetricCTEWrapper(env_left) + wrapped_right = AsymmetricCTEWrapper(env_right) + wrapped_left.reset() + wrapped_right.reset() + _, r_left, _, _, _ = wrapped_left.step(np.array([0.0, 0.5])) + _, r_right, _, _, _ = wrapped_right.step(np.array([0.0, 0.5])) + assert r_right > r_left, "Right-of-centre should reward more than left" + + +def test_crash_unaffected_by_asymmetric(): + """Crash (reward=-1) should not be modified.""" + env = MockEnv(reward=-1.0, cte=+2.0) + wrapped = AsymmetricCTEWrapper(env, left_penalty=0.9) + wrapped.reset() + _, r, _, _, _ = wrapped.step(np.array([0.0, 0.5])) + assert r == -1.0 + + +# ---- CombinedBehavioralWrapper Tests ---- + +def test_combined_wrapper_gives_positive_reward_on_track(): + """Combined wrapper should give positive reward when on track.""" + env = MockEnv(reward=0.8, cte=0.0) + wrapped = CombinedBehavioralWrapper(env, target_cte=0.0, oscillation_penalty=0.0) + wrapped.reset() + _, r, _, _, _ = wrapped.step(np.array([0.0, 0.5])) + assert r > 0 + + +def test_combined_wrapper_crash_still_negative(): + """Crash should remain negative through combined wrapper.""" + env = MockEnv(reward=-1.0, cte=0.0) + wrapped = CombinedBehavioralWrapper(env) + wrapped.reset() + _, r, _, _, _ = wrapped.step(np.array([0.0, 0.5])) + assert r < 0
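
---

Task 3B-05 (selecting a behavioral wrapper via a `--behavior` flag in autoresearch_controller.py) is still open in the plan above, so the wiring below is a minimal sketch only. It assumes the module layout shown in this diff (`reward_wrapper.SpeedRewardWrapper`, `donkeycar_sb3_runner.ThrottleClampWrapper`, `behavioral_wrappers.*`) and reuses the same base stack as evaluate_champion.py; the flag values (`right_lane`, `smooth`) and the specific wrapper parameters are illustrative assumptions, not a finalised design.

```python
# Sketch: composing the Phase 3 behavioral wrappers on top of the Phase 2 stack.
# Assumes the modules added/used in this diff; '--behavior' choices are placeholders.
import gymnasium as gym
import gym_donkeycar  # registers the donkey-* envs

from reward_wrapper import SpeedRewardWrapper
from donkeycar_sb3_runner import ThrottleClampWrapper
from behavioral_wrappers import (
    LanePositionWrapper,
    AntiOscillationWrapper,
    AsymmetricCTEWrapper,
)


def make_env(behavior: str = 'none', track: str = 'donkey-generated-roads-v0'):
    """Build a training/eval env: Phase 2 base wrappers plus an optional
    Phase 3 behavioral wrapper selected by `behavior`."""
    env = gym.make(track)
    env = ThrottleClampWrapper(env, throttle_min=0.2)
    env = SpeedRewardWrapper(env, speed_scale=0.1)   # reward v4 base

    if behavior == 'right_lane':
        # Enforce the right-lane rule: asymmetric CTE penalty plus a target
        # position slightly right of centre (negative CTE = right).
        env = AsymmetricCTEWrapper(env, left_penalty=0.3, right_bonus=0.05)
        env = LanePositionWrapper(env, target_cte=-0.5, position_weight=0.2)
    elif behavior == 'smooth':
        # Suppress Trial-8-style steering oscillation.
        env = AntiOscillationWrapper(env, oscillation_penalty=0.05)
    # behavior == 'none' leaves the Phase 2 reward unchanged.
    return env
```

The ordering follows the USAGE example in behavioral_wrappers.py: the behavioral wrappers sit outside SpeedRewardWrapper, so their bonuses and penalties modify the already-shaped v4 reward, and each one leaves negative (off-track) rewards untouched.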