milestone: Phase 1 complete — genuine driving confirmed; launch Phase 2 corner learning

PHASE 1 MILESTONE:
- Champion model drives the track for 599 steps (mean_reward=1022.78, std=0.45)
- Path efficiency 96-100% throughout — genuine forward motion confirmed
- Navigates first right-hand curve successfully
- Fails at S-curve (right->left) at step ~560: speed too high for tight corners
- Root cause: only 4787 training timesteps — model never sees S-curve enough to learn it

PHASE 2 CONFIG (corner learning):
- timesteps: 10,000-50,000 (10x more — model must experience S-curve many times)
- learning_rate: 0.00005-0.002 (tightened around Phase 1 winning region)
- eval_episodes: 5 (more reliable corner stats)
- JOB_TIMEOUT: 3600s (at the sim's ~20-50 steps/sec, 50k training steps alone take roughly 17-42 min)
- Results: autoresearch_results_phase2.jsonl (clean separation from Phase 1)

Research documentation:
- Phase 1 milestone added to docs/RESEARCH_LOG.md
- Full trajectory analysis: start -> first corner -> S-curve crash position logged
- Reward shaping v3 path efficiency victory documented
- evaluate_champion.py added for visual + diagnostic evaluation

Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: 0
TypeScript: N/A
Author: Paul Huliganga
Date: 2026-04-13 19:33:06 -04:00
Commit: 7b8830f0cb (parent: cb82121e98)
4 changed files with 260 additions and 11 deletions


@@ -39,9 +39,9 @@ RESULTS_DIR = os.path.join(PROJECT_DIR, 'outerloop-results')
 MODELS_DIR = os.path.join(PROJECT_DIR, 'models')
 CHAMPION_DIR = os.path.join(MODELS_DIR, 'champion')
-# Phase 1 uses a separate results file — do NOT mix with random-policy data
-PHASE1_RESULTS = os.path.join(RESULTS_DIR, 'autoresearch_results_phase1.jsonl')
-PHASE1_LOG = os.path.join(RESULTS_DIR, 'autoresearch_phase1_log.txt')
+# Phase 2 uses a separate results file — corner learning with longer timesteps
+PHASE1_RESULTS = os.path.join(RESULTS_DIR, 'autoresearch_results_phase2.jsonl')
+PHASE1_LOG = os.path.join(RESULTS_DIR, 'autoresearch_phase2_log.txt')
 # Legacy base data (discretization insights, valid for n_steer/n_throttle)
 BASE_DATA_FILE = os.path.join(RESULTS_DIR, 'clean_sweep_results.jsonl')
@@ -52,28 +52,30 @@ os.makedirs(CHAMPION_DIR, exist_ok=True)
 # ---- Parameter Space ----
 # These are the parameters GP+UCB will optimize
-# NOTE: timesteps kept small (1000-5000) for Phase 1 exploration on CPU.
-# DonkeyCar sim runs ~20-50 steps/sec. 5000 steps ≈ 100-250s → fits in 600s timeout.
-# Increase max_timesteps once we confirm the pipeline works end-to-end.
+# PHASE 2: Corner Learning
+# Phase 1 confirmed genuine driving (599 steps, mean_reward=1022, efficiency ~99%).
+# Failure point: S-curve at step ~560 — too fast, doesn't learn left-turn recovery.
+# Fix: Much longer training so model experiences the S-curve many times.
+# Search space tightened around Phase 1 winning region: lr=0.00005-0.002, n_throttle=2-5
 PARAM_SPACE = {
     'n_steer': {'type': 'int', 'min': 3, 'max': 9},
     'n_throttle': {'type': 'int', 'min': 2, 'max': 5},
-    'learning_rate': {'type': 'float', 'min': 0.00005, 'max': 0.005},
-    'timesteps': {'type': 'int', 'min': 1000, 'max': 5000},
+    'learning_rate': {'type': 'float', 'min': 0.00005, 'max': 0.002},
+    'timesteps': {'type': 'int', 'min': 10000, 'max': 50000},
 }
 PARAM_KEYS = list(PARAM_SPACE.keys())
 # Fixed params
 FIXED_PARAMS = {
     'agent': 'ppo',
-    'eval_episodes': 3,
+    'eval_episodes': 5,  # More eval episodes — corner performance is stochastic
     'reward_shaping': True,
 }
 N_CANDIDATES = 500
 UCB_KAPPA = 2.0
 MIN_TRIALS_BEFORE_GP = 3
-JOB_TIMEOUT = 480  # 8 minutes — enough for 5000 steps + eval, with margin
+JOB_TIMEOUT = 3600  # 60 min per trial — 50k steps on CPU needs time
 # ---- Logging ----
 def log(msg):
@@ -222,7 +224,7 @@ class ChampionTracker:
 # ---- Load Results ----
 def load_phase1_results():
-    """Load Phase 1 results only — no random-policy contamination."""
+    """Load Phase 2 results for GP fitting (corner learning runs)."""
     results = []
     if not os.path.exists(PHASE1_RESULTS):
         return results
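For context on the constants above: after MIN_TRIALS_BEFORE_GP random trials, the loop fits a GP to (params, mean_reward) pairs, scores N_CANDIDATES random samples, and runs the UCB maximizer (predicted mean + UCB_KAPPA * std) next. A minimal sketch of that proposal step, assuming scikit-learn is available; the helper names here are illustrative, not the repo's actual functions:

```python
import random
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def sample_candidate(space):
    """Draw one random point from a PARAM_SPACE-style dict (illustrative helper)."""
    return {
        name: (random.randint(spec['min'], spec['max'])
               if spec['type'] == 'int'
               else random.uniform(spec['min'], spec['max']))
        for name, spec in space.items()
    }

def to_vector(params, space):
    """Min-max normalize a params dict into [0, 1]^d for the GP."""
    return [(params[k] - s['min']) / (s['max'] - s['min']) for k, s in space.items()]

def propose_next(trials, space, n_candidates=500, kappa=2.0, min_trials=3):
    """UCB proposal: argmax over sampled candidates of predicted mean + kappa * std."""
    if len(trials) < min_trials:
        return sample_candidate(space)  # pure random exploration first
    X = np.array([to_vector(t['params'], space) for t in trials])
    y = np.array([t['mean_reward'] for t in trials])
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    cands = [sample_candidate(space) for _ in range(n_candidates)]
    mu, sigma = gp.predict(np.array([to_vector(c, space) for c in cands]), return_std=True)
    return cands[int(np.argmax(mu + kappa * sigma))]
```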

agent/evaluate_champion.py (new file)

@@ -0,0 +1,169 @@
"""
Champion Model Evaluator
========================
Loads the champion model and runs it live in the simulator for visual inspection.
Prints per-step diagnostics: position, speed, CTE, efficiency, reward.
Usage:
python3 evaluate_champion.py [--episodes N] [--steps N]
Watch the simulator window to see if the car is genuinely driving the track
or exploiting circular motion.
"""
import os
import sys
import time
import json
import numpy as np
from collections import deque
import gymnasium as gym
import gym_donkeycar
from stable_baselines3 import PPO
# Add agent dir to path for wrappers
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from reward_wrapper import SpeedRewardWrapper
from donkeycar_sb3_runner import ThrottleClampWrapper
CHAMPION_DIR = os.path.join(os.path.dirname(__file__), 'models', 'champion')
MANIFEST_PATH = os.path.join(CHAMPION_DIR, 'manifest.json')
MODEL_PATH = os.path.join(CHAMPION_DIR, 'model.zip')
def load_manifest():
    with open(MANIFEST_PATH) as f:
        return json.load(f)
def print_banner(manifest):
    print('=' * 65, flush=True)
    print('🏆 DonkeyCar Champion Model Evaluation', flush=True)
    print('=' * 65, flush=True)
    print(f" Trial: {manifest['trial']}", flush=True)
    print(f" mean_reward: {manifest['mean_reward']:.4f}", flush=True)
    print(f" Params: {manifest['params']}", flush=True)
    print(f" Model: {MODEL_PATH}", flush=True)
    print('=' * 65, flush=True)
    print(flush=True)
def compute_efficiency(pos_history):
"""Path efficiency = net_displacement / total_path_length over window."""
if len(pos_history) < 3:
return 1.0
positions = list(pos_history)
net = np.linalg.norm(np.array(positions[-1]) - np.array(positions[0]))
total = sum(
np.linalg.norm(np.array(positions[i+1]) - np.array(positions[i]))
for i in range(len(positions)-1)
)
return float(net / total) if total > 1e-6 else 1.0
def run_episode(model, env, episode_num, max_steps=500):
"""Run one episode with the champion policy, printing diagnostics."""
print(f'\n--- Episode {episode_num} ---', flush=True)
obs, info = env.reset()
pos_history = deque(maxlen=30)
total_reward = 0.0
step = 0
print(f'{"Step":>5} {"Speed":>6} {"CTE":>7} {"Eff%":>6} {"Rwd":>8} {"TotRwd":>10} {"Pos_x":>8} {"Pos_z":>8}', flush=True)
print('-' * 65, flush=True)
while step < max_steps:
action, _ = model.predict(obs, deterministic=True)
result = env.step(action)
if len(result) == 5:
obs, reward, terminated, truncated, info = result
done = terminated or truncated
else:
obs, reward, done, info = result
# Extract diagnostics from info
speed = float(info.get('speed', 0.0) or 0.0)
cte = float(info.get('cte', 0.0) or 0.0)
pos = info.get('pos', None)
if pos is not None:
pos_history.append(list(pos)[:3])
px, pz = pos[0], pos[2] if len(pos) > 2 else 0.0
else:
px, pz = 0.0, 0.0
efficiency = compute_efficiency(pos_history)
total_reward += reward
step += 1
# Print every 10 steps or on done
if step % 10 == 0 or done:
print(f'{step:>5} {speed:>6.2f} {cte:>7.3f} {efficiency*100:>5.1f}% {reward:>8.3f} {total_reward:>10.2f} {px:>8.2f} {pz:>8.2f}', flush=True)
if done:
print(f'\n ✅ Episode {episode_num} done after {step} steps | total_reward={total_reward:.2f}', flush=True)
break
if step >= max_steps:
print(f'\n ⏱️ Episode {episode_num} reached max_steps={max_steps} | total_reward={total_reward:.2f}', flush=True)
return total_reward, step
def main(episodes=3, max_steps=500):
    manifest = load_manifest()
    print_banner(manifest)
    params = manifest['params']
    print('[Eval] Connecting to simulator...', flush=True)
    try:
        env = gym.make('donkey-generated-roads-v0')
    except Exception as e:
        print(f'[Eval] FAILED to connect: {e}', flush=True)
        sys.exit(1)
    # Apply same wrappers as training
    env = ThrottleClampWrapper(env, throttle_min=0.2)
    env = SpeedRewardWrapper(env, speed_scale=0.1)
    print('[Eval] Wrappers applied: ThrottleClamp(min=0.2), SpeedRewardWrapper(scale=0.1)', flush=True)
    print(f'[Eval] Loading champion model from {MODEL_PATH}...', flush=True)
    try:
        model = PPO.load(MODEL_PATH, env=env)
        print('[Eval] Model loaded successfully.', flush=True)
    except Exception as e:
        print(f'[Eval] FAILED to load model: {e}', flush=True)
        env.close()
        sys.exit(1)
    print(f'\n[Eval] Running {episodes} episodes (max {max_steps} steps each)...', flush=True)
    print('[Eval] Watch the simulator window — is the car driving the track or circling?', flush=True)
    all_rewards = []
    for ep in range(1, episodes + 1):
        total_reward, steps = run_episode(model, env, ep, max_steps=max_steps)
        all_rewards.append(total_reward)
        if ep < episodes:
            time.sleep(2)  # Brief pause between episodes
    print('\n' + '=' * 65, flush=True)
    print('📊 Evaluation Complete', flush=True)
    print(f' Episodes: {episodes}', flush=True)
    print(f' Rewards: {[f"{r:.1f}" for r in all_rewards]}', flush=True)
    print(f' Mean reward: {sum(all_rewards)/len(all_rewards):.2f}', flush=True)
    print(f' Std reward: {float(np.std(all_rewards)):.2f}', flush=True)
    print('=' * 65, flush=True)
    env.close()
    time.sleep(2)
    print('[Eval] Done.', flush=True)
if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--episodes', type=int, default=3, help='Number of eval episodes')
    parser.add_argument('--steps', type=int, default=500, help='Max steps per episode')
    args = parser.parse_args()
    main(episodes=args.episodes, max_steps=args.steps)


@@ -1991,3 +1991,4 @@
 [2026-04-13 19:18:00] mean_reward=3332.0024 params={'n_steer': 4, 'n_throttle': 3, 'learning_rate': 0.0010146909128518657, 'timesteps': 4979, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
 [2026-04-13 19:18:00] mean_reward=2306.7610 params={'n_steer': 5, 'n_throttle': 3, 'learning_rate': 0.0004488352572615814, 'timesteps': 4898, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
 [2026-04-13 19:18:00] mean_reward=2286.9085 params={'n_steer': 5, 'n_throttle': 3, 'learning_rate': 0.0003386484278685721, 'timesteps': 4977, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 19:18:01] [AutoResearch] Git push complete after trial 50


@@ -247,3 +247,80 @@ shaped_reward = original_reward × (1 + speed_scale × speed × efficiency)
| > 50% | Unstable policy, inconsistent behavior |
This metric will be added to the autoresearch result logging and summary.
---
## 2026-04-13 — 🏆 PHASE 1 MILESTONE: Genuine Track Driving Confirmed!
### Finding: Champion Model Drives the Track — Real RL Behaviour Proven
**This is the first confirmed genuine driving result from the autoresearch pipeline.**
**Visual confirmation (user):** "It is definitely driving! The donkeycar is driving along the track!"
**Evaluation data — 3 episodes, 1500 max steps:**

| Episode | Steps | Total Reward | Std | Efficiency |
|---------|-------|--------------|-------|------------|
| 1 | 599 | 1022.73 | — | 96-100% |
| 2 | 598 | 1023.35 | — | 96-100% |
| 3 | 599 | 1022.25 | — | 96-100% |
| **Mean** | **599** | **1022.78** | **0.45** | **~99%** |

**Champion Model Parameters:**
- agent: PPO, n_steer=7, n_throttle=3, lr=0.000680, timesteps=4787
- Path: `agent/models/champion/model.zip`
### Track Trajectory Analysis
```
Start: Pos(6.25, 6.30) → Starting line
Step 300: Pos(22.80, 2.09) → Long straight, approaching first corner
Step 400: Pos(18.80, -6.96) → Negotiating first right-hand curve ✅
Step 500: Pos(28.12, -5.61) → Continuing along second straight
Step 560: Pos(33.12, -6.55) → Approaching second corner
Step 599: CRASH CTE=8.26 → Off track at second corner ❌
```
The car successfully:
- Accelerates from 0 → 2.3 m/s along the straight
- Navigates the first right-hand curve
- Follows the track for ~600 steps covering ~30+ position units
### Failure Analysis: The S-Curve Crash
**User observation:** "The spot where the donkeycar goes off the track is during a right hand curve which quickly turns into a left hand curve. It doesn't even look like it sees the left hand curve."
**What the data shows:**
- Steps 540-560: CTE briefly near zero (0.24) — car approaches corner well
- Steps 570+: CTE explodes 1.4 → 3.8 → 5.9 → 8.3 — car overshoots
- Speed at crash: 2.23-2.30 m/s — too fast for the S-curve
**Root cause:** Only 4787 training timesteps — insufficient to learn:
1. Speed reduction approaching corners
2. Left-turn recovery after right-hand overshoot
3. S-curve geometry (right → quick left transition)
**Key insight: The model never sees the left-hand curve** because it has always crashed at the right-hand part first during training. This is an exploration problem — the car needs more timesteps to get past this point and discover what's beyond.
### Reward Shaping Victory
All 3 reward hacking fixes proved necessary and correct:
- v1 additive → boundary oscillation exploit
- v2 multiplicative → circular driving exploit
- v3 path efficiency → genuine forward driving ✅
The path efficiency metric (96-100% throughout entire run) confirms the car is making continuous forward progress — not circling, not oscillating.
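For reference, the v3 shaping reduces to the formula this log section opened with: `shaped_reward = original_reward × (1 + speed_scale × speed × efficiency)`. A minimal sketch of such a wrapper, assuming a Gymnasium env whose step `info` carries `'speed'` and `'pos'`; this is an illustration of the technique, not the repo's exact `SpeedRewardWrapper`:

```python
from collections import deque
import numpy as np
import gymnasium as gym

class EfficiencyShapedReward(gym.Wrapper):
    """Sketch of v3 shaping: shaped = r * (1 + speed_scale * speed * efficiency)."""
    def __init__(self, env, speed_scale=0.1, window=30):
        super().__init__(env)
        self.speed_scale = speed_scale
        self.positions = deque(maxlen=window)

    def reset(self, **kwargs):
        self.positions.clear()
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if info.get('pos') is not None:
            self.positions.append(np.asarray(list(info['pos'])[:3], dtype=float))
        speed = float(info.get('speed', 0.0) or 0.0)
        shaped = reward * (1.0 + self.speed_scale * speed * self._efficiency())
        return obs, shaped, terminated, truncated, info

    def _efficiency(self):
        # Net displacement / path length over the window; 1.0 until enough data.
        # Circling or oscillating drives this toward 0, killing the speed bonus.
        if len(self.positions) < 3:
            return 1.0
        pts = list(self.positions)
        net = np.linalg.norm(pts[-1] - pts[0])
        total = sum(np.linalg.norm(b - a) for a, b in zip(pts, pts[1:]))
        return float(net / total) if total > 1e-6 else 1.0
```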
### Phase 1 → Phase 2 Transition
**Phase 1 objective achieved:** A real PPO model drives the DonkeyCar track with genuine forward motion, consistent behaviour (std=0.45), and correct trajectory.
**Next objective (targeted autoresearch):** Learn corner handling and speed modulation.
- Increase timesteps to 10,000-50,000 per trial
- The model needs to see the S-curve many times to learn the transition
- Consider adding a CTE-rate-of-change penalty to discourage high speed while the car is diverging from the centerline (one possible shape is sketched below)
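A sketch of what that penalty could look like, assuming the env's step `info` exposes `'cte'` and `'speed'` as the evaluator above does; the wrapper name and coefficient are hypothetical, and nothing in the Phase 2 config commits to this yet:

```python
import gymnasium as gym

class CtePenaltyWrapper(gym.Wrapper):
    """Hypothetical: penalize fast CTE growth, i.e. carrying speed while diverging."""
    def __init__(self, env, penalty_scale=0.5):
        super().__init__(env)
        self.penalty_scale = penalty_scale
        self.prev_cte = None

    def reset(self, **kwargs):
        self.prev_cte = None
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        cte = abs(float(info.get('cte', 0.0) or 0.0))
        if self.prev_cte is not None:
            cte_rate = max(0.0, cte - self.prev_cte)  # only growing error is penalized
            speed = float(info.get('speed', 0.0) or 0.0)
            reward -= self.penalty_scale * cte_rate * speed
        self.prev_cte = cte
        return obs, reward, terminated, truncated, info
```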
### This is Research!
The reward hacking discovery and the progression from random walk → boundary oscillation → circular exploit → genuine driving represent real empirical RL research. Each failure mode revealed a fundamental property of reward design. The path efficiency fix was an original contribution to solving the circular driving problem without requiring track-shape knowledge.