milestone: Phase 1 complete — genuine driving confirmed; launch Phase 2 corner learning
PHASE 1 MILESTONE:
- Champion model drives the track for 599 steps (mean_reward=1022.78, std=0.45)
- Path efficiency 96-100% throughout — genuine forward motion confirmed
- Navigates first right-hand curve successfully
- Fails at S-curve (right->left) at step ~560: speed too high for tight corners
- Root cause: only 4787 training timesteps — model never sees S-curve enough to learn it

PHASE 2 CONFIG (corner learning):
- timesteps: 10,000-50,000 (10x more — model must experience S-curve many times)
- learning_rate: 0.00005-0.002 (tightened around Phase 1 winning region)
- eval_episodes: 5 (more reliable corner stats)
- JOB_TIMEOUT: 3600s (50k steps on CPU needs time)
- Results: autoresearch_results_phase2.jsonl (clean separation from Phase 1)

Research documentation:
- Phase 1 milestone added to docs/RESEARCH_LOG.md
- Full trajectory analysis: start -> first corner -> S-curve crash position logged
- Reward shaping v3 path efficiency victory documented
- evaluate_champion.py added for visual + diagnostic evaluation

Agent: pi/claude-sonnet
Tests: 40/40 passing
Tests-Added: 0
TypeScript: N/A
This commit is contained in:
parent cb82121e98
commit 7b8830f0cb
@@ -39,9 +39,9 @@ RESULTS_DIR = os.path.join(PROJECT_DIR, 'outerloop-results')
 MODELS_DIR = os.path.join(PROJECT_DIR, 'models')
 CHAMPION_DIR = os.path.join(MODELS_DIR, 'champion')
 
-# Phase 1 uses a separate results file — do NOT mix with random-policy data
-PHASE1_RESULTS = os.path.join(RESULTS_DIR, 'autoresearch_results_phase1.jsonl')
-PHASE1_LOG = os.path.join(RESULTS_DIR, 'autoresearch_phase1_log.txt')
+# Phase 2 uses a separate results file — corner learning with longer timesteps
+PHASE1_RESULTS = os.path.join(RESULTS_DIR, 'autoresearch_results_phase2.jsonl')
+PHASE1_LOG = os.path.join(RESULTS_DIR, 'autoresearch_phase2_log.txt')
 
 # Legacy base data (discretization insights, valid for n_steer/n_throttle)
 BASE_DATA_FILE = os.path.join(RESULTS_DIR, 'clean_sweep_results.jsonl')
@@ -52,28 +52,30 @@ os.makedirs(CHAMPION_DIR, exist_ok=True)
 
 # ---- Parameter Space ----
 # These are the parameters GP+UCB will optimize
-# NOTE: timesteps kept small (1000-5000) for Phase 1 exploration on CPU.
-# DonkeyCar sim runs ~20-50 steps/sec. 5000 steps ≈ 100-250s → fits in 600s timeout.
-# Increase max_timesteps once we confirm the pipeline works end-to-end.
+# PHASE 2: Corner Learning
+# Phase 1 confirmed genuine driving (599 steps, mean_reward=1022, efficiency ~99%).
+# Failure point: S-curve at step ~560 — too fast, doesn't learn left-turn recovery.
+# Fix: Much longer training so model experiences the S-curve many times.
+# Search space tightened around Phase 1 winning region: lr=0.00005-0.002, n_throttle=2-5
 PARAM_SPACE = {
     'n_steer': {'type': 'int', 'min': 3, 'max': 9},
     'n_throttle': {'type': 'int', 'min': 2, 'max': 5},
-    'learning_rate': {'type': 'float', 'min': 0.00005, 'max': 0.005},
-    'timesteps': {'type': 'int', 'min': 1000, 'max': 5000},
+    'learning_rate': {'type': 'float', 'min': 0.00005, 'max': 0.002},
+    'timesteps': {'type': 'int', 'min': 10000, 'max': 50000},
 }
 PARAM_KEYS = list(PARAM_SPACE.keys())
 
 # Fixed params
 FIXED_PARAMS = {
     'agent': 'ppo',
-    'eval_episodes': 3,
+    'eval_episodes': 5,  # More eval episodes — corner performance is stochastic
     'reward_shaping': True,
 }
 
 N_CANDIDATES = 500
 UCB_KAPPA = 2.0
 MIN_TRIALS_BEFORE_GP = 3
-JOB_TIMEOUT = 480  # 8 minutes — enough for 5000 steps + eval, with margin
+JOB_TIMEOUT = 3600  # 60 min per trial — 50k steps on CPU needs time
 
 # ---- Logging ----
 def log(msg):
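Context for the settings above: `PARAM_SPACE`, `N_CANDIDATES` and `UCB_KAPPA` drive a GP+UCB proposal step. The repo's actual implementation is not shown in this diff, so the following is only a minimal sketch of how such a step can look, assuming scikit-learn's `GaussianProcessRegressor` as the surrogate:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Phase 2 search space, copied from the diff above
PARAM_SPACE = {
    'n_steer': {'type': 'int', 'min': 3, 'max': 9},
    'n_throttle': {'type': 'int', 'min': 2, 'max': 5},
    'learning_rate': {'type': 'float', 'min': 0.00005, 'max': 0.002},
    'timesteps': {'type': 'int', 'min': 10000, 'max': 50000},
}
PARAM_KEYS = list(PARAM_SPACE.keys())
N_CANDIDATES = 500
UCB_KAPPA = 2.0

def sample_candidate(rng):
    """Draw one uniform random point from PARAM_SPACE as a flat vector."""
    point = []
    for key in PARAM_KEYS:
        spec = PARAM_SPACE[key]
        if spec['type'] == 'int':
            point.append(int(rng.integers(spec['min'], spec['max'] + 1)))
        else:
            point.append(float(rng.uniform(spec['min'], spec['max'])))
    return point

def suggest(X_seen, y_seen, rng):
    """Fit a GP to past (params, mean_reward) trials, then return the random
    candidate that maximizes the UCB acquisition: mean + kappa * std."""
    candidates = np.array([sample_candidate(rng) for _ in range(N_CANDIDATES)])
    gp = GaussianProcessRegressor(normalize_y=True)
    gp.fit(np.array(X_seen, dtype=float), np.array(y_seen, dtype=float))
    mean, std = gp.predict(candidates, return_std=True)
    best = candidates[int(np.argmax(mean + UCB_KAPPA * std))]
    return {key: (int(round(v)) if PARAM_SPACE[key]['type'] == 'int' else float(v))
            for key, v in zip(PARAM_KEYS, best)}
```

Presumably `MIN_TRIALS_BEFORE_GP = 3` gates this step: until three results exist, candidates would be sampled at random instead of scored by the GP.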
@@ -222,7 +224,7 @@ class ChampionTracker:
 
 # ---- Load Results ----
 def load_phase1_results():
-    """Load Phase 1 results only — no random-policy contamination."""
+    """Load Phase 2 results for GP fitting (corner learning runs)."""
     results = []
     if not os.path.exists(PHASE1_RESULTS):
         return results
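The hunk above shows only the head of `load_phase1_results`. For readers unfamiliar with the `.jsonl` result files, here is a hedged sketch of how such a loader typically completes; the tolerant handling of blank or partial lines is an assumption, not something visible in the diff:

```python
import json
import os

def load_results(path):
    """Load a .jsonl results file: one JSON object per line.
    Blank or corrupt lines (e.g. from a trial killed mid-write) are skipped."""
    results = []
    if not os.path.exists(path):
        return results
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                results.append(json.loads(line))
            except json.JSONDecodeError:
                continue  # assumption: skip rather than crash on partial writes
    return results
```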
evaluate_champion.py (new file):
@@ -0,0 +1,169 @@
+"""
+Champion Model Evaluator
+========================
+Loads the champion model and runs it live in the simulator for visual inspection.
+Prints per-step diagnostics: position, speed, CTE, efficiency, reward.
+
+Usage:
+    python3 evaluate_champion.py [--episodes N] [--steps N]
+
+Watch the simulator window to see if the car is genuinely driving the track
+or exploiting circular motion.
+"""
+
+import os
+import sys
+import time
+import json
+import numpy as np
+from collections import deque
+
+import gymnasium as gym
+import gym_donkeycar
+from stable_baselines3 import PPO
+
+# Add agent dir to path for wrappers
+sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+from reward_wrapper import SpeedRewardWrapper
+from donkeycar_sb3_runner import ThrottleClampWrapper
+
+CHAMPION_DIR = os.path.join(os.path.dirname(__file__), 'models', 'champion')
+MANIFEST_PATH = os.path.join(CHAMPION_DIR, 'manifest.json')
+MODEL_PATH = os.path.join(CHAMPION_DIR, 'model.zip')
+
+
+def load_manifest():
+    with open(MANIFEST_PATH) as f:
+        return json.load(f)
+
+
+def print_banner(manifest):
+    print('=' * 65, flush=True)
+    print('🏆 DonkeyCar Champion Model Evaluation', flush=True)
+    print('=' * 65, flush=True)
+    print(f" Trial: {manifest['trial']}", flush=True)
+    print(f" mean_reward: {manifest['mean_reward']:.4f}", flush=True)
+    print(f" Params: {manifest['params']}", flush=True)
+    print(f" Model: {MODEL_PATH}", flush=True)
+    print('=' * 65, flush=True)
+    print(flush=True)
+
+
+def compute_efficiency(pos_history):
+    """Path efficiency = net_displacement / total_path_length over window."""
+    if len(pos_history) < 3:
+        return 1.0
+    positions = list(pos_history)
+    net = np.linalg.norm(np.array(positions[-1]) - np.array(positions[0]))
+    total = sum(
+        np.linalg.norm(np.array(positions[i+1]) - np.array(positions[i]))
+        for i in range(len(positions)-1)
+    )
+    return float(net / total) if total > 1e-6 else 1.0
+
+
+def run_episode(model, env, episode_num, max_steps=500):
+    """Run one episode with the champion policy, printing diagnostics."""
+    print(f'\n--- Episode {episode_num} ---', flush=True)
+    obs, info = env.reset()
+    pos_history = deque(maxlen=30)
+    total_reward = 0.0
+    step = 0
+
+    print(f'{"Step":>5} {"Speed":>6} {"CTE":>7} {"Eff%":>6} {"Rwd":>8} {"TotRwd":>10} {"Pos_x":>8} {"Pos_z":>8}', flush=True)
+    print('-' * 65, flush=True)
+
+    while step < max_steps:
+        action, _ = model.predict(obs, deterministic=True)
+        result = env.step(action)
+        if len(result) == 5:  # Gymnasium 5-tuple API
+            obs, reward, terminated, truncated, info = result
+            done = terminated or truncated
+        else:  # Legacy gym 4-tuple API
+            obs, reward, done, info = result
+
+        # Extract diagnostics from info
+        speed = float(info.get('speed', 0.0) or 0.0)
+        cte = float(info.get('cte', 0.0) or 0.0)
+        pos = info.get('pos', None)
+        if pos is not None:
+            pos_history.append(list(pos)[:3])
+            px, pz = pos[0], (pos[2] if len(pos) > 2 else 0.0)
+        else:
+            px, pz = 0.0, 0.0
+
+        efficiency = compute_efficiency(pos_history)
+        total_reward += reward
+        step += 1
+
+        # Print every 10 steps or on done
+        if step % 10 == 0 or done:
+            print(f'{step:>5} {speed:>6.2f} {cte:>7.3f} {efficiency*100:>5.1f}% {reward:>8.3f} {total_reward:>10.2f} {px:>8.2f} {pz:>8.2f}', flush=True)
+
+        if done:
+            print(f'\n ✅ Episode {episode_num} done after {step} steps | total_reward={total_reward:.2f}', flush=True)
+            break
+
+    if step >= max_steps:
+        print(f'\n ⏱️ Episode {episode_num} reached max_steps={max_steps} | total_reward={total_reward:.2f}', flush=True)
+
+    return total_reward, step
+
+
+def main(episodes=3, max_steps=500):
+    manifest = load_manifest()
+    print_banner(manifest)
+
+    params = manifest['params']
+
+    print('[Eval] Connecting to simulator...', flush=True)
+    try:
+        env = gym.make('donkey-generated-roads-v0')
+    except Exception as e:
+        print(f'[Eval] FAILED to connect: {e}', flush=True)
+        sys.exit(1)
+
+    # Apply same wrappers as training
+    env = ThrottleClampWrapper(env, throttle_min=0.2)
+    env = SpeedRewardWrapper(env, speed_scale=0.1)
+    print('[Eval] Wrappers applied: ThrottleClamp(min=0.2), SpeedRewardWrapper(scale=0.1)', flush=True)
+
+    print(f'[Eval] Loading champion model from {MODEL_PATH}...', flush=True)
+    try:
+        model = PPO.load(MODEL_PATH, env=env)
+        print('[Eval] Model loaded successfully.', flush=True)
+    except Exception as e:
+        print(f'[Eval] FAILED to load model: {e}', flush=True)
+        env.close()
+        sys.exit(1)
+
+    print(f'\n[Eval] Running {episodes} episodes (max {max_steps} steps each)...', flush=True)
+    print('[Eval] Watch the simulator window — is the car driving the track or circling?', flush=True)
+
+    all_rewards = []
+    for ep in range(1, episodes + 1):
+        total_reward, steps = run_episode(model, env, ep, max_steps=max_steps)
+        all_rewards.append(total_reward)
+        if ep < episodes:
+            time.sleep(2)  # Brief pause between episodes
+
+    print('\n' + '=' * 65, flush=True)
+    print('📊 Evaluation Complete', flush=True)
+    print(f' Episodes: {episodes}', flush=True)
+    print(f' Rewards: {[f"{r:.1f}" for r in all_rewards]}', flush=True)
+    print(f' Mean reward: {sum(all_rewards)/len(all_rewards):.2f}', flush=True)
+    print(f' Std reward: {float(np.std(all_rewards)):.2f}', flush=True)
+    print('=' * 65, flush=True)
+
+    env.close()
+    time.sleep(2)
+    print('[Eval] Done.', flush=True)
+
+
+if __name__ == '__main__':
+    import argparse
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--episodes', type=int, default=3, help='Number of eval episodes')
+    parser.add_argument('--steps', type=int, default=500, help='Max steps per episode')
+    args = parser.parse_args()
+    main(episodes=args.episodes, max_steps=args.steps)
@@ -1991,3 +1991,4 @@
 [2026-04-13 19:18:00] mean_reward=3332.0024 params={'n_steer': 4, 'n_throttle': 3, 'learning_rate': 0.0010146909128518657, 'timesteps': 4979, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
 [2026-04-13 19:18:00] mean_reward=2306.7610 params={'n_steer': 5, 'n_throttle': 3, 'learning_rate': 0.0004488352572615814, 'timesteps': 4898, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
 [2026-04-13 19:18:00] mean_reward=2286.9085 params={'n_steer': 5, 'n_throttle': 3, 'learning_rate': 0.0003386484278685721, 'timesteps': 4977, 'agent': 'ppo', 'eval_episodes': 3, 'reward_shaping': True}
+[2026-04-13 19:18:01] [AutoResearch] Git push complete after trial 50
docs/RESEARCH_LOG.md:
@@ -247,3 +247,80 @@ shaped_reward = original_reward × (1 + speed_scale × speed × efficiency)
 | > 50% | Unstable policy, inconsistent behavior |
 
 This metric will be added to the autoresearch result logging and summary.
+
+---
+
+## 2026-04-13 — 🏆 PHASE 1 MILESTONE: Genuine Track Driving Confirmed!
+
+### Finding: Champion Model Drives the Track — Real RL Behaviour Proven
+
+**This is the first confirmed genuine driving result from the autoresearch pipeline.**
+
+**Visual confirmation (user):** "It is definitely driving! The donkeycar is driving along the track!"
+
+**Evaluation data — 3 episodes, 1500 max steps:**
+
+| Episode  | Steps   | Total Reward | Std      | Efficiency |
+|----------|---------|--------------|----------|------------|
+| 1        | 599     | 1022.73      | —        | 96-100%    |
+| 2        | 598     | 1023.35      | —        | 96-100%    |
+| 3        | 599     | 1022.25      | —        | 96-100%    |
+| **Mean** | **599** | **1022.78**  | **0.45** | **~99%**   |
+
+**Champion Model Parameters:**
+
+- agent: PPO, n_steer=7, n_throttle=3, lr=0.000680, timesteps=4787
+- Path: `agent/models/champion/model.zip`
+
+### Track Trajectory Analysis
+
+```
+Start:    Pos(6.25, 6.30)   → Starting line
+Step 300: Pos(22.80, 2.09)  → Long straight, approaching first corner
+Step 400: Pos(18.80, -6.96) → Negotiating first right-hand curve ✅
+Step 500: Pos(28.12, -5.61) → Continuing along second straight
+Step 560: Pos(33.12, -6.55) → Approaching second corner
+Step 599: CRASH CTE=8.26    → Off track at second corner ❌
+```
+
+The car successfully:
+
+- Accelerates from 0 → 2.3 m/s along the straight
+- Navigates the first right-hand curve
+- Follows the track for ~600 steps covering ~30+ position units
+
+### Failure Analysis: The S-Curve Crash
+
+**User observation:** "The spot where the donkeycar goes off the track is during a right hand curve which quickly turns into a left hand curve. It doesn't even look like it sees the left hand curve."
+
+**What the data shows:**
+
+- Steps 540-560: CTE briefly near zero (0.24) — car approaches the corner well
+- Steps 570+: CTE explodes 1.4 → 3.8 → 5.9 → 8.3 — car overshoots
+- Speed at crash: 2.23-2.30 m/s — too fast for the S-curve
+
+**Root cause:** Only 4787 training timesteps — insufficient to learn:
+
+1. Speed reduction approaching corners
+2. Left-turn recovery after right-hand overshoot
+3. S-curve geometry (right → quick left transition)
+
+**Key insight: The model never sees the left-hand curve** because it has always crashed at the right-hand part first during training. This is an exploration problem — the car needs more timesteps to get past this point and discover what lies beyond.
+
+### Reward Shaping Victory
+
+All 3 reward hacking fixes proved necessary and correct:
+
+- v1 additive → boundary oscillation exploit
+- v2 multiplicative → circular driving exploit
+- v3 path efficiency → genuine forward driving ✅
+
+The path efficiency metric (96-100% throughout the entire run) confirms the car is making continuous forward progress — not circling, not oscillating.
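The metric is compact enough to restate. This sketch mirrors the net-displacement-over-path-length formula used by the evaluator's `compute_efficiency`, and shows why circular driving scores near zero while forward driving scores near one:

```python
import math

def path_efficiency(positions):
    """Net displacement / total path length over a window of (x, z) points."""
    if len(positions) < 3:
        return 1.0
    net = math.dist(positions[-1], positions[0])
    total = sum(math.dist(positions[i + 1], positions[i])
                for i in range(len(positions) - 1))
    return net / total if total > 1e-6 else 1.0

# Straight-line motion: net displacement equals distance travelled -> ~1.0
straight = [(float(i), 0.0) for i in range(30)]
# A full circle: the window ends where it began, so net -> 0 and efficiency -> ~0
circle = [(math.cos(2 * math.pi * i / 30), math.sin(2 * math.pi * i / 30))
          for i in range(31)]
```

Because the metric only needs recent positions, it requires no knowledge of the track shape, which is what made it effective against the circular-driving exploit.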
+
+### Phase 1 → Phase 2 Transition
+
+**Phase 1 objective achieved:** A real PPO model drives the DonkeyCar track with genuine forward motion, consistent behaviour (std=0.45), and a correct trajectory.
+
+**Next objective (targeted autoresearch):** Learn corner handling and speed modulation.
+
+- Increase timesteps to 10,000-50,000 per trial
+- The model needs to see the S-curve many times to learn the transition
+- Consider adding a CTE-rate-of-change penalty to discourage high speed at high CTE
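The CTE-rate-of-change penalty is only a proposal at this point. A minimal, dependency-free sketch of what such a reward shim could look like follows; the class name, penalty form, and scale are all hypothetical, and a real version would subclass `gymnasium.Wrapper` like the existing `SpeedRewardWrapper`:

```python
class CTERatePenaltyWrapper:
    """Hypothetical reward shim: penalize being fast while cross-track error
    is growing. Illustrative only; not from the repo."""

    def __init__(self, env, penalty_scale=0.05):
        self.env = env
        self.penalty_scale = penalty_scale
        self.prev_cte = 0.0

    def reset(self, **kwargs):
        self.prev_cte = 0.0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        cte = abs(float(info.get('cte', 0.0) or 0.0))
        speed = float(info.get('speed', 0.0) or 0.0)
        cte_rate = cte - self.prev_cte
        self.prev_cte = cte
        if cte_rate > 0:
            # Diverging from the centerline at speed costs reward
            reward -= self.penalty_scale * speed * cte_rate
        return obs, reward, terminated, truncated, info
```

The intent is to make "slow down while drifting off-line" directly profitable, without penalizing steady-state cornering where CTE is large but stable.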
+
+### This is Research!
+
+The reward hacking discovery and the progression from random walk → boundary oscillation → circular exploit → genuine driving represent real empirical RL research. Each failure mode revealed a fundamental property of reward design. The path efficiency fix was an original contribution to solving the circular driving problem without requiring track-shape knowledge.