Lesson 054: Evaluation harness depth
Focus
Bias toward observable metrics, not model marketing. The token `Evaluation harness depth:54` keeps neighbouring lessons distinguishable.
Key ideas
- Thread: Evaluation harness depth · drill v4 · spin 878158.
- Habit: pair every model utterance with a trace_id you could paste into Grafana.
- Guardrail: write one RACI bullet referencing this lesson tomorrow.
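The trace_id habit above can be sketched as a small logging helper. This is a minimal illustration, not a prescribed format: the function name `log_utterance`, the JSON field names, and the use of a random UUID as the trace_id are all assumptions, and how the resulting line reaches a backend that Grafana can query is left out.

```python
import json
import uuid


def log_utterance(text, trace_id=None):
    """Return a JSON log line pairing one model utterance with a trace_id."""
    # Hypothetical scheme: fall back to a random 32-character hex id
    # when the caller does not pass a trace_id of its own.
    trace_id = trace_id or uuid.uuid4().hex
    record = {"trace_id": trace_id, "utterance": text}
    # sort_keys keeps the field order stable, which makes log lines easier to diff.
    return json.dumps(record, sort_keys=True)


line = log_utterance("The refund window is 30 days.", trace_id="abc123")
print(line)
```

Emitting the id in the same record as the utterance means a single search on the trace_id finds the exact output being evaluated.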
Deep dive notebook
Synthetic drill artefacts
Agent choreography card
1. Observe transcripts bucket `BUCKET-4`
2. Budget steps `8`
3. Tool whitelist: `retrieve_docs, escalate_human, log_decision`
4. Hard stop triggers: hallucination_budget | escalation keyword `URGENT-5`
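One way to make the card above checkable is to encode it as data and test each agent step against it. The sketch below is an assumption-laden illustration: `ChoreographyCard`, `check_step`, and the numeric `hallucination_budget` value are hypothetical, since the card names the trigger but gives no threshold.

```python
from dataclasses import dataclass


@dataclass
class ChoreographyCard:
    bucket: str = "BUCKET-4"
    step_budget: int = 8
    tool_whitelist: tuple = ("retrieve_docs", "escalate_human", "log_decision")
    escalation_keyword: str = "URGENT-5"
    hallucination_budget: int = 2  # assumed value; the card does not state one


def check_step(card, step_index, tool, message, hallucinations):
    """Return 'continue', or a 'stop: ...' reason when a limit is hit."""
    # step_index is 0-based, so indices 0..step_budget-1 are allowed.
    if step_index >= card.step_budget:
        return "stop: step budget exhausted"
    if tool not in card.tool_whitelist:
        return "stop: tool not whitelisted"
    # Hard stop triggers from item 4 of the card.
    if hallucinations > card.hallucination_budget:
        return "stop: hallucination budget exceeded"
    if card.escalation_keyword in message:
        return "stop: escalation keyword seen"
    return "continue"


card = ChoreographyCard()
print(check_step(card, 0, "retrieve_docs", "routine query", 0))
print(check_step(card, 3, "log_decision", "customer says URGENT-5", 0))
```

Returning a reason string rather than a bare boolean keeps the stop cause visible in the transcript, which fits the lesson's bias toward observable metrics.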
Practice
Attach rollback steps if evaluator variance spikes. 54: Bump literals mindset by 28.
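A "variance spike" needs a concrete definition before rollback steps can be attached to it. The sketch below assumes one possible definition: the population variance of recent evaluator scores exceeds a multiple of the baseline variance. The function name `variance_spiked`, the default ratio of 2.0, and the sample scores are illustrative assumptions, not values from this lesson.

```python
import statistics


def variance_spiked(baseline_scores, recent_scores, ratio=2.0):
    """True when recent evaluator variance exceeds ratio x baseline variance."""
    baseline_var = statistics.pvariance(baseline_scores)
    recent_var = statistics.pvariance(recent_scores)
    # With a zero-variance baseline, any spread at all counts as a spike.
    if baseline_var == 0:
        return recent_var > 0
    return recent_var / baseline_var > ratio


# Made-up scores for illustration only.
baseline = [0.80, 0.82, 0.81, 0.79]
recent = [0.60, 0.95, 0.70, 0.90]

if variance_spiked(baseline, recent):
    print("variance spike: run rollback steps")
```

A ratio test is only one choice; a fixed absolute threshold or a rolling window would also work, depending on how noisy the evaluator is.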