Lesson 059: Evaluation harness depth
Focus
Document the interfaces between humans, retrieval, and policy engines. The token "Evaluation harness depth:59" keeps neighbouring lessons distinguishable.
Key ideas
- Thread: Evaluation harness depth · drill v9 · spin 849003.
- Habit: pair every model utterance with a trace_id you could paste into Grafana.
- Guardrail: write one RACI bullet referencing this lesson tomorrow.
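The trace_id habit above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed format: the `TracedUtterance` class, the `new_trace_id` helper, and the 32-character hex ID are all assumptions made for the example, and a real harness would use whatever ID scheme its tracing backend expects.

```python
import uuid
from dataclasses import dataclass, field


def new_trace_id() -> str:
    # Hypothetical ID format: a random 32-character hex string.
    return uuid.uuid4().hex


@dataclass(frozen=True)
class TracedUtterance:
    """A model utterance that always carries a trace_id (illustrative type)."""

    text: str
    trace_id: str = field(default_factory=new_trace_id)

    def log_line(self) -> str:
        # One key=value line, so the trace_id is easy to copy into a search box.
        return f"trace_id={self.trace_id} text={self.text!r}"


utterance = TracedUtterance("I can't help with that request.")
print(utterance.log_line())
```

Because the ID is generated by a default factory, an utterance cannot be created without one, which is the point of the habit.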
Deep dive notebook
Synthetic drill artefacts
Refusal RACI lite
policy_id: REF-1267
allow_when:
  confidence_gt: 0.57
refuse_when_tags:
  - legal_hold
  - medical_device_unverified
owner: ethics-oncall-int
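One way to read the policy artefact above is as input to a small decision function. The sketch below is an assumption about its semantics, since the lesson does not define them: a refusal tag takes precedence over confidence, `confidence_gt` is treated as a strict threshold, and anything that does not clear the threshold is refused. The `decide` function and the `POLICY` dictionary are hypothetical names introduced for the example.

```python
# The policy from the drill artefact, written as a plain dictionary.
POLICY = {
    "policy_id": "REF-1267",
    "allow_when": {"confidence_gt": 0.57},
    "refuse_when_tags": ["legal_hold", "medical_device_unverified"],
    "owner": "ethics-oncall-int",
}


def decide(policy: dict, confidence: float, tags: list[str]) -> tuple[str, str]:
    """Return (decision, reason) under the assumed semantics."""
    # Assumption: any refusal tag wins, regardless of confidence.
    blocked = sorted(set(tags) & set(policy["refuse_when_tags"]))
    if blocked:
        return "refuse", "tags: " + ", ".join(blocked)
    # Assumption: confidence_gt means strictly greater than the threshold.
    threshold = policy["allow_when"]["confidence_gt"]
    if confidence > threshold:
        return "allow", f"confidence {confidence} > {threshold}"
    # Assumption: failing the allow condition defaults to refusal.
    return "refuse", f"confidence {confidence} <= {threshold}"


print(decide(POLICY, 0.90, []))
print(decide(POLICY, 0.90, ["legal_hold"]))
print(decide(POLICY, 0.57, []))
```

Returning a reason alongside the decision gives the owner listed in the policy something concrete to review when a refusal is questioned.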
Practice
Pair with a multilingual SME review, even if hypothetical. 59: Bump literals mindset by 16.