Lab 10: Boundary
Surround the tool stack with audit records and replayable checks so trust does not depend on rerunning the session.
Logging and evaluation now sit beside execution, not inside the model. That makes tool runs inspectable after the fact.
What this adds
At this layer, each tool action should produce a record with the request, the policy outcome, the result, the duration, and an evaluation. The goal is not bureaucracy; it is making failures inspectable and success reproducible.
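As a minimal sketch of that record shape, the wrapper below emits one JSON line per tool action, covering both the allowed and blocked paths. All names here (`AuditRecord`, `run_with_audit`, the field names) are illustrative assumptions, not an API from the lab itself.

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict
from typing import Any, Callable

@dataclass
class AuditRecord:
    """One record per tool action: request, policy outcome, result, duration, evaluation."""
    run_id: str
    tool: str
    request: dict
    policy_outcome: str   # e.g. "allowed" or "blocked" (assumed vocabulary)
    result: Any
    duration_s: float
    evaluation: str       # filled in later by the eval layer, e.g. "pass" / "fail"

def run_with_audit(tool: str, request: dict,
                   policy: Callable[[str, dict], bool],
                   execute: Callable[[dict], Any]) -> AuditRecord:
    """Execute a tool call and emit an audit record whether it ran or was blocked."""
    start = time.monotonic()
    if not policy(tool, request):
        record = AuditRecord(str(uuid.uuid4()), tool, request,
                             "blocked", None, time.monotonic() - start, "skipped")
    else:
        result = execute(request)
        record = AuditRecord(str(uuid.uuid4()), tool, request,
                             "allowed", result, time.monotonic() - start, "pending")
    # Append-only log, one JSON line per action, so the run is inspectable later.
    print(json.dumps(asdict(record)))
    return record
```

The key design point is that a blocked call still produces a record: the audit trail covers decisions as well as executions.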
This is also where AI-specific concerns start to resemble the governance patterns people already trust in CI and observability.
You can explain why a run passed, failed, or was blocked without rerunning the agent or guessing what happened.
OpenTelemetry is a good broad analog: structured traces and logs wrapped around execution. The eval layer is the AI-specific addition on top.
AI systems become much easier to trust when you can inspect the record instead of replaying the entire run.
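To make "inspect the record instead of replaying the run" concrete, here is a hedged sketch of a replayable check that classifies a run purely from its audit records. The record fields (`policy_outcome`, `evaluation`) mirror the assumed schema above and are not prescribed by the lab.

```python
from typing import Iterable

def explain_run(records: Iterable[dict]) -> str:
    """Classify a run from its audit records alone, without re-executing anything."""
    for rec in records:
        if rec["policy_outcome"] == "blocked":
            return f"blocked: policy refused the {rec['tool']} call"
        if rec["evaluation"] == "fail":
            return f"failed: eval check rejected the {rec['tool']} result"
    return "passed: all actions allowed and all evaluations passed"

# A small hand-written log standing in for a real session's records.
log = [
    {"tool": "search", "policy_outcome": "allowed", "evaluation": "pass"},
    {"tool": "write_file", "policy_outcome": "blocked", "evaluation": "skipped"},
]
print(explain_run(log))  # → blocked: policy refused the write_file call
```

Because the verdict is a pure function of the stored records, two reviewers reading the same log reach the same explanation without rerunning the agent.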