Lab 10: Boundary
Surround the tool stack with audit records and replayable checks so trust does not depend on rerunning the session.
Logging and evaluation now sit beside execution, not inside the model. That makes tool runs inspectable after the fact.
What this adds
At this layer, each tool action should produce a record with the request, the policy outcome, the result, the duration, and an evaluation. The goal is not bureaucracy; it is making failures inspectable and success reproducible.
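As a minimal sketch of that record shape, the wrapper below emits one JSON line per tool action, covering both the allowed and blocked paths. All names here (`AuditRecord`, `run_with_audit`, the field names) are illustrative assumptions, not an API from the lab itself.

```python
import json
import time
import uuid
from dataclasses import dataclass, asdict
from typing import Any, Callable

@dataclass
class AuditRecord:
    """One record per tool action: request, policy outcome, result, duration, evaluation."""
    run_id: str
    tool: str
    request: dict
    policy_outcome: str   # e.g. "allowed" or "blocked" (assumed vocabulary)
    result: Any
    duration_s: float
    evaluation: str       # filled in later by the eval layer, e.g. "pass" / "fail"

def run_with_audit(tool: str, request: dict,
                   policy: Callable[[str, dict], bool],
                   execute: Callable[[dict], Any]) -> AuditRecord:
    """Execute a tool call and emit an audit record whether it ran or was blocked."""
    start = time.monotonic()
    if not policy(tool, request):
        record = AuditRecord(str(uuid.uuid4()), tool, request,
                             "blocked", None, time.monotonic() - start, "skipped")
    else:
        result = execute(request)
        record = AuditRecord(str(uuid.uuid4()), tool, request,
                             "allowed", result, time.monotonic() - start, "pending")
    # Append-only log, one JSON line per action, so the run is inspectable later.
    print(json.dumps(asdict(record)))
    return record
```

The key design point is that a blocked call still produces a record: the audit trail covers decisions as well as executions.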
This is also where AI-specific concerns start to resemble the governance patterns people already trust in CI and observability.
You can explain why a run passed, failed, or was blocked without rerunning the agent or guessing what happened.
OpenTelemetry is a good broad analog: structured traces and logs wrapped around execution. The eval layer is the AI-specific addition on top.
AI systems become much easier to trust when you can inspect the record instead of replaying the entire run.
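To make "inspect the record instead of replaying the run" concrete, here is a hedged sketch of a replayable check that classifies a run purely from its audit records. The record fields (`policy_outcome`, `evaluation`) mirror the assumed schema above and are not prescribed by the lab.

```python
from typing import Iterable

def explain_run(records: Iterable[dict]) -> str:
    """Classify a run from its audit records alone, without re-executing anything."""
    for rec in records:
        if rec["policy_outcome"] == "blocked":
            return f"blocked: policy refused the {rec['tool']} call"
        if rec["evaluation"] == "fail":
            return f"failed: eval check rejected the {rec['tool']} result"
    return "passed: all actions allowed and all evaluations passed"

# A small hand-written log standing in for a real session's records.
log = [
    {"tool": "search", "policy_outcome": "allowed", "evaluation": "pass"},
    {"tool": "write_file", "policy_outcome": "blocked", "evaluation": "skipped"},
]
print(explain_run(log))  # → blocked: policy refused the write_file call
```

Because the verdict is a pure function of the stored records, two reviewers reading the same log reach the same explanation without rerunning the agent.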