Engineering notes from production-grade, regulated AI systems.
Memos, post-mortems, and methodology drops written by the engineers who shipped the work — eval suites, drift dashboards, red-team probes, and routing economics from legal, medical, financial, and insurance deployments. Nothing is published unless it cleared an adversarial second-engineer review.
Memos written next to a running eval suite, not a marketing brief.
Each entry below was authored during a live engagement, signed by the engineer who held the pager, and reviewed by a second engineer who tried to break the claim. Customer-identifying detail is redacted; the methodology and the numbers are not.
RFC-014
2026-04-22
author: platform-eng
domain: underwriting
Cascading Haiku → Opus on insurance underwriting: 41% spend recovery without faithfulness loss
Routing layer that defers to the larger model only when first-pass confidence drops below 0.78. We share the eval matrix, the cost-per-decision math, and the two failure modes we hit before stabilising on a token-budget guard.
Schema-strict outputs for privileged-document review: why JSON-mode is the floor, not the ceiling
When a redaction agent fabricated a Bates range that did not exist, JSON-mode was the gate that caught it. We document the strict-schema harness we wrap around every legal-domain agent and the three test families it exposes.
Prompt-injection class r3: indirect attacks via reference documents in HIPAA workloads
Two leakage paths surfaced when intake summaries were pasted from upstream systems containing adversarial markdown. We publish the probe pack, the regression cases, and the policy-layer mitigation that closes both.
Embedding-KL drift over 90 days: a calmer alarm than perplexity for multi-agent fraud queues
Perplexity flapped on holiday traffic; embedding KL did not. We share the drift dashboard, the alert thresholds we settled on, and the playbook when KL exceeds 0.05 against the production reference window.
Decision tracing through a 7-agent claims pipeline: the trace ID is the new request ID
How we propagate a single trace through router, retriever, validator, summariser, classifier, escalation, and audit agents — and how the trace becomes the contract that the eval suite, the SOC 2 auditor, and the on-call engineer all read.
GDPR + HIPAA + SOC 2 with one evidence pipeline: keeping auditors and engineers on the same artefact
We rebuilt our control evidence so that the same eval-run record satisfies the model-risk auditor, the privacy officer, and the security reviewer. The artefact, the retention policy, and the gotchas we hit are documented here.
Six surfaces where production AI either holds or breaks.
We publish where we have shipped — and only where we have shipped. If a topic is missing, it is missing on purpose: we have not yet done the work in production to claim it.
{ 01 }
Eval design
Domain-specific suites for legal, medical, financial, insurance.
Faithfulness, schema, PII, and cost-per-decision as first-class checks.
Token-budget guards that keep p95 stable under fan-out.
Cost-per-decision as a shipped metric, not a slide.
{ 04 }
Drift & observability
Embedding-KL over moving production windows.
Decision traces that an auditor and an SRE can both read.
Quiet alarms over loud ones.
{ 05 }
Multi-agent systems
Contracts between agents written as schemas, not prose.
Failure isolation when one tool returns garbage.
Replay harnesses for incidents that crossed five hops.
{ 06 }
Compliance posture
GDPR, HIPAA, SOC 2 mapped to the same evidence artefact.
Retention, lineage, and reviewer access as engineering work.
Auditor-ready before the auditor arrives.
04 / process/how a memo is born
The pipeline that decides whether an entry ships.
Notes follow the same gate that production code follows at TestML: artefacts, adversarial review, redaction, and an engineer's name on the byline. No ghostwriting, no marketing rewrites, no claims without a captured eval run behind them.
step 01
engagement.start
Every memo begins as a production engagement. We write nothing we have not shipped behind a SOC 2-controlled boundary on a real customer workload.
step 02
evidence.capture
Eval runs, drift snapshots, red-team probes, and routing economics are captured as artefacts during the engagement — not reconstructed afterwards from memory.
step 03
review.adversarial
A second engineer who was not on the engagement attempts to break the claim. If the claim survives, the note moves to draft. If it does not, the note is rewritten or shelved.
step 04
redact.publish
We redact customer-identifying detail, keep the methodology and the numbers, and publish under the engineer who ran the work. No ghostwriting, no marketing rewrites.
Currently shipping
A long-form note on cascading routers under regulated workloads.
We are working through a 12-month dataset of underwriting decisions, isolating the deferral threshold that holds faithfulness above 0.97 while clawing back roughly 40% of model spend. Draft is in adversarial review; numbers are final. Expected publication this quarter.
Bring us a workload, we will return a readiness audit.
A 30-minute production review with one of the engineers behind these notes. No sales pitch. We read your eval setup, your guardrails, and your incident log, and we tell you which of the surfaces above is most likely to fail you in the next ninety days.
Stop reading about it. Run an eval against your own pipeline.
Thirty minutes, one of our platform engineers, your real workload. You leave with a written readiness audit and the eval-run artefacts to back every line of it.