Field logRFCs from production engagements

Engineering notes from production-grade, regulated AI systems.

Memos, post-mortems, and methodology drops written by the engineers who shipped the work — eval suites, drift dashboards, red-team probes, and routing economics from legal, medical, financial, and insurance deployments. Nothing is published unless it cleared an adversarial second-engineer review.

Read the methodology See case studies

02 / logrecent entries

Memos written next to a running eval suite, not a marketing brief.

Each entry below was authored during a live engagement, signed by the engineer who held the pager, and reviewed by a second engineer who tried to break the claim. Customer-identifying detail is redacted; the methodology and the numbers are not.

RFC-014

2026-04-22

author: platform-eng

domain: underwriting

Cascading Haiku → Opus on insurance underwriting: 41% spend recovery without faithfulness loss

Routing layer that defers to the larger model only when first-pass confidence drops below 0.78. We share the eval matrix, the cost-per-decision math, and the two failure modes we hit before stabilising on a token-budget guard.

review: adversarialartefacts: eval-run + traceshipped: yes

spend Δ-41%

p95 ms218

faithfulness0.974

RFC-013

2026-04-09

author: evals

domain: legal-discovery

Schema-strict outputs for privileged-document review: why JSON-mode is the floor, not the ceiling

When a redaction agent fabricated a Bates range that did not exist, JSON-mode was the gate that caught it. We document the strict-schema harness we wrap around every legal-domain agent and the three test families it exposes.

review: adversarialartefacts: eval-run + traceshipped: yes

schema_strict1.000

cases3,840

leak rate0.0%

RFC-012

2026-03-27

author: red-team

domain: clinical-triage

Prompt-injection class r3: indirect attacks via reference documents in HIPAA workloads

Two leakage paths surfaced when intake summaries were pasted from upstream systems containing adversarial markdown. We publish the probe pack, the regression cases, and the policy-layer mitigation that closes both.

review: adversarialartefacts: eval-run + traceshipped: yes

probes186

blocked184

ship statusship-with-guardrail

RFC-011

2026-03-12

author: observability

domain: fraud-ops

Embedding-KL drift over 90 days: a calmer alarm than perplexity for multi-agent fraud queues

Perplexity flapped on holiday traffic; embedding KL did not. We share the drift dashboard, the alert thresholds we settled on, and the playbook when KL exceeds 0.05 against the production reference window.

review: adversarialartefacts: eval-run + traceshipped: yes

emb_KL0.031

trendstable

false alarms-86%

RFC-010

2026-02-28

author: platform-eng

domain: claims-intake

Decision tracing through a 7-agent claims pipeline: the trace ID is the new request ID

How we propagate a single trace through router, retriever, validator, summariser, classifier, escalation, and audit agents — and how the trace becomes the contract that the eval suite, the SOC 2 auditor, and the on-call engineer all read.

review: adversarialartefacts: eval-run + traceshipped: yes

agents7

trace coverage100%

MTTR Δ-62%

RFC-009

2026-02-11

author: compliance-eng

domain: compliance

GDPR + HIPAA + SOC 2 with one evidence pipeline: keeping auditors and engineers on the same artefact

We rebuilt our control evidence so that the same eval-run record satisfies the model-risk auditor, the privacy officer, and the security reviewer. The artefact, the retention policy, and the gotchas we hit are documented here.

review: adversarialartefacts: eval-run + traceshipped: yes

controls146

frameworks3

manual evidence0

03 / topicswhat we write about

Six surfaces where production AI either holds or breaks.

We publish where we have shipped — and only where we have shipped. If a topic is missing, it is missing on purpose: we have not yet done the work in production to claim it.

{ 01 }

Eval design

Domain-specific suites for legal, medical, financial, insurance.
Faithfulness, schema, PII, and cost-per-decision as first-class checks.
Golden sets that survive model swaps.

{ 02 }

Red-team & guardrails

Indirect injection via retrieved documents.
Tool-use abuse, jailbreak families, exfiltration probes.
Mitigations gated by regression cases, not vibes.

{ 03 }

Routing & economics

Cascade policies, cache strategies, deferral thresholds.
Token-budget guards that keep p95 stable under fan-out.
Cost-per-decision as a shipped metric, not a slide.

{ 04 }

Drift & observability

Embedding-KL over moving production windows.
Decision traces that an auditor and an SRE can both read.
Quiet alarms over loud ones.

{ 05 }

Multi-agent systems

Contracts between agents written as schemas, not prose.
Failure isolation when one tool returns garbage.
Replay harnesses for incidents that crossed five hops.

{ 06 }

Compliance posture

GDPR, HIPAA, SOC 2 mapped to the same evidence artefact.
Retention, lineage, and reviewer access as engineering work.
Auditor-ready before the auditor arrives.

04 / processhow a memo is born

The pipeline that decides whether an entry ships.

Notes follow the same gate that production code follows at TestML: artefacts, adversarial review, redaction, and an engineer's name on the byline. No ghostwriting, no marketing rewrites, no claims without a captured eval run behind them.

step 01

engagement.start

Every memo begins as a production engagement. We write nothing we have not shipped behind a SOC 2-controlled boundary on a real customer workload.

step 02

evidence.capture

Eval runs, drift snapshots, red-team probes, and routing economics are captured as artefacts during the engagement — not reconstructed afterwards from memory.

step 03

review.adversarial

A second engineer who was not on the engagement attempts to break the claim. If the claim survives, the note moves to draft. If it does not, the note is rewritten or shelved.

step 04

redact.publish

We redact customer-identifying detail, keep the methodology and the numbers, and publish under the engineer who ran the work. No ghostwriting, no marketing rewrites.

Currently shipping

A long-form note on cascading routers under regulated workloads.

We are working through a 12-month dataset of underwriting decisions, isolating the deferral threshold that holds faithfulness above 0.97 while clawing back roughly 40% of model spend. Draft is in adversarial review; numbers are final. Expected publication this quarter.

Methodology in detail

If you operate one of these systems

Bring us a workload, we will return a readiness audit.

A 30-minute production review with one of the engineers behind these notes. No sales pitch. We read your eval setup, your guardrails, and your incident log, and we tell you which of the surfaces above is most likely to fail you in the next ninety days.

Book a production review

// engagement.start

Stop reading about it. Run an eval against your own pipeline.

Thirty minutes, one of our platform engineers, your real workload. You leave with a written readiness audit and the eval-run artefacts to back every line of it.

Book your readiness audit

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────