Field logRFCs from production engagements

Engineering notes from production-grade, regulated AI systems.

Memos, post-mortems, and methodology drops written by the engineers who shipped the work — eval suites, drift dashboards, red-team probes, and routing economics from legal, medical, financial, and insurance deployments. Nothing is published unless it cleared an adversarial second-engineer review.

02 / logrecent entries

Memos written next to a running eval suite, not a marketing brief.

Each entry below was authored during a live engagement, signed by the engineer who held the pager, and reviewed by a second engineer who tried to break the claim. Customer-identifying detail is redacted; the methodology and the numbers are not.

RFC-014
2026-04-22
author: platform-eng
domain: underwriting

Cascading Haiku → Opus on insurance underwriting: 41% spend recovery without faithfulness loss

Routing layer that defers to the larger model only when first-pass confidence drops below 0.78. We share the eval matrix, the cost-per-decision math, and the two failure modes we hit before stabilising on a token-budget guard.

review: adversarialartefacts: eval-run + traceshipped: yes
spend Δ-41%
p95 ms218
faithfulness0.974
RFC-013
2026-04-09
author: evals
domain: legal-discovery

Schema-strict outputs for privileged-document review: why JSON-mode is the floor, not the ceiling

When a redaction agent fabricated a Bates range that did not exist, JSON-mode was the gate that caught it. We document the strict-schema harness we wrap around every legal-domain agent and the three test families it exposes.

review: adversarialartefacts: eval-run + traceshipped: yes
schema_strict1.000
cases3,840
leak rate0.0%
RFC-012
2026-03-27
author: red-team
domain: clinical-triage

Prompt-injection class r3: indirect attacks via reference documents in HIPAA workloads

Two leakage paths surfaced when intake summaries were pasted from upstream systems containing adversarial markdown. We publish the probe pack, the regression cases, and the policy-layer mitigation that closes both.

review: adversarialartefacts: eval-run + traceshipped: yes
probes186
blocked184
ship statusship-with-guardrail
RFC-011
2026-03-12
author: observability
domain: fraud-ops

Embedding-KL drift over 90 days: a calmer alarm than perplexity for multi-agent fraud queues

Perplexity flapped on holiday traffic; embedding KL did not. We share the drift dashboard, the alert thresholds we settled on, and the playbook when KL exceeds 0.05 against the production reference window.

review: adversarialartefacts: eval-run + traceshipped: yes
emb_KL0.031
trendstable
false alarms-86%
RFC-010
2026-02-28
author: platform-eng
domain: claims-intake

Decision tracing through a 7-agent claims pipeline: the trace ID is the new request ID

How we propagate a single trace through router, retriever, validator, summariser, classifier, escalation, and audit agents — and how the trace becomes the contract that the eval suite, the SOC 2 auditor, and the on-call engineer all read.

review: adversarialartefacts: eval-run + traceshipped: yes
agents7
trace coverage100%
MTTR Δ-62%
RFC-009
2026-02-11
author: compliance-eng
domain: compliance

GDPR + HIPAA + SOC 2 with one evidence pipeline: keeping auditors and engineers on the same artefact

We rebuilt our control evidence so that the same eval-run record satisfies the model-risk auditor, the privacy officer, and the security reviewer. The artefact, the retention policy, and the gotchas we hit are documented here.

review: adversarialartefacts: eval-run + traceshipped: yes
controls146
frameworks3
manual evidence0

03 / topicswhat we write about

Six surfaces where production AI either holds or breaks.

We publish where we have shipped — and only where we have shipped. If a topic is missing, it is missing on purpose: we have not yet done the work in production to claim it.

{ 01 }

Eval design

  • Domain-specific suites for legal, medical, financial, insurance.
  • Faithfulness, schema, PII, and cost-per-decision as first-class checks.
  • Golden sets that survive model swaps.
{ 02 }

Red-team & guardrails

  • Indirect injection via retrieved documents.
  • Tool-use abuse, jailbreak families, exfiltration probes.
  • Mitigations gated by regression cases, not vibes.
{ 03 }

Routing & economics

  • Cascade policies, cache strategies, deferral thresholds.
  • Token-budget guards that keep p95 stable under fan-out.
  • Cost-per-decision as a shipped metric, not a slide.
{ 04 }

Drift & observability

  • Embedding-KL over moving production windows.
  • Decision traces that an auditor and an SRE can both read.
  • Quiet alarms over loud ones.
{ 05 }

Multi-agent systems

  • Contracts between agents written as schemas, not prose.
  • Failure isolation when one tool returns garbage.
  • Replay harnesses for incidents that crossed five hops.
{ 06 }

Compliance posture

  • GDPR, HIPAA, SOC 2 mapped to the same evidence artefact.
  • Retention, lineage, and reviewer access as engineering work.
  • Auditor-ready before the auditor arrives.

04 / processhow a memo is born

The pipeline that decides whether an entry ships.

Notes follow the same gate that production code follows at TestML: artefacts, adversarial review, redaction, and an engineer's name on the byline. No ghostwriting, no marketing rewrites, no claims without a captured eval run behind them.

step 01
engagement.start

Every memo begins as a production engagement. We write nothing we have not shipped behind a SOC 2-controlled boundary on a real customer workload.

step 02
evidence.capture

Eval runs, drift snapshots, red-team probes, and routing economics are captured as artefacts during the engagement — not reconstructed afterwards from memory.

step 03
review.adversarial

A second engineer who was not on the engagement attempts to break the claim. If the claim survives, the note moves to draft. If it does not, the note is rewritten or shelved.

step 04
redact.publish

We redact customer-identifying detail, keep the methodology and the numbers, and publish under the engineer who ran the work. No ghostwriting, no marketing rewrites.

Currently shipping

A long-form note on cascading routers under regulated workloads.

We are working through a 12-month dataset of underwriting decisions, isolating the deferral threshold that holds faithfulness above 0.97 while clawing back roughly 40% of model spend. Draft is in adversarial review; numbers are final. Expected publication this quarter.

Methodology in detail
If you operate one of these systems

Bring us a workload, we will return a readiness audit.

A 30-minute production review with one of the engineers behind these notes. No sales pitch. We read your eval setup, your guardrails, and your incident log, and we tell you which of the surfaces above is most likely to fail you in the next ninety days.

Book a production review

// engagement.start

Stop reading about it. Run an eval against your own pipeline.

Thirty minutes, one of our platform engineers, your real workload. You leave with a written readiness audit and the eval-run artefacts to back every line of it.

Book your readiness audit