§ services · v4.2 · updated 2026.05

Five sharp projects. Each one takes your prototype to real production.

Every TestML project uses the same five parts. Eval. Attack tests. Drift watch. Smart routing. Production tie-in. We tune each one to your domain. To your data location. To your audit zone. Nothing gets bolted on at launch. Each artifact ships as code into the repo your team owns.

Book a production review · See our test method

SOC 2 / GDPR / HIPAA
Your VPC or ours
30-min discovery · no sales pitch

§ 02 · catalog

What we actually do when you sign the SOW.

Each project is scoped end-to-end against your specific workload. Claims triage. Contract review. Clinical summary. Fraud spotting. RPA control. We pick the subset of the five that fits your risk surface and your release pace. Nothing here is a packaged retainer.

engagement.eval-suite · 6–10 weeks

AI Model Evaluation

An eval harness built for your domain. Legal, medical, finance, or insurance. We tune it to your gold sets. Not to public benchmarks. Suites stay versioned. They survive prompt rewrites and model upgrades.

What ships

Task rubrics co-written with your domain experts
Mixed scoring with checks, LLM judges, and human review
Suite lives in your repo with stable CI runs

artefact · in your repo · ship-as-code
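As a rough illustration of the mixed scoring listed above, here is a minimal Python sketch. It assumes a three-stage order: deterministic checks first, then an LLM judge, then human review for the uncertain middle band. The names `mixed_score`, `Verdict`, `has_citation`, `under_length_limit`, and `stub_judge`, along with the band limits, are hypothetical and not part of any TestML artifact. The judge is a stub standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    score: float        # rubric score between 0.0 and 1.0
    needs_human: bool   # True when the case should go to a reviewer
    reason: str


def mixed_score(
    output: str,
    checks: list[Callable[[str], bool]],
    judge: Callable[[str], float],
    judge_floor: float = 0.4,     # assumed lower edge of the uncertain band
    judge_ceiling: float = 0.8,   # assumed upper edge of the uncertain band
) -> Verdict:
    """Score one model output with checks, then a judge, then humans."""
    # 1. Deterministic checks: any failure is a hard zero, no judge call.
    for check in checks:
        if not check(output):
            return Verdict(0.0, False, f"failed check: {check.__name__}")

    # 2. LLM judge: returns a rubric score in [0, 1].
    score = judge(output)

    # 3. Human review: only the ambiguous middle band is escalated.
    if judge_floor <= score < judge_ceiling:
        return Verdict(score, True, "judge uncertain, escalate to reviewer")
    return Verdict(score, False, "judge confident")


def has_citation(text: str) -> bool:
    return "[" in text and "]" in text


def under_length_limit(text: str) -> bool:
    return len(text) <= 500


def stub_judge(text: str) -> float:
    # Placeholder for a real LLM-judge call; rewards mention of "clause".
    return 0.9 if "clause" in text else 0.5
```

Running the cheap checks first keeps judge spend down, and routing only the middle band to reviewers keeps human time on the cases where the judge is least reliable.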

engagement.redteam · 4–8 weeks

Security & Red-teaming

We fire attacks on every release. Prompt injection. Jailbreaks. Data leaks. Tool misuse. We track each one by severity. Regressions get blocked at the CI gate. All mapped to OWASP LLM Top-10.

What ships

Attack corpus mapped to your tools and search index
Payloads built for your agent graph and JSON-mode edges
Fail-closed gates wired to your CI/CD

artefact · in your repo · owasp-llm-top10
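One way a fail-closed gate could look, sketched in Python under stated assumptions: a finding is a successful attack from the current run, a regression is a finding whose id is absent from the previous release's accepted baseline, and the four-level `Severity` scale is illustrative. `gate_passes` and its parameters are hypothetical names, not a TestML API.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


def gate_passes(findings, baseline_ids, block_at=Severity.HIGH, run_completed=True):
    """Return True only if the release may proceed."""
    # Fail closed: a crashed or partial red-team run blocks the release,
    # because a missing result is not evidence of safety.
    if not run_completed or findings is None:
        return False

    for finding in findings:
        is_regression = finding["id"] not in baseline_ids
        # Block on any new finding at or above the configured severity.
        if is_regression and finding["severity"] >= block_at:
            return False
    return True
```

In CI the caller would exit non-zero when `gate_passes` returns False. The key design choice is that every unknown state maps to "blocked", never to "allowed".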

engagement.drift-sentinel · ongoing · monthly

Continuous Monitoring

We catch drift before users do. Both in behavior and in embedding space. Traces roll up from tokens to leadership dashboards. PII gets redacted at the edge. Storage stays safe for GDPR and HIPAA.

What ships

Stability checks on inputs, outputs, and tool calls
Auto-retest when KS-distance crosses your threshold
Per-tenant alerts to PagerDuty, Opsgenie, or Slack

artefact · in your repo · gdpr-hipaa-safe
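The KS-distance trigger mentioned above can be sketched with the standard library alone. This assumes "KS-distance" means the two-sample Kolmogorov-Smirnov statistic, computed on a one-dimensional signal such as output length or a judge score. The function names and the 0.2 default threshold are illustrative assumptions.

```python
from bisect import bisect_right


def ks_distance(reference, current):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every distinct point in either sample.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def should_retest(reference, current, threshold=0.2):
    """True when drift exceeds the threshold and the suite should re-run."""
    return ks_distance(reference, current) > threshold
```

A result of 0.0 means the two windows have identical empirical distributions, and 1.0 means they do not overlap at all. For production use, a tested implementation such as `scipy.stats.ks_2samp` also returns a p-value.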

engagement.cascade-router · 5–9 weeks

Agent Optimization

Route 90% of traffic to small models. Send the rest to frontier models, and only when confidence dips. Costs drop. Latency drops. Eval scores hold.

What ships

Confidence-gated routing across Haiku, Sonnet, and Opus
Smart cache with TTL set by your retention policy
Per-tenant cost caps with safe fallback paths

artefact · in your repo · cost · −74% typ.
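A minimal sketch of confidence-gated routing with a per-tenant cost cap, in Python. It is simplified to two tiers (small and frontier) rather than the three named above, and it assumes each model call returns an answer, a confidence in [0, 1], and a dollar cost. `CascadeRouter`, the stub models, the 0.75 confidence floor, and the 1.00 cap are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumed model interface: prompt -> (answer, confidence in [0, 1], cost in dollars).
ModelFn = Callable[[str], tuple[str, float, float]]


@dataclass
class CascadeRouter:
    small: ModelFn
    frontier: ModelFn
    min_confidence: float = 0.75
    tenant_cap: float = 1.00  # dollars per tenant
    spent: dict[str, float] = field(default_factory=dict)

    def route(self, tenant: str, prompt: str) -> tuple[str, str]:
        """Return (answer, tier), where tier names the path taken."""
        answer, confidence, cost = self.small(prompt)
        self._charge(tenant, cost)
        if confidence >= self.min_confidence:
            return answer, "small"

        # Low confidence: escalate, unless the tenant's budget is spent.
        # The safe fallback here is to keep the small model's answer.
        if self.spent[tenant] >= self.tenant_cap:
            return answer, "small-fallback"
        answer, _, cost = self.frontier(prompt)
        self._charge(tenant, cost)
        return answer, "frontier"

    def _charge(self, tenant: str, cost: float) -> None:
        self.spent[tenant] = self.spent.get(tenant, 0.0) + cost


def stub_small(prompt: str) -> tuple[str, float, float]:
    # Placeholder: confident only on short prompts.
    return "small-answer", (0.9 if len(prompt) < 20 else 0.3), 0.01


def stub_frontier(prompt: str) -> tuple[str, float, float]:
    return "frontier-answer", 0.95, 0.60
```

The returned tier label makes it easy to log what share of traffic stays on the small model, which is the number to watch when comparing against a target such as 90%.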

engagement.ship-it · 3–6 weeks

Production Integration

We tie the four pieces into the systems you already run. Eval suite. Red-team gate. Drift sentinel. Cascade router. They land in your VPC, CI, monitoring, and on-call flow.

What ships

Terraform or Pulumi modules for your cloud and identity stack
Trace export to Datadog, Grafana, Honeycomb, or your warehouse
Runbooks for on-call, with replay and rollback steps

artefact · in your repo · your-vpc · or-ours

engagement.bundle · by SOW

Bundled or à la carte — your call.

Most teams start with one project. Often eval or red-team. They expand once the gates are green. A few teams sign all five at once. They stand up the whole reliability layer in one quarter.

Cadences we run

Single project · 4–10 weeks · fixed scope
Reliability program · 1–2 quarters · phased gates
Embedded reliability lead · 6 months minimum

discovery · 30 min · no-sales-call

§ 03 · how we run

From scope memo to steady-state — one cadence.

typical · 10–14 weeks · fail-closed gates from week 6 · quarterly method refresh

phase 00
Discovery
Two weeks of scoping with your engineering and audit leads. We read the system. We walk the agent graph. We pick the failure modes that matter for your domain.
↳Scope memo + risk map
phase 01
Eval design
Rubrics co-written with your domain experts. Gold sets pulled from your real traffic. PII redacted up front. Suite shape and judge protocol locked.
↳Suite v0 in repo
phase 02
Implementation
We ship four pieces as code into your repo. The harness. The red-team payloads. The drift telemetry. The routing policy. Your team reviews each PR.
↳PRs · CI green
phase 03
Gate flip
We wire eval and red-team checks into your release pipeline. They run as fail-closed gates. Severity thresholds get tuned against past traffic, so week one is not a flood of blocked releases.
↳Release gate live
phase 04
Steady state
The drift sentinel runs always. It re-runs tests when thresholds break. It pages the right on-call. We refresh the method each quarter as your traffic and models shift.
↳Quarterly review
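The threshold tuning in the gate-flip phase could be approached as below. This is a sketch under assumptions, not a description of the TestML method: it supposes each past release has a single gate score where higher is better, that a release is blocked when its score falls below the cutoff, and that the team sets a tolerable historical block rate. `calibrate_threshold` and `max_block_rate` are hypothetical names.

```python
import math


def calibrate_threshold(past_scores, max_block_rate=0.05):
    """Pick the highest cutoff that would have blocked at most
    max_block_rate of past releases (blocked means score < cutoff)."""
    if not past_scores:
        raise ValueError("need at least one past score")
    ordered = sorted(past_scores)
    # Number of past releases we are willing to have blocked.
    allowed = math.floor(len(ordered) * max_block_rate)
    # Clamp so a block rate of 1.0 still yields a valid index.
    allowed = min(allowed, len(ordered) - 1)
    # At most `allowed` scores are strictly below ordered[allowed].
    return ordered[allowed]
```

Replaying the cutoff against history before switching the gate on shows how many past releases it would have stopped, which is what keeps the first week from being a wall of red builds.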

§ 04 · in scope · out of scope

We are not a generic AI shop, and the scope sheet proves it.

included · artefacts that ship into your repo

Eval suite tuned to your domain, versioned in your repo
Red-team payloads built for your tools, search index, and JSON-mode edges
Drift sentinel with PII-safe telemetry sent to your monitoring stack
Cascade router with smart cache and per-tenant cost caps
Terraform or Pulumi modules and CI flows for your cloud edge
Runbooks, on-call playbooks, replay tools, and rollback steps
Quarterly method refresh as your traffic, models, and risk shift

not included · things we will route to a partner instead

Foundation-model training or pre-training. We assume you have models.
Generic AI strategy decks detached from real code paths
Off-the-shelf dashboards bolted onto your stack
Agent prototypes or proof-of-concepts. We start after the prototype.
Bulk data labeling. We plug into your label partners.
Vendor lock-in. Each artifact lives in your repo, VPC, and accounts.

§ 05 · next step

Bring us a prototype. We bring back a release gate.

A 30-minute review with the founding team. We read the trace. We walk the agent graph. Then we tell you the three changes that lift eval scores the most. No deck. No sales cycle. No follow-up unless you ask.

Book a production review · Read our method

● 30-min call · no sales pitch · founders only
