§ services · v4.2 · updated 2026.05

Five sharp projects. Each one takes your prototype to real production.

Every TestML project draws on the same five parts: eval, attack tests, drift watch, smart routing, and production tie-in. We tune each one to your domain, your data location, and your audit zone. Nothing gets bolted on at launch. Each artifact ships as code into the repo your team owns.

  • SOC 2 / GDPR / HIPAA
  • Your VPC or ours
  • 30-min discovery · no sales pitch

§ 02 · catalog

What we actually do when you sign the SOW.

Each project is scoped end-to-end against your specific workload: claims triage, contract review, clinical summary, fraud spotting, or RPA control. We pick the subset of the five that fits your risk surface and your release pace. Nothing here is a packaged retainer.

engagement.eval-suite · 6–10 weeks

01

AI Model Evaluation

An eval harness built for your domain. Legal, medical, finance, or insurance. We tune it to your gold sets. Not to public benchmarks. Suites stay versioned. They survive prompt rewrites and model upgrades.

What ships
  • Task rubrics co-written with your domain experts
  • Mixed scoring with checks, LLM judges, and human review
  • Suite lives in your repo with stable CI runs
artifact · in your repo · ship-as-code
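
A minimal sketch of how mixed scoring could combine the three signals named above, assuming deterministic checks act as a hard filter and judge and human scores are blended. The `CaseResult` fields, the 0.5 weight, and the 0.7 pass threshold are illustrative assumptions, not values from a TestML suite.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CaseResult:
    checks_passed: bool                  # deterministic checks (schema, format, citations)
    judge_score: float                   # LLM-judge rubric score in [0, 1]
    human_score: Optional[float] = None  # reviewer score in [0, 1], if this case was sampled


def score_case(result: CaseResult, judge_weight: float = 0.5) -> float:
    """Blend the signals; a failed deterministic check zeroes the case."""
    if not result.checks_passed:
        return 0.0
    if result.human_score is None:
        return result.judge_score
    return judge_weight * result.judge_score + (1 - judge_weight) * result.human_score


def suite_pass_rate(results: List[CaseResult], threshold: float = 0.7) -> float:
    """Fraction of cases whose blended score meets the (assumed) threshold."""
    if not results:
        return 0.0
    passed = sum(score_case(r) >= threshold for r in results)
    return passed / len(results)
```

Treating the checks as a hard filter keeps a persuasive but malformed answer from being rescued by a generous judge. Where human review exists, it tempers the judge score instead of replacing it.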
engagement.redteam · 4–8 weeks

02

Security & Red-teaming

We fire attacks on every release. Prompt injection. Jailbreaks. Data leaks. Tool misuse. We track each one by severity. Regressions get blocked at the CI gate. Every finding maps to the OWASP Top 10 for LLM Applications.

What ships
  • Attack corpus mapped to your tools and search index
  • Payloads built for your agent graph and JSON-mode edges
  • Fail-closed gates wired to your CI/CD
artifact · in your repo · owasp-llm-top10
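
A rough sketch of what "fail-closed" means for a release gate: anything the gate cannot positively clear blocks the release. The severity names, the `findings` shape, and the default `block_at` level are assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical severity scale; a real engagement would define its own.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def gate_passes(findings: Optional[List[Dict]], block_at: str = "high") -> bool:
    """Return True only when the scan ran and nothing reaches the blocking severity."""
    if findings is None:
        # The scan crashed or produced no report: fail closed.
        return False
    limit = SEVERITY_RANK[block_at]
    for finding in findings:
        rank = SEVERITY_RANK.get(finding.get("severity"))
        if rank is None or rank >= limit:
            # Unknown or missing severity is treated as blocking.
            return False
    return True
```

In a CI step, the caller would exit non-zero whenever `gate_passes` returns `False`, so a broken scanner stops the release instead of waving it through.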
engagement.drift-sentinel · ongoing · monthly

03

Continuous Monitoring

We catch drift before users do. Both in behavior and in embedding space. Traces roll up from tokens to leadership dashboards. PII gets redacted at the edge. Storage stays safe for GDPR and HIPAA.

What ships
  • Stability checks on inputs, outputs, and tool calls
  • Auto-retest when the Kolmogorov-Smirnov (KS) distance crosses your threshold
  • Per-tenant alerts to PagerDuty, Opsgenie, or Slack
artifact · in your repo · gdpr-hipaa-safe
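
The KS-distance trigger can be sketched in a few lines, assuming drift is measured on one scalar signal per request (for example response length, or distance to an embedding centroid). The function names and the 0.2 threshold are hypothetical. The statistic itself is the standard two-sample Kolmogorov-Smirnov distance.

```python
from bisect import bisect_right
from typing import Sequence


def ks_distance(baseline: Sequence[float], current: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    distance = 0.0
    # The largest gap always occurs at one of the observed sample points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance


def should_retest(baseline: Sequence[float], current: Sequence[float],
                  threshold: float = 0.2) -> bool:
    """Trigger a re-run of the eval suite once the distance exceeds the threshold."""
    return ks_distance(baseline, current) > threshold
```

A distance of 0 means the two windows have identical distributions, and 1 means they do not overlap at all. The threshold sets how much movement is tolerated before the suite re-runs.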
engagement.cascade-router · 5–9 weeks

04

Agent Optimization

Route 90% of traffic to small models. Escalate the rest to frontier models, and only when confidence dips. Costs drop. Latency drops. Eval scores hold.

What ships
  • Confidence-gated routing across Haiku, Sonnet, and Opus
  • Smart cache with TTL set by your retention policy
  • Per-tenant cost caps with safe fallback paths
artifact · in your repo · cost −74% typ.
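
A minimal sketch of confidence-gated routing, assuming each model tier is wrapped as a callable that returns an answer plus a confidence in [0, 1]. How that confidence is estimated, the tier names, and the 0.8 cutoff are all assumptions. No real model API is called here.

```python
from typing import Callable, Sequence, Tuple

# A tier is (name, callable); the callable returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, tiers: Sequence[Tuple[str, Model]],
            min_confidence: float = 0.8) -> Tuple[str, str]:
    """Try tiers from cheapest to most capable; stop at the first confident answer."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for name, model in tiers:
        answer, confidence = model(prompt)
        if confidence >= min_confidence:
            return name, answer
    # No tier was confident: fall back to the last (most capable) tier's answer.
    return name, answer
```

Called with stub tiers such as `[("small", small_model), ("frontier", frontier_model)]`, the router only pays for the frontier call when the small model reports low confidence.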
engagement.ship-it · 3–6 weeks

05

Production Integration

We tie the four pieces into the systems you already run. Eval suite. Red-team gate. Drift sentinel. Cascade router. They land in your VPC, CI, monitoring, and on-call flow.

What ships
  • Terraform or Pulumi modules for your cloud and identity stack
  • Trace export to Datadog, Grafana, Honeycomb, or your warehouse
  • Runbooks for on-call, with replay and rollback steps
artifact · in your repo · your-vpc · or-ours
engagement.bundle · by SOW

Bundled or à la carte — your call.

Most teams start with one project. Often eval or red-team. They expand once the gates are green. A few teams sign all five at once. They stand up the whole reliability layer in one quarter.

Cadences we run
  • Single project · 3–10 weeks · fixed scope
  • Reliability program · 1–2 quarters · phased gates
  • Embedded reliability lead · 6 months minimum
discovery · 30 min · no-sales-call

§ 03 · how we run

From scope memo to steady-state — one cadence.

typical · 10–14 weeks · fail-closed gates from week 6 · quarterly method refresh
  1. phase 00

    Discovery

    Two weeks of scoping with your engineering and audit leads. We read the system. We walk the agent graph. We pick the failure modes that matter for your domain.

    Scope memo + risk map

  2. phase 01

    Eval design

    Rubrics co-written with your domain experts. Gold sets pulled from your real traffic. PII redacted up front. Suite shape and judge protocol locked.

    Suite v0 in repo

  3. phase 02

    Implementation

    We ship four pieces as code into your repo. The harness. The red-team payloads. The drift telemetry. The routing policy. Your team reviews each PR.

    PRs · CI green

  4. phase 03

    Gate flip

    We wire eval and red-team checks into your release pipeline. They run as fail-closed gates. Severity thresholds get tuned against past traffic, so week one is not a flood of blocked releases.

    Release gate live

  5. phase 04

    Steady state

    The drift sentinel runs continuously. It re-runs tests when thresholds are crossed. It pages the right on-call. We refresh the method each quarter as your traffic and models shift.

    Quarterly review

§ 04 · in scope · out of scope

We are not a generic AI shop, and the scope sheet proves it.

included · artifacts that ship into your repo
  • Eval suite tuned to your domain, versioned in your repo
  • Red-team payloads built for your tools, search index, and JSON-mode edges
  • Drift sentinel with PII-safe telemetry sent to your monitoring stack
  • Cascade router with smart cache and per-tenant cost caps
  • Terraform or Pulumi modules and CI flows for your cloud edge
  • Runbooks, on-call playbooks, replay tools, and rollback steps
  • Quarterly method refresh as your traffic, models, and risk shift
not included · things we will route to a partner instead
  • Foundation-model training or pre-training. We assume you have models.
  • Generic AI strategy decks detached from real code paths
  • Off-the-shelf dashboards bolted onto your stack
  • Agent prototypes or proof-of-concepts. We start after the prototype.
  • Bulk data labeling. We plug into your label partners.
  • Vendor lock-in. Each artifact lives in your repo, VPC, and accounts.

§ 05 · next step

Bring us a prototype. We bring back a release gate.

A 30-minute review with the founding team. We read the trace. We walk the agent graph. Then we tell you the three changes that lift eval scores the most. No deck. No sales cycle. No follow-up unless you ask.

  30-min call · no sales pitch · founders only