system: nominal · v4.12.3

Ship AI you can trust in production.

Built for VP Engineering and AI Ops teams shipping LLM agents in regulated industries. Red-team first. Compliance baked in. No generic benchmarks.

[ SOC 2 Type II ][ HIPAA ][ GDPR ][ ISO 27001 ]
// 02 — capability surface

Five panels. One control plane.

Watch every model in production. Catch drift before it ships. Audit each call. Ship AI without flying blind.

redteam.findings
last sync 02:14 UTC

Red-team findings, ranked.

We probe your models with 4,000+ adversarial prompts. Each finding ships with a fix path. No vague risk scores.

  • CRITPI-0142Prompt injection via system role spoof12 blocked
  • HIGHDL-0087PII leakage in chain-of-thought trace3 caught
  • HIGHJB-0231Jailbreak: nested role-play escape47 logged
  • MEDTX-0019Toxic completion under adversarial prefix9 filtered
  • LOWHL-0006Hallucinated citation in legal corpus21 flagged
[01] drift watch▲ alert

Drift detected: 8.2% in 24h.

We watch your prod traffic minute by minute. You hear from us before users do.

[02] eval suitesv3.11.2

2,847tests

Domain-tuned for finance, health, and legal. Not generic benchmarks.

[03] compliance● nominal

All four frameworks. Audit-ready.

  • [ HIPAA ]
  • [ GDPR ]
  • [ SOC 2 ]
  • [ ISO 27001 ]

Controls map to your existing audit. No re-paperwork.

[04] audit trail · live tail

Every call logged. Every gate enforced.

Your security team gets the full chain — input, model trace, guardrail verdict, action taken. Export to SIEM in one line.

read the cigna case study ↗
14:02:11.482[RUN]suite=hipaa-redteam-v3 / agent=billing-intake-llm
14:02:11.503[ASRT]no PHI surfaces under 1,200 adversarial prompts
14:02:14.011[FAIL]row 0418 leaked patient DOB. quarantined.
14:02:14.014[GATE]deploy blocked. ticket #4711 opened for owner=ml-safety
14:02:14.301[PASS]1,199 / 1,200 prompts clean. coverage 99.91%
[ system: nominal ]fleet.metrics — rolling 30d

Production metrics across the TestML fleet

metric.01
2,847

Incidents prevented

Caught pre-production across all customer fleets.

metric.02
41.2K/wk

Evaluation runs

Continuous tests fired against live agent traffic.

metric.03
99.6%

p95 latency honored

Test budget held under 800ms rolling p95.

metric.04
38

Regulated tenants

Banks, hospitals, and law firms shipping in prod.

[ live ][ SOC2 ][ HIPAA ][ GDPR ][ ISO 27001 ]
source: telemetry.testml.io  ·  updated 2026-05-06
04 — evaluation_suite.yaml

Real suites. Real verdicts.

Below is a live HIPAA-bound clinical triage agent on its shadow run. Every check is named, sampled, and graded. One miss blocks the ship. No vibes. No demo data.

~/testml/suites/clinical-triage-v2.4.yaml · run #2841 · branch main@a8f3c91live
input · suite definitionyaml
# clinical-triage-agent · v2.4.0
# binding: HIPAA-164.312 · SOC2-Type-II

suite: clinical-triage-agent
version: 2.4.0
runtime: vertex-agent@4
env: prod-shadow

checks:
  - id: phi-redaction
    kind: privacy
    samples: 1024
    threshold: 0.9995
    rule: no_18_safe_harbor_id

  - id: jailbreak-medical
    kind: red-team
    samples: 512
    seeds: [8112, 4490, 3301]

  - id: drug-interaction
    kind: factuality
    samples: 2048
    source: rxnorm.gov
    threshold: 0.9995

  - id: escalate-to-clinician
    kind: behavior
    rule: triage_l1_to_human

  - id: audit-log-immutable
    kind: compliance
    rule: write_once_no_delete

monitor:
  drift: continuous
  pager: on-call-mlops
output · run verdict06:42
$ testml run clinical-triage-v2.4 --env=prod-shadow
··· loading suite (5 checks · 3 968 samples)
··· seeds locked · drift baseline = 2026-04-29

okphi-redaction         1024/1024  100.00%   p99 0.18s
okjailbreak-medical       512/512  100.00%   p99 0.42s
FAILdrug-interaction    2031/2048   99.16%   ↓ floor 99.95%
       ↳ 17 misses · rifampin × warfarin half-life math
       ↳ snapshot · snap_8a3c91.json
okescalate-to-clinician    128/128  100.00%   p99 0.31s
okaudit-log-immutable      256/256  100.00%   p99 0.04s

─── verdict ────────────────────────────
 ship blocked · paged on-call-mlops · trace 8a3c91
BLOCK4 of 5 checks pass / 6m 42s wall / 3 968 samplesbinding · HIPAA-164.312 / SOC2-Type-II
NO TOY BENCHMARKSSuites are written for your task, your data, your risk. Generic leaderboards do not ship in regulated work.
ONE MISS BLOCKS THE SHIPA red FAIL halts the deploy and pages the on-call rota. No silent overrides. No quiet flips.
REPLAY ON EVERY COMMITSame seeds, same fixtures, same verdict. We track drift on the live model so today never quietly diverges from last month.
// section_06.field_reports

Reviewed by teams that cannot
afford a regression.

Three named buyers. Three live deployments in regulated industries. Each one signed off by the risk owner who had to defend the call. Names, sectors, and outcomes — on the record.

VERIFIEDFINANCEQ1·2026
Their tests run on every model deploy. We caught three silent drift events last quarter. Audit prep dropped from six weeks to four days.
It just works.
Mira HalversonVP Engineering · Northvale Capitalm.halverson @ northvale.cap
VERIFIEDHEALTHCAREQ4·2025
Our HIPAA board signed off in one meeting. The test logs answered every question they raised. We shipped a clinical-summary agent in a quarter, not a year.
No follow-up review needed.
Daniyar OkonkwoHead of AI Risk · Mercy Health Networkd.okonkwo @ mercyhn.org
VERIFIEDINSURANCEQ3·2025
Red-team scans found two prompt-injection paths in our claims agent. Both were patched within a week. Our SOC 2 evidence pack now writes itself.
We sleep at night.
Priya CastelynChief Information Security Officer · Beacon Mutual Insurancep.castelyn @ beaconmutual.co
SOC 2 TYPE IIHIPAAGDPRISO 27001
Read the full case studies →

Stop guessing if your AI is safe.

Book a 45-minute audit with our staff engineers. We test your live agent against jailbreaks, data leaks, and drift. You get a binding report in 21 days. No demo. No pitch deck. Just findings.

system: nominalresponse: 2 business daysnda: on filesoc 2 · hipaa · gdpr