engagement.eval-suite · 6–10 weeks
01
AI Model Evaluation
An eval harness built for your domain: legal, medical, finance, or insurance. We tune it to your gold sets, not to public benchmarks. Suites stay versioned, so they survive prompt rewrites and model upgrades.
What ships
- Task rubrics co-written with your domain experts
- Mixed scoring with checks, LLM judges, and human review
- Suite lives in your repo with stable CI runs
engagement.redteam · 4–8 weeks
02
Security & Red-teaming
We fire attacks on every release: prompt injection, jailbreaks, data leaks, tool misuse. We track each one by severity. Regressions get blocked at the CI gate. Every attack maps to the OWASP Top 10 for LLM Applications.
What ships
- Attack corpus mapped to your tools and search index
- Payloads built for your agent graph and JSON-mode edges
- Fail-closed gates wired to your CI/CD
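As a rough illustration of what "fail-closed" means for a severity gate, here is a minimal Python sketch. The `Finding` record, the severity labels, and the `block_at` default are assumptions made for this example, not the shipped schema. The idea is that anything the gate cannot positively verify (no results, an empty run, an unknown severity) blocks the release instead of letting it through.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical severity scale; a real engagement would use your own labels.
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass(frozen=True)
class Finding:
    attack_id: str    # identifier of the attack in the corpus
    severity: str     # one of SEVERITY_ORDER's keys
    succeeded: bool   # True if the attack got through


def gate_passes(findings: Optional[Sequence[Finding]], block_at: str = "high") -> bool:
    """Return True only when the run is demonstrably clean."""
    # Fail closed: a missing or empty report proves nothing, so block.
    if not findings:
        return False
    threshold = SEVERITY_ORDER[block_at]
    for finding in findings:
        rank = SEVERITY_ORDER.get(finding.severity)
        # Fail closed: an unrecognized severity label blocks the release.
        if rank is None:
            return False
        # A successful attack at or above the threshold blocks the release.
        if finding.succeeded and rank >= threshold:
            return False
    return True
```

In a CI job, the script would exit non-zero whenever `gate_passes` returns `False`, so a crashed or skipped red-team run cannot turn the gate green.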
engagement.drift-sentinel · ongoing · monthly
03
Continuous Monitoring
We catch drift before users do, both in behavior and in embedding space. Traces roll up from tokens to leadership dashboards. PII gets redacted at the edge, so stored traces stay within GDPR and HIPAA requirements.
What ships
- Stability checks on inputs, outputs, and tool calls
- Auto-retest when KS-distance crosses your threshold
- Per-tenant alerts to PagerDuty, Opsgenie, or Slack
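The KS-distance trigger can be sketched in a few lines of Python. This is a minimal illustration, assuming drift is measured on a single numeric signal (for example, output length or a judge score) and that "KS-distance" means the two-sample Kolmogorov-Smirnov statistic. The function names and the `0.2` default threshold are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def ks_distance(baseline: Sequence[float], current: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(baseline), sorted(current)
    gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so checking every observed value finds the maximum gap.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def needs_retest(baseline: Sequence[float], current: Sequence[float],
                 threshold: float = 0.2) -> bool:
    """True when the current window has drifted past the (assumed) threshold."""
    return ks_distance(baseline, current) > threshold
```

A statistic of 0 means the two windows have identical empirical distributions; 1 means they do not overlap at all. The threshold is the knob you would tune per signal.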
engagement.cascade-router · 5–9 weeks
04
Agent Optimization
Route 90% of traffic to small models. Escalate to frontier models only when confidence dips. Costs drop. Latency drops. Eval scores hold.
What ships
- Confidence-gated routing across Haiku, Sonnet, and Opus
- Smart cache with TTL set by your retention policy
- Per-tenant cost caps with safe fallback paths
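Confidence-gated routing can be pictured as a simple cascade: try the cheapest tier first and escalate only when its confidence falls below a floor. The sketch below assumes each tier is a callable returning an answer plus a confidence score in [0, 1]. How that score is produced, the `route` function name, and the `0.8` floor are all hypothetical, and no real model API is called.

```python
from typing import Callable, Sequence, Tuple

# A tier is any callable that returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def route(prompt: str, tiers: Sequence[Tuple[str, Model]],
          min_confidence: float = 0.8) -> Tuple[str, str]:
    """Walk tiers from cheapest to most capable; return (tier_name, answer)."""
    if not tiers:
        raise ValueError("at least one tier is required")
    name, answer = "", ""
    for name, model in tiers:
        answer, confidence = model(prompt)
        # Stop at the first tier that is confident enough.
        if confidence >= min_confidence:
            break
    # If no tier cleared the floor, the last (most capable) tier's answer is used.
    return name, answer
```

With stub callables standing in for the small and frontier tiers, a prompt the small tier answers confidently never reaches the larger one, which is where the cost and latency savings come from.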
engagement.ship-it · 3–6 weeks
05
Production Integration
We tie the four pieces (eval suite, red-team gate, drift sentinel, cascade router) into the systems you already run. They land in your VPC, CI, monitoring, and on-call flow.
What ships
- Terraform or Pulumi modules for your cloud and identity stack
- Trace export to Datadog, Grafana, Honeycomb, or your warehouse
- Runbooks for on-call, with replay and rollback steps
engagement.bundle · by SOW
∑
Bundled or à la carte — your call.
Most teams start with one project, often eval or red-team, and expand once the gates are green. A few teams sign all five at once and stand up the whole reliability layer in one quarter.
Cadences we run
- Single project · 4–10 weeks · fixed scope
- Reliability program · 1–2 quarters · phased gates
- Embedded reliability lead · 6 months minimum