Working-code-first

Production ML guides built on reproducible code.

Technical writing and open-source tooling for engineers running LLMs, vector databases, and evaluation harnesses in real systems.

180 reproducible guides since 2021

What practitioners say

Three engineers who use our writing and tooling in their day jobs.

Their evaluation harness gates every weekly model promotion we ship on our platform. Their retrieval rigor guides stopped two regressions before they hit production payments traffic this past quarter.

Daniel Okafor

Staff ML engineer at a fintech platform, Berlin

We adopted their retrieval benchmarks for our vendor selection, and the transparent methodology made the call straightforward. Their reproducible repositories meant we could rerun every comparison on our own logistics data.

Hannah Beck

Applied scientist at a logistics company, Munich

Their deployment guides are required reading on our team before we ship any new inference path to customers. The operational notes on observability saved us from rolling out a broken cache layer last month.

Tomasz Kowal

Founding engineer at an AI startup, Amsterdam

What we cover

Six technical areas where production ML usually breaks, and the patterns that hold up under real load.

Vector databases

Comparison guides, indexing tradeoffs, and retrieval benchmarks rerun on independent hardware before any results ship.

Evaluation harnesses

Open-source packages for gating model promotions, with 94,000 monthly installs across our test and benchmark tooling.

Model lifecycle

Deployment patterns for inference, rollback, and feature versioning, written by engineers who run these systems daily.

LLM operations

Operational realities of running LLMs in production, including cost accounting, latency budgets, prompt regression testing, and incident review.

Transparent benchmarks

Every benchmark we publish ships with raw data and a reproducible repository so any reader can verify the numbers.

Practical deployment

Code-first walkthroughs for shipping retrieval, embeddings, and inference paths to production without breaking the weekly release.

Ship ML systems that hold up in production.

Working code, transparent benchmarks.