On Operations · Essay

What a production-ready AI system actually requires.

The gap between a system that works in testing and one that holds up under real usage — lessons from building across GTM, support, and marketing.

By Cydni · 7 min read

Getting a prototype to work in a demo is the easiest part of building an AI system. The hardest part isn't the model, the data, or even the prompt. It's the slow, unglamorous work of turning something that usually behaves into something you can run without holding your breath.

Most teams underestimate the distance between those two states by an order of magnitude.

The test environment lies to you

When you test a new AI system, you test with the examples you thought of. Users bring the examples you didn't. They phrase things in ways that break your assumptions, feed in formats your parser didn't anticipate, and ask follow-up questions that expose a gap in your context. The system doesn't crash. It just starts being subtly wrong.

Subtle wrongness is the failure mode that matters. A crash you can alert on. A confidently incorrect answer to a customer question — that one takes weeks to surface, and by then it's shaped someone's opinion of your product.

Production isn't a bigger version of testing. It's a different problem.

What "production-ready" actually requires

Across the systems I've put into production — SMS-based accountability programs, two-phase LLM generation pipelines, multi-market pricing lookups — the checklist has converged to something like this:

1. Observability before scale

Every model call logged with inputs, outputs, latency, and cost. Not because you'll look at all of them — because when something breaks, you need the trail to actually exist. Retrofitting this is ten times the work of building it in from the start.
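A minimal sketch of that kind of logging, assuming a synchronous model callable; the wrapper name, log format, and file path are illustrative, not any particular provider's API:

```python
import json
import time
import uuid


def logged_call(model_fn, prompt, log_path="model_calls.jsonl", **params):
    """Wrap a model call so every invocation leaves a trail on disk."""
    record = {"id": str(uuid.uuid4()), "prompt": prompt, "params": params}
    start = time.monotonic()
    try:
        output = model_fn(prompt, **params)
        record["output"] = output
        # provider-reported token counts and cost would be merged into
        # the record here, in whatever shape your provider returns them
        return output
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        record["latency_s"] = round(time.monotonic() - start, 3)
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")
```

The point isn't this exact shape — it's that the record is written unconditionally, success or failure, so the trail exists before you need it.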

2. A fallback for every brittle dependency

Model timeout. API rate limit. Malformed response. For each, what does the system do? "Throw an error to the user" is not an acceptable answer in a product that runs unattended.
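One way to sketch the answer, assuming a model callable and a caller-supplied validity check; the retry policy, exception types, and fallback message are illustrative stand-ins for whatever your stack actually raises:

```python
import time

FALLBACK_REPLY = "We couldn't process that just now. A human will follow up."


def call_with_fallback(model_fn, validate, prompt, retries=2, base_delay=1.0):
    """Retry transient failures with backoff; degrade gracefully otherwise."""
    for attempt in range(retries + 1):
        try:
            raw = model_fn(prompt)
            if validate(raw):  # a malformed response counts as a failure
                return raw
        except (TimeoutError, ConnectionError):
            pass  # model timeout or rate limit: worth retrying
        time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    # out of retries: return something safe, never a raw stack trace
    return FALLBACK_REPLY
```

Three failure modes, one explicit answer for each — and the unattended path always ends in something a user can read.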

3. Evals that match what you actually care about

Not generic benchmarks. A set of real inputs drawn from production, graded by someone who knows what "good" looks like for your use case. Run them on every prompt change. If you can't produce a pass/fail number, you can't tell whether your changes are improvements or regressions.
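The harness can be almost embarrassingly small — a sketch, assuming each eval case pairs a real input with a grader function written by whoever knows what "good" looks like:

```python
def run_evals(model_fn, cases):
    """cases: list of (input, grader) pairs.

    Each grader returns True if the output is acceptable for that input.
    Returns the pass rate and the failing (input, output) pairs.
    """
    failures = []
    for text, grade in cases:
        output = model_fn(text)
        if not grade(output):
            failures.append((text, output))
    passed = len(cases) - len(failures)
    return passed / len(cases), failures
```

Run it before and after every prompt change; the pass rate is the pass/fail number, and the failures list tells you exactly which production inputs regressed.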

4. Someone responsible for drift

Models update. Providers deprecate versions. User behavior shifts. Without a named owner watching for drift, you discover it when a customer complains. With one, you find it first.
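Part of that watching can be automated: a scheduled job that reruns the eval suite against a recorded baseline and pages the owner when the score slips. A sketch, with the suite runner, baseline, and alert channel all as stand-ins:

```python
def drift_check(run_suite, baseline, tolerance=0.05, alert=print):
    """Rerun the eval suite; alert if the pass rate slips below baseline."""
    score = run_suite()
    if score < baseline - tolerance:
        alert(f"Eval pass rate {score:.2f} fell below baseline {baseline:.2f}")
        return False
    return True
```

The automation finds the slip; the named owner is still who decides what to do about it.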

5. A plan for the bad day

The model provider has an outage. What happens? Your bill triples overnight from a traffic spike. What happens? Someone publishes a screenshot of a weird output on social. What happens? Each of these has happened to teams I've worked with. The ones that handled it well had already thought about it.

The unsexy truth

Most of what separates a production-ready AI system from a demo isn't AI work at all. It's the operational scaffolding — logging, evals, fallbacks, ownership — that any mature software system needs. The difference is that AI systems fail in ways that are harder to spot, so skipping the scaffolding costs you faster.

The fastest teams I've worked with aren't the ones that skip this. They're the ones that build the scaffolding reflexively, so it takes them an afternoon instead of a quarter.

Thanks for reading.
