Production AI is mostly workflow design

The obsession with model intelligence misreads where production AI actually succeeds or fails. Across government, enterprise, and consumer work, the wins came from orchestration, retrieval, evaluation, and fallback handling, not from a smarter model.

20 May 2025 · 5 min read

The model is the smallest part of a production AI system that works. Most of the engineering that decides whether the thing is trusted in production lives in the plumbing around the model: what you retrieve before you call it, how you check what comes back, what happens when it fails, and who has to approve the result. Swap a good model for a great one and the system gets marginally better. Get the workflow wrong and the best model on the market still ships you something nobody will rely on.

I have built this across three settings that could not be less alike, and the lesson held in all of them.

Government: the model interprets, the engine decides

At VicRoads I worked on pricing custom number plates, a product sitting on a P&L north of $100M and serving more than six million drivers. The hard part is that a requested plate has meaning. “GOAT” is worth more than “X7QJZ”, and a person can see why instantly. The naive design asks a model to read the plate and name a price.

That design is unshippable in government, and the reason is the whole point of this essay. A price that a model produced cannot be explained, cannot be audited, and cannot be defended when a customer or a minister asks why this plate cost that much. So the workflow splits the job. The model interprets meaning: it reads the requested string and classifies it into features a pricing engine can use. A deterministic engine then sets the price from those features. The model never touches the number. Prices stay explainable because a rule produced them, and the model contributes the one thing rules are bad at, which is reading intent out of arbitrary text. None of that reliability comes from model intelligence. It comes from where the boundary sits in the workflow.
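
The split can be sketched in a few lines of Python. This is an illustration of the boundary, not the VicRoads system: the feature names, the word list standing in for the model call, and every dollar figure are assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlateFeatures:
    """Hypothetical features the interpreter hands to the pricing engine."""
    is_recognisable_word: bool
    length: int


# Hypothetical rule table. All amounts are invented for illustration.
BASE_PRICE = 500
WORD_PREMIUM = 2000
SHORT_PLATE_PREMIUM = 1000  # applies to plates of four characters or fewer


def interpret(plate: str) -> PlateFeatures:
    """Stand-in for the model call: a tiny word list instead of a classifier."""
    known_words = {"GOAT", "BOSS"}
    return PlateFeatures(
        is_recognisable_word=plate.upper() in known_words,
        length=len(plate),
    )


def price(features: PlateFeatures) -> tuple[int, list[str]]:
    """Deterministic engine: the same features always give the same price,
    and every component of the price is recorded as a reason."""
    total = BASE_PRICE
    reasons = [f"base price {BASE_PRICE}"]
    if features.is_recognisable_word:
        total += WORD_PREMIUM
        reasons.append(f"recognisable word +{WORD_PREMIUM}")
    if features.length <= 4:
        total += SHORT_PLATE_PREMIUM
        reasons.append(f"short plate +{SHORT_PLATE_PREMIUM}")
    return total, reasons


print(price(interpret("GOAT")))   # (3500, ['base price 500', 'recognisable word +2000', 'short plate +1000'])
print(price(interpret("X7QJZ")))  # (500, ['base price 500'])
```

The point of the shape is that `interpret` can be swapped for any model without touching `price`, and the list of reasons is what gets shown when someone asks why a plate cost what it did.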

Enterprise: retrieval and evaluation are the product

At Brand Ninja the model generated brand content for serious accounts, the kind that close six-figure contracts. The quality a customer experiences is mostly upstream and downstream of generation, not in it.

Upstream is retrieval. A model with no access to a brand’s guidelines, prior campaigns, and tone will produce generic, off-brand content. The work that moved quality was assembling the right context before the call: what this brand sounds like, what they have published, what they have rejected. Get retrieval right and an average model writes on-brand. Get it wrong and a frontier model writes fluent nonsense.
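
A minimal sketch of that assembly step, assuming the brand material has already been fetched from wherever it lives. `BrandContext`, `build_prompt`, and the prompt layout are hypothetical names and choices for this example, not Brand Ninja's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class BrandContext:
    """Hypothetical container for what retrieval found about a brand."""
    voice: str
    published: list[str] = field(default_factory=list)
    rejected: list[str] = field(default_factory=list)


def build_prompt(brief: str, ctx: BrandContext, max_examples: int = 2) -> str:
    """Put brand voice, published examples, and rejected examples in front
    of the brief, so the model sees the brand before it sees the task."""
    lines = [f"Brand voice: {ctx.voice}"]
    lines += [f"Published example: {text}" for text in ctx.published[:max_examples]]
    lines += [f"Rejected, do not imitate: {text}" for text in ctx.rejected[:max_examples]]
    lines.append(f"Brief: {brief}")
    return "\n".join(lines)


ctx = BrandContext(
    voice="plain, dry, no exclamation marks",
    published=["Shipping is free over $50."],
    rejected=["AMAZING deals await!!!"],
)
print(build_prompt("Announce the winter range.", ctx))
```

The cap on examples is there because context is a budget: which two examples make it in is a retrieval decision, and it matters more than the wording of the instruction around them.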

Downstream is evaluation. You cannot ship generated content to an enterprise account on the assumption it is fine. You need an evaluation loop that scores output against the brand’s constraints, flags what fails, and routes it for human review before it goes near a customer. The evaluation harness is what lets you change a prompt or a model and know within minutes whether you broke something, rather than finding out when an account complains. It is unglamorous and it is most of the reliability.
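
A stripped-down version of such a loop might look like the following. The two checks (banned phrases and a word limit) are assumptions chosen to keep the example short; a real harness would carry many more, and this sketch takes one reading of the routing, where only failing output goes to a human.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Constraints:
    """Hypothetical brand constraints for the example."""
    banned_phrases: tuple[str, ...]
    max_words: int


def evaluate(text: str, constraints: Constraints) -> list[str]:
    """Return a list of failures. An empty list means every check passed."""
    failures = []
    lowered = text.lower()
    for phrase in constraints.banned_phrases:
        if phrase.lower() in lowered:
            failures.append(f"banned phrase: {phrase}")
    word_count = len(text.split())
    if word_count > constraints.max_words:
        failures.append(f"too long: {word_count} words, limit {constraints.max_words}")
    return failures


def route(text: str, constraints: Constraints) -> str:
    """Anything with a failure goes to a human before it goes near a customer."""
    return "human_review" if evaluate(text, constraints) else "passed"


def regression_report(outputs: list[str], constraints: Constraints) -> dict[str, int]:
    """Run a fixed set of outputs through the checks after a prompt or model change."""
    report = {"passed": 0, "human_review": 0}
    for text in outputs:
        report[route(text, constraints)] += 1
    return report
```

Run `regression_report` over the same saved cases before and after a change, and a jump in the `human_review` count is the signal that something broke.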

Consumer: fallbacks are the experience

At hey anna, solo-built and bootstrapped, the discipline is sharpest because there is no team to absorb a failure. The model can be slow, can rate-limit, can return something malformed, can be wrong. A consumer product that assumes none of that happens is a demo. The workflow has to answer every one of those cases: retry with backoff, degrade to a smaller path, surface a clear state instead of a spinner that never resolves, and never present a fabricated answer as a real one. The “analyst, not chatbot” promise is kept by fallback handling as much as by anything the model does, because an analyst you cannot trust when the data is thin is not an analyst.
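
Those cases can be sketched as one wrapper around the model call. The names `Result` and `call_with_fallback`, the three status values, and the retry defaults are assumptions for this example rather than hey anna's code. The broad `except Exception` is also a simplification: a real system would catch only the errors it knows how to recover from.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Result:
    status: str            # "ok", "degraded", or "unavailable"
    answer: Optional[str]  # None when there is no real answer to show


def call_with_fallback(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Result:
    """Retry the primary path with exponential backoff, then degrade to a
    smaller path, then report a clear unavailable state with no answer."""
    for attempt in range(retries):
        try:
            return Result("ok", primary())
        except Exception:
            # Wait before the next attempt, but not after the last one.
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    try:
        return Result("degraded", fallback())
    except Exception:
        # Never invent an answer: the caller gets an explicit state instead.
        return Result("unavailable", None)
```

The interface renders from `status`, so a degraded or unavailable result becomes a visible state rather than an endless spinner, and `answer` is only ever text that some path actually produced.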

The parts that actually move reliability

Strip the three settings down and the same components carry the weight:

Component	What it does	What breaks without it
Orchestration	sequences steps, decides what runs when	one giant prompt doing five jobs badly
Retrieval	assembles the right context before the call	confident, fluent, generic wrong answers
Evaluation	scores output, catches regressions	you find out it broke when a customer does
Approvals	puts a human on irreversible actions	the system ships mistakes at machine speed
Fallbacks	handles failure, timeout, malformed output	a demo that fails on the first malformed response

A smarter model improves the text inside each box. It does not build the boxes, and it does not connect them. That connective work is ordinary software engineering applied to a probabilistic component, and it is where production AI is won.
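
The approvals row is the easiest of these to show as ordinary software. Here is a minimal sketch of a gate on irreversible actions, with `Action`, `execute`, and the status strings all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Action:
    name: str
    irreversible: bool


def execute(action: Action, run: Callable[[], None], approved_by: Optional[str] = None) -> str:
    """Run reversible actions immediately. Hold irreversible ones until a
    named human has approved them."""
    if action.irreversible and approved_by is None:
        return "pending_approval"
    run()
    return "done"
```

Nothing in the gate depends on how good the model is. It only depends on each action being labelled honestly as reversible or not, which is a design decision made before any request arrives.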

Designing the pieces the model composes

There’s a shift inside orchestration worth naming, because it changes what the design work is. The traditional shape is a linear workflow: you decide the steps and their order in advance, step one feeds step two feeds step three, and the path is fixed before any request arrives. That is still the right answer when the path must run the same way every time, as in the VicRoads pricing split, which has to interpret first and price second on every request to stay auditable.

The newer paradigm inverts it. Instead of designing the sequence, you design the tools and let the model compose the workflow on the fly. It decides, for the request in front of it, which tools to call and in what order, assembling them like pieces of a puzzle to fit a task you never explicitly wired. The design work moves from drawing the path to shaping the pieces: each tool a clean, well-bounded interface to a more complex system underneath, a query engine, a retrieval index, a pricing service, an external API. The model never sees the mess behind the tool. It sees a handle it can pull.
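
In code, the inversion is small. The sketch below assumes the model returns a plan as a list of tool name and argument pairs; the two tools, their names, and the plan format are invented for illustration, and the lambdas stand in for the real systems behind each handle.

```python
from typing import Callable

# Hypothetical tool registry: each entry is a narrow handle on a larger system.
TOOLS: dict[str, Callable[[str], str]] = {
    "search_index": lambda query: f"results for {query}",
    "get_price": lambda item: f"price of {item}",
}


def run_plan(plan: list[tuple[str, str]]) -> list[str]:
    """Execute the tool calls in the order the model chose.
    The code fixes which tools exist, not the sequence they run in."""
    outputs = []
    for tool_name, argument in plan:
        if tool_name not in TOOLS:
            # Refuse anything outside the designed surface.
            raise ValueError(f"unknown tool: {tool_name}")
        outputs.append(TOOLS[tool_name](argument))
    return outputs


# In a real system this plan would come from the model, per request.
plan_from_model = [("get_price", "plate GOAT"), ("search_index", "similar plates")]
print(run_plan(plan_from_model))
```

The registry is the whole design surface here: the model can order and combine the tools however the request needs, but it cannot reach anything that was not deliberately exposed.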

That relocates the hard part rather than removing it. Designing a good tool surface is its own discipline: a tool that is ambiguous, leaky, or quietly does three things gets composed into confident nonsense, the same way a bad function signature breeds bugs downstream. Shape the pieces well, each one legible and honest about what it does, and the model covers paths you could never have enumerated by hand. Most real systems settle into both at once: a deterministic spine for the parts that must not vary, and composable tools at the edges for the parts where the range of tasks is too wide to wire in advance.

This is the substrate beneath the surface. The companion argument is that the model should be a dumb renderer at the point it touches the user; this is what has to be true underneath for that surface to hold. The render is only as trustworthy as the workflow feeding it. Spend your effort on the workflow, and the model you already have is usually good enough.