Three tiers of AI, one orchestration layer

Production AI is not about which model you pick. It is about how you compose three tiers of capability — light, mid, and heavy — into a system that does useful work without spending the budget on the wrong tier. Here is the pattern we use.

The three tiers

Tier 1 — Listeners.

Lightweight, fast, cheap models. Their job is to recognise: is this event something we care about? Does this message contain an intent? Is this image a duplicate of one we have already seen? Listeners run on every event, which can mean ten thousand times an hour, so they have to cost almost nothing per call. They identify which tier-2 worker, if any, an event needs; they do not do the work themselves.

Tier 2 — Workers.

Mid-weight models tuned to a specific task. Summarise this conversation. Extract these fields. Generate this routine answer. Workers are the bulk of the AI compute in most systems. They handle the well-shaped tasks that come up often. They are good enough — almost always — and fast.

Tier 3 — Reasoners.

The heaviest models, reserved for the calls that need them. Multi-step reasoning, novel problem-solving, the messy cases that the worker tier refuses. Reasoners are slow and expensive. You want to use them sparingly. The system's design problem is to make sure they only run when they should.

The orchestration layer

The orchestration layer is what ties the three tiers together. It is not an AI itself — it is plain old software that knows which tier to invoke for which case, when to escalate from a worker to a reasoner, and how to handle the failures.

Three responsibilities matter most:

Routing. Given an event from a tier-1 listener, decide which tier-2 worker should handle it. This is mostly classification — a small lookup table or a tier-1 model that has been trained on prior cases.
Escalation. When a worker returns low confidence or refuses, escalate to a reasoner. Define what "low confidence" means in your domain — usually a calibrated probability threshold or an explicit refusal token.
Audit. Every call at every tier is logged with the prompt, the response, the model, the tokens, and the timing. This is the spine of any later evaluation work. If you skip this, you cannot improve the system; you can only replace it.
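
The three responsibilities above can be sketched in a few dozen lines of ordinary Python. This is a minimal illustration, not a reference implementation: the names Result, AuditRecord, Orchestrator and CONFIDENCE_THRESHOLD are hypothetical, the models are plain callables standing in for real model clients, the 0.8 threshold is an arbitrary placeholder, and the token count is a crude whitespace approximation.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical value; "low confidence" has to be defined per domain.
CONFIDENCE_THRESHOLD = 0.8


@dataclass
class Result:
    text: Optional[str]     # None when the model refused
    confidence: float       # assumed to be a calibrated probability in [0, 1]
    refused: bool = False   # explicit refusal signal from the worker


@dataclass
class AuditRecord:
    tier: int
    model: str
    prompt: str
    response: Optional[str]
    tokens: int
    seconds: float


# A model is any callable from prompt to Result (stand-in for a real client).
Model = Callable[[str], Result]


@dataclass
class Orchestrator:
    workers: Dict[str, Model]   # routing table: listener label -> tier-2 worker
    reasoner: Model             # single tier-3 fallback
    audit_log: List[AuditRecord] = field(default_factory=list)

    def _call(self, tier: int, name: str, model: Model, prompt: str) -> Result:
        """Invoke a model and log the call, whatever the outcome."""
        start = time.perf_counter()
        result = model(prompt)
        elapsed = time.perf_counter() - start
        # Whitespace splitting is only an approximation of a token count.
        tokens = len(prompt.split()) + len((result.text or "").split())
        self.audit_log.append(
            AuditRecord(tier, name, prompt, result.text, tokens, elapsed)
        )
        return result

    def handle(self, label: str, prompt: str) -> Result:
        """Route on the listener's label; escalate on refusal or low confidence."""
        worker = self.workers.get(label)
        if worker is not None:
            result = self._call(2, label, worker, prompt)
            if not result.refused and result.confidence >= CONFIDENCE_THRESHOLD:
                return result
        # Assumption: an unknown label goes straight to the reasoner.
        return self._call(3, "reasoner", self.reasoner, prompt)
```

The design choice worth noting is that logging lives inside the single _call path, so no tier can be invoked without leaving an audit record.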

The system's job is not to be intelligent. It is to invoke intelligence economically.

Why this matters

Without tiering, two failure modes dominate. Either every call hits the heaviest model — fast budget burn, slow responses, and a system that feels expensive — or every call hits a lightweight model that misses the hard cases. The tiered design avoids both by treating the model layer as a portfolio: a few heavy calls, many light calls, and a routing layer that knows the difference.
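
To make the portfolio argument concrete, here is a small worked calculation. Every number in it is invented for illustration: the per-call prices, the share of events that reach a worker, and the escalation rate are assumptions, not measurements. Only the figure of ten thousand events an hour is borrowed from the listener description above.

```python
# Hypothetical per-call prices in dollars; real prices vary by provider and model.
COST_PER_CALL = {"listener": 0.0001, "worker": 0.002, "reasoner": 0.05}

EVENTS_PER_HOUR = 10_000


def tiered_cost(events: int, worker_share: float, escalation_rate: float) -> float:
    """Hourly cost when every event hits a listener, a share of events
    reaches a worker, and a share of worker calls escalates to a reasoner."""
    listener_calls = events
    worker_calls = events * worker_share
    reasoner_calls = worker_calls * escalation_rate
    return (
        listener_calls * COST_PER_CALL["listener"]
        + worker_calls * COST_PER_CALL["worker"]
        + reasoner_calls * COST_PER_CALL["reasoner"]
    )


def all_heavy_cost(events: int) -> float:
    """Hourly cost when every event is sent to the heaviest model."""
    return events * COST_PER_CALL["reasoner"]


if __name__ == "__main__":
    # Assumed mix: 30% of events need a worker, 5% of worker calls escalate.
    print(f"tiered:    {tiered_cost(EVENTS_PER_HOUR, 0.30, 0.05):.2f}")
    print(f"all heavy: {all_heavy_cost(EVENTS_PER_HOUR):.2f}")
```

Under these assumed numbers the tiered design costs about 14.50 an hour against 500.00 for the all-heavy baseline. The exact ratio depends entirely on your own prices and escalation rate, which is why both are parameters rather than constants.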

The patterns we use

Always start with a listener.

Even when the system only has one tier-2 worker today, build a listener in front of it. The listener will earn its keep the moment you add a second worker.

Define refusal explicitly.

Workers should refuse the cases they should not handle. The refusal is the escalation signal. Without explicit refusal tokens, either the worker guesses (bad) or the orchestration layer cannot tell what happened (worse).
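
One way to make refusal explicit is to parse every worker output into one of two types before anything else looks at it. The sketch below assumes the worker has been prompted or tuned to emit a sentinel string when it declines; the token "<REFUSE>" and the names Answer, Refusal, parse_worker_output and should_escalate are all hypothetical.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical sentinel; use whatever token your worker is set up to emit.
REFUSAL_TOKEN = "<REFUSE>"


@dataclass(frozen=True)
class Answer:
    text: str


@dataclass(frozen=True)
class Refusal:
    reason: str


def parse_worker_output(raw: str) -> Union[Answer, Refusal]:
    """Turn raw worker text into an explicit Answer or Refusal."""
    stripped = raw.strip()
    if stripped.startswith(REFUSAL_TOKEN):
        reason = stripped[len(REFUSAL_TOKEN):].strip() or "unspecified"
        return Refusal(reason)
    if not stripped:
        # Assumption: empty output counts as a refusal, so the
        # orchestration layer never has to guess what happened.
        return Refusal("empty output")
    return Answer(stripped)


def should_escalate(outcome: Union[Answer, Refusal]) -> bool:
    """A refusal is the escalation signal."""
    return isinstance(outcome, Refusal)
```

For example, parse_worker_output("<REFUSE> needs legal review") yields a Refusal with the reason "needs legal review", while ordinary text yields an Answer.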

Cache the deterministic results.

Tier-1 and tier-2 results for identical inputs are often identical or close enough. Cache them. The hit rate is often higher than you would expect, and every hit skips a model call entirely, so the latency reduction comes almost for free.
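
A minimal in-memory version of such a cache might look like the sketch below. The class name ResultCache and its methods are hypothetical. The key normalises whitespace and case before hashing, which is one possible reading of "close enough" and would be wrong for case-sensitive tasks. There is no eviction or expiry, which a production cache would need.

```python
import hashlib
from typing import Callable, Dict


class ResultCache:
    """In-memory cache keyed on model name plus normalised prompt."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model_name: str, prompt: str) -> str:
        # Assumption: collapsing whitespace and lowercasing is an acceptable
        # notion of "same input" for the task at hand.
        normalised = " ".join(prompt.split()).lower()
        # The model name is part of the key so models never share entries.
        payload = f"{model_name}\x00{normalised}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def get_or_call(
        self, model_name: str, prompt: str, model: Callable[[str], str]
    ) -> str:
        """Return the cached result, or call the model once and store it."""
        key = self._key(model_name, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = model(prompt)
        self._store[key] = result
        return result
```

Tracking hits and misses on the cache itself gives you the hit rate directly, rather than leaving it to be estimated.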

Evaluate per tier.

Each tier needs its own evaluation, and the hardest evaluation problems live in tier 3. Most production tier-3 evaluations are conducted by humans: small panels, narrow rubrics, structured comparisons. The orchestration layer, through its audit log, feeds the panel the right cases.
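
As a rough sketch of how logged calls could be handed to a review panel, the function below draws a reproducible random sample of tier-3 calls. The LoggedCall record and the sample_for_panel function are hypothetical, and uniform random sampling is only one possible selection policy; a real system might prefer to oversample escalations or disputed cases.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class LoggedCall:
    tier: int
    prompt: str
    response: str


def sample_for_panel(log: List[LoggedCall], k: int, seed: int = 0) -> List[LoggedCall]:
    """Pick up to k tier-3 calls for human review.

    A fixed seed keeps the sample reproducible, so two reviewers
    can be given exactly the same cases.
    """
    tier3 = [call for call in log if call.tier == 3]
    rng = random.Random(seed)
    return rng.sample(tier3, min(k, len(tier3)))
```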

Where this falls down

Two cases. Open-ended creative work, where the task itself is the heavy reasoning and tiering buys nothing: you just need the heaviest model. And small-volume systems, where the cost savings on a hundred calls a day do not justify the overhead of building three tiers.

Closing

The interesting design problem in production AI is not which model is best. It is how to compose the tiers and write the routing rules that decide which to invoke. The orchestration layer is the new architecture. The models, in a year, will be a commodity. The orchestration will not.