Production prompts need to be reliable, testable, and maintainable. Here's how to treat prompts as code with proper engineering practices.
Prompts are code
I treat every production prompt the way I treat a function. It has inputs, expected outputs, edge cases, and a contract with the rest of the system. The playground is fine for exploration, but the moment a prompt ships, it needs the same discipline I'd give any other piece of AI engineering work.
The teams I've seen struggle most are the ones still pasting prompts into a Slack thread, tweaking by feel, and shipping whatever last worked on a single happy-path example. That approach scales for a demo. It does not scale for a product.
Structure beats cleverness
In my experience, the prompts that hold up in production are boring. They have a clear system role, an explicit task definition, structured input, structured output, and a small set of grounded examples. Nothing magical.
I keep a consistent skeleton across projects:
- Role: who the model is acting as and the tone it should hold
- Task: what to do, in plain language
- Context: retrieved or supplied data, clearly delimited
- Constraints: rules, refusals, output format
- Examples: a few representative input and output pairs
- Output contract: the schema the downstream code expects
When I separate these sections with XML-style tags or markdown headings, the model follows them more reliably and the prompt becomes easier for humans to diff in code review.
Versioning and storage
Prompts change. They need to live in version control next to the code that calls them. I prefer one prompt per file, named after the operation it performs, with a short header comment explaining intent and known failure modes.
For larger systems I move to a small in-house prompt registry. Each prompt has a stable ID, a current version, and a history. The application calls a function like getPrompt("classify-intent", "v7") and the wiring is invisible to the rest of the codebase. This makes rollback trivial when a "harmless" tweak quietly tanks accuracy.
Inputs deserve as much care as the prompt
Most production prompt failures I have debugged are not prompt failures. They are input failures. Truncated context, mis-ordered chunks, an unescaped quote that breaks JSON parsing, a user message stuffed with prior conversation noise.
I sanitize and normalize every input before it touches the model. I cap each variable to a known token budget. I render the prompt through a templating layer that escapes hostile characters, and I log the final rendered prompt for the first few weeks of any new feature so I can actually see what the model saw.
Outputs need contracts
Free-form text is the wrong default for production. I ask for structured output whenever the downstream system expects a specific shape. JSON with a schema, function calls with typed arguments, or a constrained grammar.
Then I validate. Every response goes through a parser that either succeeds or triggers a retry with a corrective follow-up. After three failures I fall back to a degraded path rather than crashing the user flow. This pattern has saved me more incidents than I can count.
Testing without vibes
I write tests for prompts the same way I write tests for code. Each prompt has a small golden set of inputs with expected outputs or expected properties. I run the set on every change. I track pass rate over time. When pass rate drops, I investigate before merging.
For open-ended outputs I use rubric scoring. A second model evaluates whether the response meets a checklist of criteria. It is not perfect, but it is far better than reading samples by hand and pretending that counts as evaluation. I wrote more about this in my LLM evaluation framework piece.
Cost and latency are part of the contract
A prompt that costs ten cents and takes twelve seconds is a different product than one that costs a tenth of a cent and returns in eight hundred milliseconds. I budget both up front. I measure both in CI. I treat regressions in either as bugs.
Practical levers I reach for:
- Move static instructions into the system prompt so they cache
- Trim examples once the model has internalized the pattern
- Use a smaller model for the easy fraction of traffic and route the rest
- Stream responses so perceived latency drops even when total latency does not
Prompt injection is a real threat
If user input ever makes it into a prompt, assume someone will try to override your instructions. I never trust user-supplied text to stay in its lane. I keep system instructions and user content in clearly separated channels, I tell the model explicitly to ignore instructions inside user content, and I sanity-check outputs before acting on them, especially when the model can call tools.
For high-stakes flows I add a second pass that asks a model to check whether the response respects the original constraints. Belt and suspenders, but worth it.
How I roll out prompt changes
I treat a prompt change like a config change in a regulated system. Earlier in my career working on regulated platforms, I learned that small text edits can have outsized effects, so I keep the same habits here:
- Change one prompt at a time
- Run the eval suite
- Ship behind a flag to a small slice of traffic
- Compare quality, cost, and latency against the previous version
- Promote or roll back based on data, not feel
If you want help building this kind of workflow, that is exactly the kind of engagement I take on through start project.
The mindset shift
Prompt engineering for production is mostly about giving prompts the same respect as the rest of your system. Source control, tests, observability, contracts, rollback. Nothing exotic. The teams that internalize this ship calmly. The teams that do not spend their evenings hand-tuning strings while the on-call phone rings.
References
Tagged
Sri Vardhan
Independent technology studio of one. I help founders and small teams ship serious software without the consultancy overhead. More about me.