Most AI features work in demos. Few survive their first quarter in production. The gap is not intelligence — it is governance, evaluation, and the discipline to treat non-determinism as a first-class engineering concern.
Move guardrails out of prompts
The instinct is to pack safety, formatting, and business rules into the system prompt. That works until someone edits the prompt and silently breaks a compliance requirement.
The pattern that scales is an LLM gateway — a thin infrastructure layer between your application and the model provider. It centralises credentials, enforces rate limits, applies content moderation, and validates outputs with automatic fallbacks. Prompt logic stays focused on the task. Safety logic lives in infrastructure where it is auditable and version-controlled independently.
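A minimal sketch of the idea in Python, assuming the provider client is injected as a plain callable and using placeholder moderation and validation checks (a real gateway would call a moderation API and a proper schema validator):

```python
import time
from typing import Callable

class LLMGateway:
    """Thin layer between the application and the model provider (illustrative sketch)."""

    def __init__(self, call_model: Callable[[str], str], max_rps: float = 5.0):
        self._call_model = call_model        # provider client, injected; the gateway owns no prompt logic
        self._min_interval = 1.0 / max_rps   # crude client-side rate limit
        self._last_call = 0.0

    def _moderate(self, text: str) -> bool:
        # Placeholder: plug a real moderation endpoint or classifier in here.
        blocked_terms = ("ignore previous instructions",)
        return not any(term in text.lower() for term in blocked_terms)

    def _validate(self, output: str) -> bool:
        # Placeholder output contract: non-empty and under a length cap.
        return bool(output.strip()) and len(output) < 4000

    def complete(self, prompt: str, fallback: str = "Sorry, something went wrong.") -> str:
        if not self._moderate(prompt):
            return fallback
        wait = self._min_interval - (time.monotonic() - self._last_call)
        if wait > 0:
            time.sleep(wait)                 # enforce the rate limit
        self._last_call = time.monotonic()
        try:
            output = self._call_model(prompt)
        except Exception:
            return fallback                  # automatic fallback when the provider errors out
        return output if self._validate(output) else fallback
```

Because the safety checks live in one class, they can be reviewed, audited, and versioned independently of any individual prompt.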
Every AI project becomes an evaluation project
Classical unit tests assert exact outputs. LLMs are non-deterministic — you cannot assert that the answer will be exactly this string. Evaluation needs a different approach.
The practical pattern is LLM-as-a-Judge in CI/CD. A second model scores the primary model's output against a rubric. Every prompt change triggers a batch evaluation against a golden dataset. If quality regresses, the build fails — just like a broken test.
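A rough sketch of what that CI gate can look like, assuming a JSONL golden dataset with `input` and `reference` fields, a `generate` callable that runs the primary model under the candidate prompt, and a `judge_score` callable. All of these names are stand-ins rather than any particular framework, and the baseline number is illustrative:

```python
import json
import sys
from statistics import mean

BASELINE_SCORE = 0.82   # score of the currently deployed prompt (assumed, stored with the repo)
TOLERANCE = 0.02        # allowed regression before the build fails

def run_eval(generate, judge_score, golden_path="golden_dataset.jsonl"):
    """Score the candidate prompt on the golden dataset and fail CI on regression."""
    scores = []
    with open(golden_path) as f:
        for line in f:
            case = json.loads(line)                   # {"input": ..., "reference": ...}
            output = generate(case["input"])          # primary model under the new prompt
            scores.append(judge_score(output, case))  # second model scores against a rubric
    avg = mean(scores)
    print(f"mean judge score: {avg:.3f} (baseline {BASELINE_SCORE})")
    if avg < BASELINE_SCORE - TOLERANCE:
        sys.exit(1)   # quality regressed: break the build, just like a failing test
```

The baseline would live alongside the golden dataset so regressions are measured against the prompt currently in production rather than an arbitrary constant.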
Score across multiple dimensions: relevance, coherence, safety, faithfulness for RAG, and user satisfaction. A single accuracy number hides more than it reveals.
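The `judge_score` helper assumed above might look roughly like this, with `call_judge_model` standing in for the second model and the dimension floors being illustrative numbers. Per-dimension minimums keep a strong average from hiding a weak safety or faithfulness score:

```python
import json

DIMENSIONS = ("relevance", "coherence", "safety", "faithfulness")
MINIMUMS = {"safety": 0.9, "faithfulness": 0.8}   # hard floors per dimension (assumed values)

JUDGE_PROMPT = """You are grading a model response.
Question: {question}
Reference answer: {reference}
Response: {response}
Score each dimension from 0 to 1 and reply with JSON only:
{{"relevance": _, "coherence": _, "safety": _, "faithfulness": _}}"""

def judge_score(response, case, call_judge_model):
    """Return one gate-able score, but fail the case outright if any dimension drops below its floor."""
    raw = call_judge_model(JUDGE_PROMPT.format(
        question=case["input"], reference=case["reference"], response=response))
    scores = json.loads(raw)
    for dim, floor in MINIMUMS.items():
        if scores[dim] < floor:
            return 0.0                                # one bad dimension sinks the case
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
```

In practice you would bind `call_judge_model` with `functools.partial` before handing `judge_score` to the evaluation loop above.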
Treat prompts like deployable software
Prompts are not configuration. They are code that runs on a probabilistic runtime. Treat them accordingly.
- Semantic versioning. Major for structural changes or model swaps. Minor for new features. Patch for wording tweaks.
- Staged deployment. Dev → staging → production, with evaluation gates between each environment.
- Progressive rollout. Feature flags to expose new prompts to 5% of users first, then expand; a minimal version of that gating is sketched after this list. When OpenAI shipped a sycophancy-inducing prompt update to 100% of ChatGPT users without canary testing, social media became their alerting system.
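One way to wire the versioning and the rollout together, assuming prompts live in a version-keyed registry and users are bucketed by a hash of their id (the prompt texts, versions, and percentage here are purely illustrative):

```python
import hashlib

# Versioned prompt registry: each entry is immutable once published (assumed layout).
PROMPTS = {
    "2.1.0": "You are a support assistant. Answer concisely and cite the docs.",
    "2.2.0": "You are a support assistant. Answer concisely, cite the docs, and offer a next step.",
}

STABLE = "2.1.0"
CANDIDATE = "2.2.0"
ROLLOUT_PERCENT = 5   # expose the candidate prompt to 5% of users first

def prompt_for(user_id: str) -> str:
    """Deterministically bucket users so each user always sees the same prompt version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    version = CANDIDATE if bucket < ROLLOUT_PERCENT else STABLE
    return PROMPTS[version]
```

Hashing the user id keeps the assignment sticky, so a user does not bounce between prompt versions across requests, and widening the rollout is a one-line change.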
Avoid the God Prompt
A 2,000-token mega-prompt that tries to handle every edge case performs worse on common inputs than a focused, composable set of smaller prompts. When you feel the prompt growing, decompose it. Route different intents to specialised prompts. Each one stays testable and versioned independently.
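A sketch of that decomposition, assuming a cheap `classify_intent` step (a small model call or even a keyword classifier) and illustrative prompt contents:

```python
# Focused prompts per intent instead of one mega-prompt (contents are illustrative).
INTENT_PROMPTS = {
    "billing": "You handle billing questions. Be precise about amounts and dates.",
    "technical": "You handle technical troubleshooting. Ask for logs before guessing.",
    "general": "You handle general product questions. Keep answers short.",
}

def prompt_for_message(user_message: str, classify_intent) -> str:
    """Route to a specialised prompt; classify_intent is a small model call or keyword classifier."""
    intent = classify_intent(user_message)                        # e.g. returns "billing"
    return INTENT_PROMPTS.get(intent, INTENT_PROMPTS["general"])  # unknown intents fall back to general
```

Each entry in the registry can then be evaluated, versioned, and rolled out on its own schedule.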
Classical resilience still applies
LLM APIs go down. Latency spikes. Rate limits hit. Apply the same patterns you would to any distributed system:
- Circuit breakers that route to fallbacks (cached response, simpler model, graceful degradation UI) when failure rates spike; a sketch combining this with provider failover follows the list.
- Automatic model failover across providers — if one is down, route to another.
- Semantic caching for near-identical queries to reduce latency and cost.
- Never expose logic-layer LLMs directly to users. Add an intermediate layer that filters, validates, and transforms.
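A simplified sketch of the first two patterns together, assuming each provider is a plain callable and leaving a cached response as the final fallback (none of this is tied to a specific client library):

```python
import time

class CircuitBreaker:
    """Open the circuit after repeated failures; allow a retry after a cooldown (simplified)."""

    def __init__(self, max_failures: int = 5, cooldown: float = 30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def available(self) -> bool:
        if self.failures < self.max_failures:
            return True
        if time.monotonic() - self.opened_at > self.cooldown:
            self.failures = 0              # half-open: let one trial request through
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

def complete_with_failover(prompt, providers, breakers, cached_fallback):
    """Try each provider behind its own breaker; degrade to a cached response if all fail."""
    for name, call in providers:           # e.g. [("primary", call_a), ("backup", call_b)]
        breaker = breakers[name]
        if not breaker.available():
            continue
        try:
            result = call(prompt)
            breaker.record(True)
            return result
        except Exception:
            breaker.record(False)
    return cached_fallback                 # graceful degradation when every provider is down
```

The last line is where the intermediate layer matters: the user sees a cached or degraded response, never a raw provider error.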
The bottom line
The teams that succeed in production treat AI features with the same rigour as any distributed system — adapted for non-determinism. Infrastructure-level guardrails, evaluation as a CI/CD citizen, semantic versioning for prompts, progressive rollouts, and classical resilience patterns. None of this is novel. The discipline to actually do it is what separates demos from products.