An LLM Evaluation Framework That Works

How to systematically evaluate LLM applications with a practical framework covering automated metrics, human evaluation, and continuous monitoring.

If the first post in this series was about how to build a RAG system, this one is about the part most teams skip until they cannot: how to actually know whether your LLM application is good, and getting better.

The honest state of the industry is that most LLM applications are tested by their authors squinting at outputs and saying "yeah, that looks better." This is fine for prototypes. It is not fine for anything serving real users. Vibes do not catch regressions. Vibes do not survive a model upgrade. Vibes do not scale past one engineer.

This post is the framework I use to replace vibes with something measurable.

The shape of LLM evaluation

LLM evaluation is different from traditional software testing in three ways, and missing any of these makes the rest fail:

Outputs are non-deterministic. The same input can yield different outputs.
Correctness is often graded, not binary. A response can be partially right.
Ground truth is expensive. Labeling is human work, and human work is slow.

A good framework respects all three. A bad framework pretends one of them is not true.

Five layers of evaluation

I think about evaluation in five layers, from cheapest and fastest to most expensive and slowest. You want all of them, but they answer different questions.

Unit-style assertions

For things that are deterministic enough to be tested deterministically, do that. Format constraints, presence of required fields, schema validation, basic refusal behavior. These are essentially traditional unit tests over LLM outputs.

In practice, these catch a depressing number of bugs. If your application requires structured JSON, an assertion that json.loads succeeds is worth more than a hundred quality scores. If you are using structured outputs or function calling, lean on them.
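As a minimal sketch of what such an assertion can look like, the check below parses a raw model output and verifies a couple of required fields. The field names in REQUIRED_FIELDS are hypothetical, chosen only for illustration; swap in whatever your application actually expects.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def check_output(raw: str) -> list[str]:
    """Return a list of failure messages; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]

    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    failures = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            failures.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            failures.append(f"wrong type for {field}")
    return failures
```

Returning a list of failures rather than raising on the first one makes it easier to see every broken constraint in a single test run.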

Reference-based metrics

For tasks with a known correct answer, you can compare against it. Exact match for short answers, fuzzy match or semantic similarity for longer ones. BLEU, ROUGE, and similar metrics have well-known limitations, but they are useful as cheap regression detectors.

The trick is to not treat the score as ground truth. Treat it as a smoke alarm. A 5 percent drop in similarity is a signal to look, not a verdict.
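One way to wire up the smoke alarm, sketched with the standard library only. The character-level similarity from difflib stands in for whatever metric you actually use, and reading "5 percent" as a relative drop in the mean score is an assumption; the function names are illustrative.

```python
from difflib import SequenceMatcher
from statistics import mean


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def exact_match(output: str, reference: str) -> bool:
    """Exact match after normalization, suitable for short answers."""
    return normalize(output) == normalize(reference)


def fuzzy_score(output: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; a cheap stand-in for a real metric."""
    return SequenceMatcher(None, normalize(output), normalize(reference)).ratio()


def smoke_alarm(baseline_scores: list[float],
                candidate_scores: list[float],
                max_relative_drop: float = 0.05) -> bool:
    """Return True when the mean score fell by more than the allowed fraction.

    A True result means "go look at the outputs", not "the candidate is worse".
    """
    baseline = mean(baseline_scores)
    candidate = mean(candidate_scores)
    if baseline == 0:
        return False  # nothing to drop from
    return (baseline - candidate) / baseline > max_relative_drop
```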

LLM-as-judge

For tasks where reference-based metrics are weak (open-ended generation, summarization, helpfulness), a strong model can grade outputs against a rubric. This works better than people expect, and worse than vendors claim.

A few rules I follow:

Use a different model than the one being graded when you can afford to.
Constrain the rubric. Specific criteria with worked examples beat vague ones.
Calibrate against humans periodically. If the judge starts disagreeing with your humans on labeled examples, retire the prompt.
Use pairwise comparisons for ranking, not absolute scores. Models are better at "is A better than B" than "rate A from 1 to 5."
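The pairwise rule can be sketched as below. The judge is passed in as a plain callable, a hypothetical interface that takes a prompt and two answers and returns "A", "B", or "tie"; in a real setup it would wrap a model call with your rubric. Asking the judge twice with the answers swapped is my own addition here, a common guard against position bias rather than something the rules above require.

```python
from typing import Callable, Iterable

# Hypothetical judge interface: (prompt, answer_a, answer_b) -> "A", "B", or "tie".
Judge = Callable[[str, str, str], str]


def compare_pair(judge: Judge, prompt: str, baseline: str, candidate: str) -> str:
    """Ask the judge twice with the order swapped; count only consistent verdicts."""
    first = judge(prompt, baseline, candidate)   # baseline shown as A
    second = judge(prompt, candidate, baseline)  # candidate shown as A
    if first == "B" and second == "A":
        return "candidate"
    if first == "A" and second == "B":
        return "baseline"
    return "tie"  # explicit tie, or the verdict flipped with the ordering


def win_rate(judge: Judge, cases: Iterable[tuple[str, str, str]]) -> float:
    """Fraction of decided comparisons won by the candidate (0.5 if none decided).

    Each case is a (prompt, baseline_output, candidate_output) tuple.
    """
    results = [compare_pair(judge, p, b, c) for p, b, c in cases]
    decided = [r for r in results if r != "tie"]
    if not decided:
        return 0.5
    return decided.count("candidate") / len(decided)
```

A judge that always prefers whichever answer it sees first produces only ties under this scheme, which is the point of the swap.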

Human evaluation

The gold standard, but expensive. Used wisely, you do not need much of it. A small set of carefully chosen examples, reviewed by a knowledgeable human, gives you the calibration data that everything else depends on.

What I usually set up: a labeled test set of 100 to 300 examples representative of production traffic, with quality scores from at least one human. This becomes the anchor against which automated metrics are validated.
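To make "validated against the anchor" concrete, here is a small sketch that compares judge labels with human labels on the same examples. It reports raw agreement and Cohen's kappa, which corrects for agreement expected by chance. Using kappa is my suggestion rather than something this framework prescribes, and what counts as "too low" is yours to pick.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of examples where the judge label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement: (observed - expected) / (1 - expected)."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Probability that both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Tracking this number per judge prompt over time gives you a concrete trigger for retiring a prompt, instead of a general feeling that the judge has drifted.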

Production observability

The previous four layers happen pre-deployment. Production observability is what tells you whether the model is still good once real users arrive.

The minimum I want shipping with any LLM application:

Sample logging. Log a percentage of inputs, outputs, retrieved context, and latency.
User feedback hooks. A thumbs up or down, or a "report this answer" path. Use it.
Drift monitoring. Track distribution of inputs and outputs over time. Sudden changes mean either user behavior shifted or the model did.
Replay tooling. A way to take any production trace and rerun it through a candidate version.

Without replay, every "the AI got worse" report becomes an unwinnable argument. With it, you can prove or disprove the regression in minutes.
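A rough sketch of sample logging and replay, under several assumptions: traces are stored as JSON lines, the candidate version is a callable taking the input and retrieved context, and the 10 percent sample rate and all field names are made up for illustration.

```python
import hashlib
import json
from typing import Callable

SAMPLE_RATE = 0.10  # illustrative: log roughly 10 percent of requests


def should_log(request_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministic sampling: the same request id always gets the same decision."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(rate * 2**64)


def make_trace(request_id: str, user_input: str, context: list[str],
               output: str, latency_ms: float) -> str:
    """Serialize everything needed to rerun the request later."""
    return json.dumps({
        "request_id": request_id,
        "input": user_input,
        "context": context,
        "output": output,
        "latency_ms": latency_ms,
    })


def replay(trace_line: str, candidate: Callable[[str, list[str]], str]) -> dict:
    """Rerun a logged request through a candidate version and compare outputs."""
    trace = json.loads(trace_line)
    new_output = candidate(trace["input"], trace["context"])
    return {
        "request_id": trace["request_id"],
        "old_output": trace["output"],
        "new_output": new_output,
        "changed": new_output != trace["output"],
    }
```

Logging the retrieved context alongside the input is what makes replay meaningful: it lets you rerun the generation step in isolation, so a difference in output cannot be blamed on retrieval returning something else that day.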

How the layers fit together

In a healthy team, the layers run on different cadences:

Every PR. Unit-style assertions and reference-based metrics on the labeled test set.
Every release candidate. LLM-as-judge runs against the test set, plus a fresh sample of production traffic.
Weekly. Human review of a small batch of production samples. Calibrate the judge.
Continuously. Production observability and drift monitoring.

Most teams I have worked with try to skip straight to the third layer, find it expensive, and abandon evaluation entirely. The trick is to start at the first layer and earn the right to the more expensive ones as the system matures.

A note on benchmarks

Public benchmarks are useful for choosing a model. They are almost never useful for deciding whether your application is good. Build a private test set that reflects your actual users. In the projects I have seen, a curated set of 200 real, anonymized questions beats every off-the-shelf benchmark for product decisions.

What this looks like in practice

The teams that ship reliable LLM features all have some version of this framework. The teams that ship features that quietly degrade do not. The difference is not talent. It is the willingness to invest in plumbing before there is a fire.

If you are putting together evaluation for a real system and want a second pair of eyes, /services lists the engagements where this is the bulk of the work. The next post in the Production AI series goes deeper on prompt engineering as code, which is the discipline that makes all of this maintainable.

