Evaluating LLM Outputs Automatically When You Have No Ground Truth
You've shipped a RAG pipeline or a summarization feature and someone asks, "How do we know the model is doing a good job?" You reach for a benchmark dataset and realize you don't have one. Building ground-truth labels takes weeks, domain experts cost money, and your product is already in front of users. You need a way to measure quality now.
This is the normal situation. Most LLM evaluation literature assumes clean labeled data, but in practice you're flying without that net. The good news: there are principled techniques that give you a real signal without a single hand-labeled example.
What you'll learn
- Why traditional metrics fail for open-ended LLM outputs
- How to use another LLM as an automated judge
- How consistency and self-consistency checks surface quality problems
- How to design rubric-based scoring that your team can actually interpret
- How to detect output degradation in production without labeled data
Why Traditional Metrics Break Down
Metrics like BLEU or ROUGE were designed for tasks with a narrow set of correct answers, such as machine translation, where "the cat sat on the mat" is close to "a cat sat on the mat." For open-ended generation, a response can be completely correct and score near zero against a reference that happens to use different phrasing.
Accuracy isn't applicable when there's no label. Perplexity tells you something about fluency but nothing about factual correctness or helpfulness. You need metrics that capture what actually matters to your users: is the answer accurate, relevant, complete, and not harmful?
The LLM-as-Judge Pattern
The most practical technique available right now is using a capable LLM to evaluate the output of another LLM (or the same one). The core idea: write a structured prompt that defines your quality criteria and asks the judge model to score or classify a given output.
Here's a minimal working example using the Anthropic API:
import anthropic
import json
client = anthropic.Anthropic()
JUDGE_PROMPT = """
You are an impartial evaluator. Given a user question and a model response,
score the response on the following criteria. Return a JSON object only.
Criteria:
- relevance: Does the response directly address the question? (1-5)
- accuracy: Is the factual content plausible and internally consistent? (1-5)
- completeness: Does it cover the key aspects the question requires? (1-5)
- conciseness: Is it appropriately brief without omitting important detail? (1-5)
User question: {question}
Model response: {response}
Return ONLY a JSON object like:
{{"relevance": 4, "accuracy": 3, "completeness": 5, "conciseness": 4, "rationale": "brief explanation"}}
"""
def evaluate_response(question: str, response: str) -> dict:
    prompt = JUDGE_PROMPT.format(question=question, response=response)
    message = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    # The judge is instructed to return a JSON object only, so parse it directly
    raw = message.content[0].text.strip()
    return json.loads(raw)
# Example usage
result = evaluate_response(
    question="What is gradient descent?",
    response="Gradient descent is an optimization algorithm that adjusts model parameters by moving in the direction of steepest decrease of the loss function."
)
print(result)
# {"relevance": 5, "accuracy": 5, "completeness": 3, "conciseness": 5, "rationale": "Accurate but doesn't mention learning rate or iteration."}
A few things to notice: the rubric is explicit, the output format is constrained to JSON, and you're capturing a rationale, which is useful for debugging when scores drop.
Choosing Your Judge Model
Use a model that is at least as capable as the model you're evaluating. Using a weaker model as a judge introduces systematic blind spots: it can't recognize errors it would make itself. If you're evaluating GPT-4-class outputs, you want a GPT-4-class or better judge. Claude 3 Opus and GPT-4o are common choices for this role.
Also be aware of self-serving bias. A model asked to judge its own outputs will rate them higher than a different model would. When possible, use a different model family as judge than the one generating answers.
Consistency Checks: Asking the Same Question Different Ways
If a model genuinely knows something, it should give consistent answers when you rephrase the question. If it's guessing or hallucinating, rephrasing tends to produce different or contradictory outputs. This is the principle behind self-consistency evaluation.
import anthropic
client = anthropic.Anthropic()
def generate_paraphrases(question: str, n: int = 3) -> list[str]:
    """Use an LLM to paraphrase the original question."""
    prompt = f"""Generate {n} different phrasings of this question.
Return one per line, no numbering.
Question: {question}"""
    msg = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    return [line.strip() for line in msg.content[0].text.strip().split("\n") if line.strip()]
def check_consistency(original_question: str, get_answer_fn) -> dict:
    paraphrases = generate_paraphrases(original_question)
    answers = [get_answer_fn(q) for q in [original_question] + paraphrases]
    # Ask the judge whether the answers agree in substance
    comparison_prompt = f"""Here are {len(answers)} answers to semantically equivalent questions.
Do they agree in substance? Reply with 'consistent', 'partially_consistent', or 'inconsistent',
then a brief explanation.

Answers:
""" + "\n---\n".join(answers)
    msg = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        messages=[{"role": "user", "content": comparison_prompt}]
    )
    return {"verdict": msg.content[0].text.strip(), "paraphrases_tested": paraphrases}
Inconsistency doesn't automatically mean the answer is wrong, but it's a reliable flag for human review. Track the inconsistency rate over time: a sudden spike often means a prompt regression or a model update has changed behavior.
Rubric-Based Scoring at Scale
A rubric is just a written specification of what "good" looks like for your specific use case. The power of rubric-based evaluation is that it converts a fuzzy human judgment into a repeatable automated check.
Different tasks need different rubrics. A customer support bot rubric should probably include "tone" and "policy compliance." A code generation rubric should weight "correctness" and "security" heavily. A summarization rubric should focus on "faithfulness": does the summary introduce claims not in the source?
Designing a Rubric
Start by having two or three people on your team independently rate 20–30 outputs on dimensions that feel relevant. Then look for where they disagree. Those disagreements are where your rubric definitions are vague. Tighten the language until inter-rater agreement is high, then encode that language into your judge prompt. Here's an example rubric for a customer support task:
{
  "task": "customer_support_response",
  "dimensions": [
    {
      "name": "policy_compliance",
      "description": "Does the response follow company policy? 1 = violates policy, 5 = fully compliant",
      "weight": 0.4
    },
    {
      "name": "helpfulness",
      "description": "Does the response resolve or meaningfully progress the user's issue? 1 = unhelpful, 5 = fully resolves",
      "weight": 0.4
    },
    {
      "name": "tone",
      "description": "Is the tone professional and empathetic? 1 = rude or cold, 5 = warm and professional",
      "weight": 0.2
    }
  ]
}
Once your rubric is stable, the judge prompt just embeds it verbatim. The weighted composite score lets you track a single number over time without losing dimensional detail.
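As a sketch of how that composite might be computed, assuming the judge returns 1-5 scores per dimension (as in the judge prompt earlier), the rubric weights sum to 1.0, and the JSON above has been loaded into a rubric dict:

def composite_score(rubric: dict, scores: dict) -> float:
    """Weighted composite of per-dimension judge scores, normalized to 0-1.

    Assumes each dimension is scored 1-5 and the rubric weights sum to 1.0.
    """
    total = 0.0
    for dim in rubric["dimensions"]:
        normalized = (scores[dim["name"]] - 1) / 4  # map 1-5 onto 0-1
        total += dim["weight"] * normalized
    return round(total, 3)

# Example with the customer support rubric above
scores = {"policy_compliance": 5, "helpfulness": 4, "tone": 3}
# composite_score(rubric, scores) -> 0.8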
Reference-Free Faithfulness for RAG Systems
If you're running a retrieval-augmented system, you have a built-in ground truth you're probably ignoring: the retrieved documents themselves. You don't know the perfect answer, but you do know what source material the model had access to. You can ask a judge whether the response is faithful to those sources.
FAITHFULNESS_PROMPT = """
You are evaluating whether a model response is supported by the provided context.
Context (what the model was given):
{context}
Model response:
{response}
For each factual claim in the model response, determine whether it is:
- Supported: directly stated or clearly implied by the context
- Unsupported: not present in the context (may still be true, but not grounded here)
- Contradicted: conflicts with the context
Return a JSON object:
{{"supported_claims": [...], "unsupported_claims": [...], "contradicted_claims": [...], "faithfulness_score": 0.0}}
Faithfulness score = supported / (supported + unsupported + contradicted), rounded to 2 decimal places.
"""
A faithfulness score below roughly 0.7 usually indicates the model is hallucinating beyond its context. Track this metric per retrieval source or per query category to find where your retrieval is failing rather than your generation model.
Detecting Degradation in Production
Evaluation isn't just a pre-launch concern. Models get updated, prompts drift, data distributions shift. You need continuous monitoring even when you still have no labeled data.
The practical approach: run a fixed set of synthetic "probe" queries through your system on a schedule and evaluate them automatically. These aren't real user queries; they're designed to test specific capabilities or edge cases you care about.
- Canary queries: Questions with known correct answers from widely accepted sources (e.g., "What does HTTP stand for?"). Any wrong answer is an immediate alert.
- Adversarial probes: Queries designed to elicit refusals, hallucinations, or policy violations if something goes wrong.
- Consistency probes: Paraphrase pairs where you track the consistency rate over time.
- Score distribution monitoring: Even if individual scores are noisy, a shift in the distribution of rubric scores across real traffic is a reliable degradation signal.
Store every probe result with a timestamp and model/prompt version. You want to correlate score changes with deployment events.
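What that storage might look like in its simplest form, with field names chosen for illustration rather than taken from any particular tool:

import datetime
import json

def record_probe_result(probe_id: str, category: str, scores: dict,
                        model_version: str, prompt_version: str,
                        log_path: str = "probe_results.jsonl") -> None:
    """Append one probe evaluation to a JSONL log so score changes can be
    correlated with deployment events later. Field names are illustrative."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "probe_id": probe_id,
        "category": category,          # "canary", "adversarial", "consistency", ...
        "scores": scores,              # rubric scores or a simple pass/fail
        "model_version": model_version,
        "prompt_version": prompt_version,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")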
Common Pitfalls to Avoid
Position bias in pairwise comparisons. If you ask a judge to pick the better of two responses (A vs B), LLM judges systematically prefer whichever response appears first. Always run pairwise comparisons in both orders and average the results, or switch to absolute scoring with rubrics instead.
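A sketch of the both-orders mitigation, reusing the client from earlier; this variant treats disagreement between the two orders as a tie rather than averaging numeric scores:

def judge_pairwise(question: str, first: str, second: str) -> str:
    """Ask the judge which of two responses is better; returns 'first' or 'second'.
    Deliberately minimal prompt, for illustration only."""
    prompt = (f"Question: {question}\n\nResponse 1:\n{first}\n\nResponse 2:\n{second}\n\n"
              "Which response answers the question better? Reply with exactly '1' or '2'.")
    msg = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=8,
        messages=[{"role": "user", "content": prompt}]
    )
    return "first" if msg.content[0].text.strip().startswith("1") else "second"

def debiased_preference(question: str, response_a: str, response_b: str) -> str:
    """Run the comparison in both presentation orders; only trust an agreement."""
    forward = judge_pairwise(question, response_a, response_b)   # A shown first
    backward = judge_pairwise(question, response_b, response_a)  # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # the judge's preference flipped with order: no reliable winner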
Rubric gaming. A sufficiently capable model being evaluated will produce outputs that score well on your rubric without actually being useful. This happens when your rubric rewards surface features (length, structure) over substance. Keep your rubric anchored to concrete user outcomes, not format signals.
Judge model sycophancy. Some judge models tend to rate outputs as better when they're longer, more confident, or contain more caveats. Test this by deliberately submitting a verbose but wrong answer and checking whether it scores lower than a concise correct one.
Single-metric collapse. Optimizing a single aggregate score across all your dimensions will cause the model to trade off dimensions you actually care about. Report dimensional scores separately in your dashboards, not just the composite.
Treating automated scores as ground truth. Automated evaluation gives you signal, not truth. Sample real outputs regularly and have a human verify that your automated scores track what humans would actually care about. If they drift apart, retune your rubric.
Putting It Together: A Minimal Evaluation Pipeline
Here's the shape of a production-ready no-ground-truth evaluation loop:
- For every LLM response, run it through your rubric-based judge and store the dimensional scores and rationale.
- For RAG responses, additionally compute a faithfulness score against the retrieved context.
- Run consistency checks on a random 5–10% sample of traffic by paraphrasing the input and comparing outputs.
- Run your canary query suite every 6 hours. Alert on any wrong answer or a composite score drop of more than a defined threshold versus the rolling baseline.
- Weekly: pull a random sample of 20–30 real responses, have a team member rate them manually, and compute correlation with your automated scores (see the sketch below). If correlation drops, revisit your rubric.
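For that weekly agreement check, a Spearman rank correlation between human and automated composite scores is usually enough; this sketch assumes the two score lists are in matching order and uses scipy for the calculation:

from scipy.stats import spearmanr

def judge_human_agreement(automated: list[float], human: list[float]) -> float:
    """Spearman rank correlation between automated and human composite scores.
    Values near 1.0 mean the judge ranks outputs roughly the way humans do."""
    corr, _pvalue = spearmanr(automated, human)
    return corr

# Illustrative weekly sample of five responses scored both ways
print(judge_human_agreement([0.8, 0.55, 0.9, 0.4, 0.7],
                            [0.75, 0.6, 0.95, 0.35, 0.65]))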
Wrapping Up
Automated evaluation without ground truth is not a compromise; it's the right approach for most production LLM systems. Ground-truth datasets go stale, are expensive to build, and rarely cover the full distribution of real queries anyway. A well-designed rubric plus an LLM judge, combined with consistency checks and production monitoring, gives you a durable quality signal that improves over time.
Concrete next steps:
- Pick one output type your system produces today and write a 3–5 dimension rubric for it. Have two people apply it manually to 20 examples and check that their ratings agree.
- Implement a basic LLM-as-judge prompt using the rubric. Run it against the same 20 examples and compare the automated scores to your manual ones.
- If you're running a RAG system, add faithfulness scoring this week. It's the cheapest high-signal metric you're probably not collecting.
- Set up a canary query suite with at least 10 probe questions and schedule it to run daily against your production endpoint.
- Pick a model version and freeze it as your judge. Document the judge model, judge prompt version, and rubric version together so you can reproduce historical scores later.