DEV Community

klement Gunndu

LLM-as-a-Judge: Evaluate Your Models Without Human Reviewers

Human evaluation is the gold standard for LLM output quality. It is also the bottleneck that kills every scaling plan.

One human reviewer processes 50-100 examples per hour. A single model comparison across 1,000 test cases takes 10-20 hours of human labor. Run that across 5 metrics and 3 model candidates, and you are looking at weeks of work before you ship anything.

LLM-as-a-Judge solves this. You use a capable model to evaluate the outputs of another model — scoring relevance, faithfulness, coherence, or any custom criteria you define. Research shows well-configured LLM judges achieve roughly 85% agreement with human reviewers — higher than the typical 81% agreement rate between two human raters on the same task. Not perfect. But 1,000x faster and consistent enough to catch regressions before humans need to look.

Here are 3 patterns for implementing LLM-as-a-Judge in Python, from raw API calls to production-grade frameworks.

Pattern 1: Raw LLM-as-a-Judge With the OpenAI SDK

Before reaching for a framework, understand the core mechanism. LLM-as-a-Judge is a structured prompt that asks one model to score another model's output.

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class EvalResult(BaseModel):
    score: int
    reasoning: str

def judge_output(
    question: str,
    answer: str,
    criteria: str = "relevance and accuracy",
) -> EvalResult:
    """Use an LLM to evaluate another LLM's output."""
    response = client.chat.completions.parse(
        model="gpt-4o",
        response_format=EvalResult,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert evaluator. Score the answer "
                    "on a scale of 1-10 based on the given criteria. "
                    "Provide chain-of-thought reasoning before scoring."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Criteria: {criteria}\n\n"
                    f"Question: {question}\n\n"
                    f"Answer: {answer}\n\n"
                    "Evaluate this answer. Return your reasoning "
                    "and a score from 1-10."
                ),
            },
        ],
    )
    return response.choices[0].message.parsed

Use it like this:

result = judge_output(
    question="What causes a Python deadlock?",
    answer="A deadlock occurs when two threads each hold a lock the other needs.",
    criteria="technical accuracy and completeness",
)
print(f"Score: {result.score}/10")
print(f"Reasoning: {result.reasoning}")

This is the foundation. Every framework builds on this exact pattern: structured prompt, scoring rubric, chain-of-thought reasoning.

Three things make this raw approach work:

  1. Structured output — Pydantic enforces the response schema. No regex parsing.
  2. Chain-of-thought — The judge reasons before scoring. This reduces score variance by forcing the model to justify its decision.
  3. Explicit criteria — The rubric tells the judge what to measure. Vague criteria produce vague scores.

The limitation: you build everything yourself. Threshold logic, test orchestration, batch evaluation, metric aggregation — all manual. That is where frameworks help.
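That missing layer is mostly glue code, though. Here is a minimal sketch of batch evaluation with a pass threshold — the `judge_fn` parameter and the 7.0 cut-off are illustrative assumptions, and `EvalResult` is redefined as a plain dataclass so the snippet stands alone:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    score: int       # 1-10, as returned by the judge
    reasoning: str

def run_eval_batch(
    cases: list[dict],
    judge_fn: Callable[[str, str], EvalResult],
    pass_threshold: float = 7.0,
) -> dict:
    """Judge every (question, answer) pair and aggregate pass/fail."""
    results = [judge_fn(c["question"], c["answer"]) for c in cases]
    passed = sum(r.score >= pass_threshold for r in results)
    n = len(results)
    return {
        "total": n,
        "passed": passed,
        "pass_rate": passed / n if n else 0.0,
        "mean_score": sum(r.score for r in results) / n if n else 0.0,
    }
```

Plug `judge_output` from above in as `judge_fn` and you have a batch runner. Frameworks add everything else on top of this: caching, async execution, and reporting.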

Pattern 2: DeepEval's GEval for Custom Metrics

DeepEval (v3.8+, as of March 2026) implements LLM-as-a-Judge through GEval — a metric class that generates evaluation steps from natural language criteria, then scores outputs using chain-of-thought.

Install it:

pip install -U deepeval

Set your API key (DeepEval uses OpenAI models as the default judge):

export OPENAI_API_KEY="your_api_key"

Build a custom coherence metric:

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval

coherence_metric = GEval(
    name="Coherence",
    criteria=(
        "Coherence - the collective quality of all sentences "
        "in the actual output. Sentences should flow logically, "
        "maintain consistent terminology, and build on each other."
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.7,
)

test_case = LLMTestCase(
    input="Explain gradient descent in simple terms.",
    actual_output=(
        "Gradient descent is an optimization algorithm. "
        "It finds the minimum of a function by iteratively "
        "moving in the direction of steepest descent. "
        "Think of it as a ball rolling downhill — it naturally "
        "settles at the lowest point."
    ),
)

coherence_metric.measure(test_case)
print(f"Score: {coherence_metric.score}")
print(f"Reason: {coherence_metric.reason}")

GEval does three things behind the scenes:

  1. Converts your criteria string into numbered evaluation steps using chain-of-thought prompting.
  2. Runs those steps against the test case.
  3. Returns a normalized score (0-1) and a natural language reason.

The threshold parameter sets the minimum passing score. Below 0.7 and the test case fails — useful for CI pipelines where you want hard pass/fail gates.

Combining Multiple Metrics

Real evaluation needs multiple dimensions. Score relevance, faithfulness, and coherence together:

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

relevancy = AnswerRelevancyMetric(threshold=0.7)
faithfulness = FaithfulnessMetric(threshold=0.7)

test_case = LLMTestCase(
    input="What are the side effects of gradient clipping?",
    actual_output=(
        "Gradient clipping prevents exploding gradients by capping "
        "the gradient norm. Side effects include slower convergence "
        "when the clip threshold is too aggressive, and potential "
        "loss of gradient direction information."
    ),
    retrieval_context=[
        "Gradient clipping caps gradient norms to prevent exploding "
        "gradients. Setting the threshold too low can slow convergence. "
        "Clipping by norm preserves direction better than clipping by value."
    ],
)

results = evaluate(
    test_cases=[test_case],
    metrics=[relevancy, faithfulness, coherence_metric],
)

AnswerRelevancyMetric checks whether the output actually answers the question. It needs input and actual_output in the test case.

FaithfulnessMetric checks whether the output is grounded in the provided context — critical for RAG systems. It requires retrieval_context as a list of strings.

The evaluate() function runs all metrics against all test cases and returns a structured results object. Run this in CI with deepeval test run test_eval.py and you get pass/fail status on every commit.

Pattern 3: Pairwise Comparison — Which Output Is Better?

Single-score evaluation has a known weakness: score drift. A judge model might score "7/10" differently across runs. Pairwise comparison eliminates this by asking a simpler question — "Which output is better?"

from openai import OpenAI
from pydantic import BaseModel, Field
from enum import Enum

client = OpenAI()

class Winner(str, Enum):
    A = "A"
    B = "B"
    TIE = "TIE"

class PairwiseResult(BaseModel):
    winner: Winner
    reasoning: str
    confidence: float = Field(ge=0.0, le=1.0)

def compare_outputs(
    question: str,
    output_a: str,
    output_b: str,
    criteria: str = "accuracy, completeness, and clarity",
) -> PairwiseResult:
    """Compare two LLM outputs and pick the better one."""
    response = client.chat.completions.parse(
        model="gpt-4o",
        response_format=PairwiseResult,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are an expert evaluator comparing two answers. "
                    "Evaluate based on the given criteria. Be specific "
                    "about WHY one answer is better. If both are equally "
                    "good, say TIE."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Criteria: {criteria}\n\n"
                    f"Question: {question}\n\n"
                    f"Answer A: {output_a}\n\n"
                    f"Answer B: {output_b}\n\n"
                    "Which answer is better? Return the winner, "
                    "reasoning, and your confidence level (0-1)."
                ),
            },
        ],
    )
    return response.choices[0].message.parsed

Use pairwise comparison to evaluate model upgrades:

result = compare_outputs(
    question="How does backpropagation work?",
    output_a="Backpropagation computes gradients using the chain rule.",
    output_b=(
        "Backpropagation computes gradients of the loss function "
        "with respect to each weight by applying the chain rule "
        "backwards through the network layers. Each layer's gradient "
        "depends on the gradient of the layer above it, propagated "
        "through the activation function's derivative."
    ),
    criteria="technical depth and educational value",
)
print(f"Winner: {result.winner}")
print(f"Confidence: {result.confidence}")
print(f"Why: {result.reasoning}")

Pairwise comparison is how model leaderboards work. Chatbot Arena uses this exact approach with human judges. Replacing humans with LLM judges gives you the same ranking signal at a fraction of the cost.
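Those pairwise verdicts turn into a ranking with a standard Elo update. A minimal sketch — the K-factor of 32 and the 400-point scale are conventional defaults, not something Chatbot Arena prescribes:

```python
def elo_update(
    rating_a: float,
    rating_b: float,
    winner: str,
    k: float = 32.0,
) -> tuple[float, float]:
    """Update two Elo ratings after one comparison.
    winner is "A", "B", or "TIE"."""
    # Expected score for A given the current rating gap
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    actual_a = {"A": 1.0, "B": 0.0, "TIE": 0.5}[winner]
    new_a = rating_a + k * (actual_a - expected_a)
    new_b = rating_b + k * ((1.0 - actual_a) - (1.0 - expected_a))
    return new_a, new_b
```

Feed every judged comparison through `elo_update` and sort models by final rating — that is the entire aggregation step behind a leaderboard.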

Mitigating Position Bias

LLM judges tend to prefer the first answer they see. This is called position bias. Fix it by running each comparison twice with swapped positions:

def compare_with_debiasing(
    question: str,
    output_a: str,
    output_b: str,
    criteria: str = "accuracy, completeness, and clarity",
) -> PairwiseResult:
    """Run pairwise comparison twice with swapped order."""
    result_ab = compare_outputs(question, output_a, output_b, criteria)
    result_ba = compare_outputs(question, output_b, output_a, criteria)

    # If both agree on the same winner, the result is reliable
    if result_ab.winner == Winner.A and result_ba.winner == Winner.B:
        return result_ab  # Both say output_a is better
    if result_ab.winner == Winner.B and result_ba.winner == Winner.A:
        return result_ab  # Both say output_b is better

    # Disagreement — call it a tie
    return PairwiseResult(
        winner=Winner.TIE,
        reasoning="Position bias detected: results flipped with order.",
        confidence=0.5,
    )

When the judge picks A in one ordering and B in the other, the comparison is unreliable. Defaulting to TIE prevents position bias from contaminating your results. This adds one extra API call per comparison — a small cost for eliminating a systematic error.
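Across a whole test set, the debiased verdicts reduce to a simple win-rate summary. A minimal sketch that takes the winner labels ("A", "B", "TIE") as plain strings:

```python
from collections import Counter

def summarize_comparisons(winners: list[str]) -> dict:
    """Summarize debiased pairwise verdicts.
    A's win rate is computed over decisive comparisons only."""
    counts = Counter(winners)
    decisive = counts["A"] + counts["B"]
    return {
        "a_wins": counts["A"],
        "b_wins": counts["B"],
        "ties": counts["TIE"],
        "a_win_rate": counts["A"] / decisive if decisive else 0.5,
    }
```

A win rate meaningfully above 0.5 across a few hundred examples is a real signal; a rate near 0.5 with many ties usually means the two models are indistinguishable on your criteria.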

When to Use Each Pattern

Pattern | Best For | Trade-Off
Raw LLM-as-a-Judge | Quick prototypes, custom criteria | You build the infrastructure
DeepEval GEval | CI pipelines, regression testing | Requires an OpenAI API key for the default judge
Pairwise comparison | Model selection, A/B testing | 2x API cost (with debiasing), no absolute score

The three-layer stack that works in production:

  1. DeepEval in CI — Run AnswerRelevancyMetric and FaithfulnessMetric on every commit. Catch regressions automatically.
  2. Pairwise comparison for model upgrades — When evaluating a new model, run debiased pairwise comparison against your current model on 200-500 representative examples.
  3. Human review for edge cases — Sample 5-10% of LLM-judged results for human validation. Track judge-human agreement over time. If agreement drops below 75%, recalibrate your rubrics.

LLM-as-a-Judge does not replace human evaluation. It replaces the 90% of human evaluation that is repetitive scoring against known rubrics. The remaining 10% — ambiguous cases, novel failure modes, ethical edge cases — still needs a human.

Key Takeaways

LLM-as-a-Judge works because classifying content is simpler than generating it. A model that struggles to write a perfect explanation can still tell you which of two explanations is better.

Start with Pattern 1 to understand the mechanics. Move to Pattern 2 when you need CI integration. Use Pattern 3 when comparing models or prompts.

The metric that matters most: judge-human agreement rate. Measure it. If your LLM judge agrees with human reviewers less than 75% of the time on your specific task, your rubric needs work — not your judge model.
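Tracking that agreement is little more than a paired comparison over your human-reviewed sample. A minimal sketch, assuming both sets of verdicts have already been reduced to pass/fail booleans:

```python
def agreement_rate(
    judge_labels: list[bool],
    human_labels: list[bool],
) -> float:
    """Fraction of examples where the LLM judge and the human
    reviewer reached the same pass/fail verdict."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two equal-length, non-empty label lists")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)
```

Compute this on the 5-10% human-reviewed sample each cycle; a downward trend is your early warning that the rubric has drifted from what reviewers actually value.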


Follow @klement_gunndu for more machine learning content. We're building in public.

Top comments (7)

Apex Stack

This connects directly to a problem I've been wrestling with — evaluating AI-generated content at scale, not just code outputs.

I run a 100k+ page multilingual site where a local LLM generates stock analysis across 12 languages. The evaluation challenge is identical to what you describe: human review doesn't scale past a few hundred pages, but the quality signal matters enormously (Google rejected 51,000 pages as "crawled, not indexed" because the content passed structural checks but lacked real quality).

Your three-pattern progression maps almost perfectly to content evaluation:

Pattern 1 (raw judge) — I use this for factual accuracy: does the generated analysis match the actual financial data from the API? Narrow, verifiable criteria with ground truth. Works well.

Pattern 2 (GEval-style metrics) — This is where it gets interesting for content. I'd want custom metrics like "investment insight density" (does this analysis tell you something you can't get from just reading the numbers?) and "differentiation from template" (how much does this page feel unique vs. every other stock page?). The threshold approach would let me auto-flag batches that fall below quality.

Pattern 3 (pairwise) — The position bias debiasing is something I hadn't considered applying to content. I've been doing A/B comparisons manually between template versions, but running debiased pairwise on "old template vs. new template" across 200 sample pages would give me statistically meaningful signals before deploying template changes to 8,000+ pages.

The 85% judge-human agreement stat is key context. For my use case, I'd accept even 75% — because the alternative is reviewing 0.1% of pages manually and hoping the sample is representative.

Question: have you seen any work on LLM-as-a-Judge for multilingual evaluation? My biggest gap is quality assessment for non-English outputs where the judge model itself may have weaker comprehension of the target language.

klement Gunndu

Your mapping of the three patterns to content evaluation is sharp — especially "investment insight density" as a GEval metric. That is exactly the kind of domain-specific criteria that makes GEval outperform generic scoring. Google rejecting 51k pages despite passing structural checks is a textbook case for semantic quality judges.

On multilingual LLM-as-a-Judge — this is an active research area with real gaps:

Cross-language consistency is still weak. MM-Eval (multilingual meta-evaluation benchmark, 18+ languages) found LLM judges show poor cross-language consistency — Fleiss' Kappa around 0.3 across 25 languages. The judge is not equally reliable across languages.

Translationese bias is a documented problem. Recent research shows LLM judges tend to favor machine-translated content over human-authored text, even when the translation is semantically flawed. This is worse in low-resource languages — which could silently inflate your quality scores for generated content in those languages.

Checklist-based judging transfers better across languages. CE-Judge uses engineered checklists per evaluation dimension, and this approach handles multilingual better than open-ended scoring prompts. For your use case, language-specific checklists ("Does the analysis reference the correct currency?", "Are financial terms translated vs. transliterated correctly?") would likely outperform a single multilingual prompt.

For 12 languages at your scale, consider running the judge in English (strongest comprehension) with structured extraction from the target language. Extract factual claims, translate evaluation criteria, judge the extracted structure. You lose some nuance but gain consistency across all 12 languages.

The debiased pairwise approach across 200 sample pages before deploying template changes to 8k+ pages is a strong workflow — that gives you statistically meaningful signal at manageable cost.

Apex Stack

The translationese bias point is a wake-up call I needed. My entire content pipeline is essentially "machine-translated" — Llama 3 generating directly in Dutch, German, Polish, etc. If LLM judges favor that machine-generated style over human-authored text, I could be getting artificially high quality scores on my worst content. That's exactly the kind of silent failure that compounds at scale across 8,000+ tickers.

The 0.3 Fleiss' Kappa finding from MM-Eval actually validates something I've been seeing empirically. My current quality checks (basic structural validation — does the page have the right sections, are financial numbers present) pass at roughly the same rate across all 12 languages. But when I manually spot-check, the quality gap between Dutch and Turkish pages is enormous. A Kappa of 0.3 explains why — the judge literally can't maintain consistent standards across languages.

Your suggestion to run the judge in English with structured extraction is pragmatic and I think that's the right first move. Extract the factual claims (ticker, market cap, P/E ratio, dividend yield) and structural elements into a language-agnostic format, then judge that. I already have the ground truth data in Supabase — so the "extract and compare" step is mostly plumbing, not ML.

The checklist-based approach maps perfectly to financial content. "Does the analysis reference the correct currency?" is exactly the kind of question where Dutch pages should say EUR for Euronext stocks, not default to USD because the model's training data is English-heavy. I can enumerate maybe 15-20 of these verifiable checks per language and catch the worst failures without needing a subjective quality model at all.

The 200-sample pairwise approach for template changes is smart — I'm going to steal that. Right now I deploy template changes to all 96K pages at once and hope for the best. Running a debiased comparison on 200 stratified samples (across languages, market caps, sectors) before full deploy would catch the regressions I currently find out about three days later from GSC data. The cost of 400 judge calls is trivial compared to the cost of degrading 96K pages.

klement Gunndu

The translationese bias concern is well-placed, and your empirical observation confirms the research. If your structural validation passes at the same rate across all 12 languages but manual spot-checks reveal a significant quality gap between Dutch and Turkish pages, you are seeing exactly what the 0.3 Fleiss' Kappa predicts — the judge is not maintaining consistent standards across languages, and structural checks are masking semantic quality differences.

Your plan to extract factual claims into a language-agnostic format and judge that in English is the right first move. Since you already have ground truth in Supabase, the pipeline becomes: extract structured claims from generated text → compare against ground truth → flag mismatches. That sidesteps the multilingual judge problem entirely for the factual accuracy dimension.

The 15-20 verifiable checks per language approach is strong. Currency matching (EUR for Euronext, not USD), correct exchange naming, proper date formatting per locale — these are all deterministic checks that do not need an LLM judge at all. A rules engine with those 15-20 checks will catch the worst failures faster and more reliably than any quality model, and it scales to 96k pages trivially.
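A couple of those deterministic checks, sketched in Python — the `page` field names and the exchange-to-currency table are hypothetical, not from any real pipeline:

```python
# Hypothetical deterministic checks for generated stock pages.
# The page dict shape (ticker, exchange, text) is an assumption.
EXCHANGE_CURRENCY = {"Euronext": "EUR", "NASDAQ": "USD", "XETRA": "EUR"}

def check_currency(page: dict) -> bool:
    """The page must quote prices in its exchange's currency."""
    expected = EXCHANGE_CURRENCY.get(page["exchange"])
    return expected is not None and expected in page["text"]

def check_ticker_present(page: dict) -> bool:
    """The generated analysis must mention its own ticker."""
    return page["ticker"] in page["text"]
```

Rules like these run in microseconds, produce zero false positives on what they cover, and leave the LLM judge for the genuinely subjective dimensions.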

The stratified 200-sample pairwise comparison before full template deployment is the kind of process that pays for itself immediately. Three days of degraded GSC data across 96k pages costs far more than 400 judge calls. The stratification across languages, market caps, and sectors is key — a template change that works well for large-cap US equities might break for small-cap European stocks where the data density is thinner.

klement Gunndu

The Dutch vs Turkish quality gap you're observing is a strong signal that structural validation is masking semantic failures — and that's exactly where the checklist approach will give you the biggest lift.

Your plan to start with 15-20 verifiable checks per language is the right granularity. One implementation detail worth front-loading: stratify your 200-sample pairwise comparisons not just by language and sector, but by data sparsity. Tickers with thin analyst coverage (smaller European markets) tend to trigger more hallucinated reasoning from the LLM, and those are the pages where checklist catches diverge most from subjective quality scores. You'll likely find your worst content isn't in the weakest language — it's in the weakest data coverage within each language.

The Supabase ground truth comparison is clean. Extract → compare against source → flag divergence. That pipeline alone will probably catch more quality issues than any judge model, especially for numerical claims.

klement Gunndu

The checklist approach for financial content is the right call — 15-20 verifiable checks per language catches the worst failures without needing subjective quality scoring at all. Currency defaulting to USD on Euronext stocks is exactly the kind of systematic error that a simple rule catches better than any LLM judge. The 200-sample stratified pairwise comparison before full deploy is sound practice. Stratify across languages, market caps, and sectors so you catch the regressions that only appear in specific intersections — a template change might work fine for large-cap USD stocks but break formatting for small-cap EUR tickers.

klement Gunndu

Multilingual stock analysis across 12 languages is a killer use case for this — the judge prompt basically becomes your quality rubric per language, and you can catch hallucinated financial data that human reviewers in every locale could never review at scale.