LLM Eval Tooling

Why eval is non-negotiable

Without eval, every change to your prompt, retrieval pipeline, or model is a gamble. Eval turns LLM development from "vibes-based" to "data-driven": you can measure whether a change improved answer quality, detect regressions before they reach users, and compare models objectively. Eval is not a one-time step; it runs continuously as part of CI/CD.

The cost of skipping eval is invisible at first -- outputs look "fine" in ad-hoc testing -- but compounds over time. A prompt tweak that improves one class of questions may silently degrade another. A model upgrade may trade latency for quality in ways you only discover from user complaints. Eval gives you the data to make these trade-offs intentionally.

RAGAS for RAG pipelines

RAGAS provides metrics specifically designed for RAG evaluation. Setup and core usage are covered in the rag-oss.md recipe; this section focuses on advanced patterns.

Building an eval dataset

Start with 50-100 question-answer pairs that cover your domain's edge cases. Include:

Questions with single-document answers (straightforward retrieval)
Questions requiring multi-document synthesis
Questions that should return "I don't know" (no relevant context exists)
Adversarial questions that test boundary conditions

Store the dataset as a versioned JSON or CSV file alongside your codebase:

json

[
  {
    "question": "What is the maximum payload size for the ingest endpoint?",
    "ground_truth": "The maximum payload size is 10MB per request.",
    "context_source": "api-reference/ingest.md"
  },
  {
    "question": "How do I configure multi-region replication?",
    "ground_truth": "Multi-region replication is not currently supported.",
    "expected_behavior": "no_answer"
  }
]

Automated dataset generation

RAGAS can generate synthetic test questions from your documents:

python

from ragas.testset.generator import TestsetGenerator

generator = TestsetGenerator.from_langchain(llm, embeddings)
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=50,
)

Review and curate the generated questions before using them as a test set. Automated generation is a starting point, not a finished dataset -- remove duplicates, fix incorrect ground truths, and add edge cases that the generator missed.

Running in CI

Wrap RAGAS evaluation in a pytest test that asserts minimum metric thresholds:

python

from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, answer_relevancy

def test_rag_quality(rag_pipeline, eval_dataset):
    result = evaluate(
        dataset=eval_dataset,
        metrics=[faithfulness, context_precision, answer_relevancy],
    )
    assert result["faithfulness"] > 0.8, f"Faithfulness too low: {result['faithfulness']}"
    assert result["context_precision"] > 0.7, f"Context precision too low: {result['context_precision']}"
    assert result["answer_relevancy"] > 0.75, f"Answer relevancy too low: {result['answer_relevancy']}"

A failing eval blocks the PR. Set thresholds conservatively at first -- you can tighten them as the pipeline matures.

LangSmith tracing and eval

Tracing

LangSmith captures every LLM call, tool invocation, and chain step for debugging and analysis.

Setup:

bash

export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=<key>

Once set, all LangChain and LangGraph calls are automatically traced. No code changes required.

Viewing traces

The LangSmith web UI shows:

A timeline of each step in the chain
Input and output for every LLM call
Latency breakdown per step
Token usage and estimated cost

Use traces to debug unexpected outputs: find the step where the chain diverged from expected behavior, inspect the input that caused it, and fix the prompt or retrieval logic.

Eval with LangSmith

Create a dataset in LangSmith, define evaluators, run your chain against the dataset, and view results in the UI:

Create a dataset -- upload question-answer pairs or collect them from production traces
Define evaluators -- LLM-as-judge, heuristic, or exact match
Run evaluation -- execute your chain against every example in the dataset
View results -- compare scores across runs, identify failing examples

LLM-as-judge

Use a strong model (Claude) to evaluate a weaker model's output. Define criteria:

Correctness -- does the answer match the ground truth?
Helpfulness -- does the answer address the user's intent?
Harmlessness -- does the answer avoid generating harmful content?

LangSmith provides built-in evaluator templates for common criteria. Customize the rubric to match your domain's requirements.

Custom evaluators

python

from langsmith.evaluation import evaluate

def my_evaluator(run, example):
    expected_keyword = example.outputs["keyword"]
    answer = run.outputs["answer"]
    score = 1 if expected_keyword in answer else 0
    return {"key": "keyword_match", "score": score}

evaluate(
    my_chain,
    data="my-dataset",
    evaluators=[my_evaluator],
)

Combine multiple evaluators in a single run to get a multi-dimensional quality score.

Annotation queues

Route traces to human reviewers for quality assurance. Human labels feed back into the eval dataset, creating a feedback loop:

Production traces are sampled and added to an annotation queue
Reviewers label each trace as correct, partially correct, or incorrect
Labeled examples are added to the eval dataset
The eval suite grows over time, covering more edge cases

Custom eval harnesses (pytest-style)

For teams that do not use LangSmith or need fully local eval.

Structure

tests/
  eval/
    conftest.py
    data/
      eval_questions.json
    results/
    test_rag_quality.py
    test_agent_behavior.py

Fixtures

A pytest fixture that loads the eval dataset and initializes the pipeline:

python

import json
import pytest

@pytest.fixture
def eval_dataset():
    with open("tests/eval/data/eval_questions.json") as f:
        return json.load(f)

@pytest.fixture
def rag_pipeline():
    # Initialize your RAG pipeline here
    from myapp.rag import create_pipeline
    return create_pipeline()

Test patterns

Exact match -- for factual queries with deterministic answers:

python

def test_factual_answer(rag_pipeline, eval_dataset):
    for example in eval_dataset:
        response = rag_pipeline.query(example["question"])
        assert response.answer == example["expected_answer"]

Contains check -- for open-ended queries:

python

def test_contains_keyword(rag_pipeline, eval_dataset):
    for example in eval_dataset:
        response = rag_pipeline.query(example["question"])
        assert example["keyword"] in response.answer.lower()

LLM-as-judge -- call a strong model to grade the response on a 1-5 scale with a rubric, assert the score meets a minimum threshold:

python

def test_llm_judge(rag_pipeline, eval_dataset, judge_model):
    for example in eval_dataset:
        response = rag_pipeline.query(example["question"])
        score = judge_model.grade(
            question=example["question"],
            answer=response.answer,
            ground_truth=example["ground_truth"],
            rubric="Score 1-5: correctness, completeness, conciseness",
        )
        assert score >= 3, f"Low score ({score}) for: {example['question']}"

Retrieval quality -- assert that the correct source document appears in the retrieved context:

python

def test_retrieval_quality(rag_pipeline, eval_dataset):
    for example in eval_dataset:
        response = rag_pipeline.query(example["question"])
        sources = [doc.metadata["source"] for doc in response.context]
        assert example["context_source"] in sources

Latency -- for user-facing pipelines:

python

def test_latency(rag_pipeline, eval_dataset):
    for example in eval_dataset:
        response = rag_pipeline.query(example["question"])
        assert response.latency_ms < 2000, f"Too slow: {response.latency_ms}ms"

Running

bash

pytest tests/eval/ -v --tb=short

Separate eval tests from unit tests: eval tests are slow (they call LLMs) and expensive. Run them in a dedicated CI stage, not on every commit.

Result tracking

Write eval results to a JSON file with timestamp, commit hash, and metrics. Maintain a tests/eval/results/ directory with historical results for trend analysis:

python

import json
import subprocess
from datetime import datetime

def save_eval_results(metrics: dict, output_dir: str = "tests/eval/results"):
    commit = subprocess.check_output(
        ["git", "rev-parse", "HEAD"]
    ).decode().strip()
    result = {
        "timestamp": datetime.utcnow().isoformat(),
        "commit": commit,
        "metrics": metrics,
    }
    path = f"{output_dir}/eval-{datetime.utcnow().strftime('%Y%m%d-%H%M%S')}.json"
    with open(path, "w") as f:
        json.dump(result, f, indent=2)

Model comparison

Evaluating multiple models on the same test set to make informed selection decisions.

Parameterized tests

python

import pytest

@pytest.mark.parametrize("model", [
    "claude-sonnet-4-20250514",
    "gpt-4o",
    "llama-3.1-70b",
])
def test_model_quality(model, eval_dataset, create_pipeline):
    pipeline = create_pipeline(model=model)
    scores = []
    for example in eval_dataset:
        response = pipeline.query(example["question"])
        score = grade_response(response, example)
        scores.append(score)
    avg_score = sum(scores) / len(scores)
    assert avg_score >= 3.0, f"{model} average score too low: {avg_score}"

Comparison table

Generate a markdown table from results for model selection decisions:

Model	Avg Score	Latency p50	Latency p95	Cost/Query
claude-sonnet-4-20250514	4.2	1200ms	2800ms	$0.012
gpt-4o	3.9	900ms	2100ms	$0.015
llama-3.1-70b	3.5	600ms	1400ms	$0.003

This table is the basis for model selection decisions. Quality, latency, and cost are the three axes; the right choice depends on your use case.

A/B testing in production

Route a percentage of traffic to the new model, compare eval metrics over 24-48 hours, then promote or rollback. Requirements:

Log inputs, outputs, and user feedback alongside model identifiers
Compare eval metrics (quality score, latency, error rate) between model variants
Set a minimum sample size before making a decision (at least 100 queries per variant)
Automate the promotion/rollback decision with threshold checks

Regression tracking

Detecting quality degradation over time.

Baseline

After initial development, run the full eval suite and record results as the baseline:

bash

pytest tests/eval/ -v --tb=short
# Save output to tests/eval/baselines/baseline-YYYY-MM-DD.json

Store baselines in tests/eval/baselines/baseline-YYYY-MM-DD.json.

Regression detection

In CI, compare current results against the baseline. Flag any metric that drops by more than a configurable threshold (e.g., 5% relative decrease):

python

def check_regression(current: dict, baseline: dict, threshold: float = 0.95):
    for metric, baseline_score in baseline["metrics"].items():
        current_score = current["metrics"][metric]
        assert current_score >= baseline_score * threshold, (
            f"Regression: {metric} dropped from {baseline_score:.3f} "
            f"to {current_score:.3f} ({(1 - current_score / baseline_score) * 100:.1f}% decrease)"
        )

Common causes of regression

Prompt changes -- a tweak that improves one category degrades another
Model updates -- API providers sometimes update models silently
Retrieval pipeline changes -- new chunking strategy, different embedding model
Data drift -- new documents change the embedding distribution

When a regression is detected, the eval result points to which metric regressed, which helps narrow the cause.

Dashboard

For teams with production LLM systems, export eval metrics to a monitoring system (Datadog, Grafana) and set up alerts on threshold violations. Track:

Per-metric scores over time (faithfulness, relevancy, precision)
Latency percentiles (p50, p95, p99)
Error rates (timeouts, malformed outputs, refusals)
Cost per query over time

Gotchas

LLM-as-judge is not deterministic. Running the same evaluation twice can produce different scores. Mitigate by averaging over 3 runs or using temperature=0.

RAGAS requires an LLM to compute metrics. By default it uses OpenAI. Configure it to use your preferred model to avoid unexpected API calls and costs.

Eval datasets drift. As your domain evolves, the eval dataset must be updated. Schedule quarterly reviews to add new edge cases, remove obsolete questions, and update ground truths.

Cost adds up. A 100-question eval suite calling a model 3 times per question (for judge, generation, and comparison) costs approximately $1-5 per run. Budget for this in CI costs. Consider running the full suite only on PR merges, with a smaller smoke-test suite on every commit.

Flaky evals erode trust. If an eval test fails intermittently due to non-determinism rather than a real regression, teams learn to ignore failures. Keep thresholds realistic and use averaging to smooth out noise.

compgeo · experimental

hpc · experimental

mle · experimental

llm · experimental

react

sveltekit

tanstack

cpp

python

typescript

LLM Eval Tooling

Why eval is non-negotiable

RAGAS for RAG pipelines

Building an eval dataset

Automated dataset generation

Running in CI

LangSmith tracing and eval

Tracing

Viewing traces

Eval with LangSmith

LLM-as-judge

Custom evaluators

Annotation queues

Custom eval harnesses (pytest-style)

Structure

Fixtures

Test patterns

Running

Result tracking

Model comparison

Parameterized tests

Comparison table

A/B testing in production

Regression tracking

Baseline

Regression detection

Common causes of regression

Dashboard

Gotchas

LLM Eval Tooling ​

Why eval is non-negotiable ​

RAGAS for RAG pipelines ​

Building an eval dataset ​

Automated dataset generation ​

Running in CI ​

LangSmith tracing and eval ​

Tracing ​

Viewing traces ​

Eval with LangSmith ​

LLM-as-judge ​

Custom evaluators ​

Annotation queues ​

Custom eval harnesses (pytest-style) ​

Structure ​

Fixtures ​

Test patterns ​

Running ​

Result tracking ​

Model comparison ​

Parameterized tests ​

Comparison table ​

A/B testing in production ​

Regression tracking ​

Baseline ​

Regression detection ​

Common causes of regression ​

Dashboard ​

Gotchas ​

LLM Eval Tooling

Why eval is non-negotiable

RAGAS for RAG pipelines

Building an eval dataset

Automated dataset generation

Running in CI

LangSmith tracing and eval

Tracing

Viewing traces

Eval with LangSmith

LLM-as-judge

Custom evaluators

Annotation queues

Custom eval harnesses (pytest-style)

Structure

Fixtures

Test patterns

Running

Result tracking

Model comparison

Parameterized tests

Comparison table

A/B testing in production

Regression tracking

Baseline

Regression detection

Common causes of regression

Dashboard

Gotchas