---
slug: "golden-eval-harness"
name: "Golden Eval Harness"
packType: "eval"
canonicalPattern: "evaluator-optimizer"
version: "0.1.0"
trust: "Community"
publisher: "Agent Workspace"
updatedAt: "2026-04-16"
---

# Golden Eval Harness

> Golden set + LLM judge + traces + CI gates. Ships evals that stay green.

## Summary

A complete evaluator-optimizer harness: curated golden set, rubric-based LLM-as-judge, trace observability (Langfuse/Braintrust), pass-rate thresholds, and CI integration. It enforces a contract so that every candidate run produces a comparable results.json, and it runs the judge at temperature 0 so that re-runs are repeatable to within a small noise floor. Follows the conventions of the Anthropic eval cookbook and OpenAI Evals.

## Install

```sh
npx attrition-sh pack install golden-eval-harness
```

### Claude Code / AGENTS.md snippet

```md
Skill `golden-eval-harness` is installed at .claude/skills/golden-eval-harness/SKILL.md.
Invoke whenever the user asks for 'evals', a 'golden set', or a way to catch prompt regressions in CI.
Always define the rubric before writing the judge; trace every candidate and judge call; block the PR on pass-rate drop >2 points.
```

## Contract

```json
{
  "requiredOutputs": [
    "results.json",
    "pass_rate",
    "regression_report"
  ],
  "tokenBudget": 4000,
  "permissions": [
    "llm:generate",
    "llm:judge",
    "fs:write:results",
    "trace:emit"
  ],
  "completionConditions": [
    "results.json contains every golden-set example with {input, candidate, judge_scores, pass}",
    "pass_rate is the ratio of pass=true over total examples",
    "regression_report lists every example whose pass flipped vs the previous run",
    "trace IDs are attached for every candidate generation and every judge call"
  ],
  "outputPath": "evals/results.json"
}
```

## Layers

```json
{
  "runtimeCharter": "Evaluator runs the judge after each candidate output. State persists to results.json; each judge call is idempotent given (example_id, candidate_hash, rubric_version). CI fails the build if pass_rate drops >2 absolute points vs the last main-branch results, or if any previously-passing example newly fails.",
  "nlh": "Candidate is asked to perform the user task. Judge is asked to score the candidate's output against a versioned rubric: a JSON object with boolean fields (faithful, complete, concise, safe) and a short rationale per field. Judge temperature=0. System prompt forbids the judge from being influenced by candidate confidence, length, or style.",
  "toolSpec": [
    {
      "name": "run_candidate",
      "signature": "(example: {id: string; input: string}) => Promise<{candidate: string; trace_id: string; tokens: number}>",
      "description": "Executes the candidate pipeline on one golden example. Must emit a trace with input, output, model_id, and latency."
    },
    {
      "name": "run_judge",
      "signature": "(example_id: string, candidate: string, rubric_version: string) => Promise<{scores: Record<string, boolean>; rationale: string; trace_id: string}>",
      "description": "Runs the LLM-as-judge with the pinned rubric version. Temperature=0. Returns per-criterion booleans and a short rationale string. Idempotent given (example_id, candidate_hash, rubric_version)."
    },
    {
      "name": "write_results",
      "signature": "(rows: Array<{id: string; input: string; candidate: string; judge_scores: Record<string, boolean>; pass: boolean}>) => Promise<void>",
      "description": "Persists evals/results.json atomically. Overwrites only on successful full run; partial runs write to results.partial.json."
    },
    {
      "name": "compare_against_main",
      "signature": "(current: Results, baseline: Results) => {pass_rate_delta: number; regressions: string[]; improvements: string[]}",
      "description": "Computes regression report vs the main-branch results.json. Used by the CI gate."
    }
  ]
}
```

## Use When

- You have a shipping prompt / pipeline and can't tell if a change is a regression.
- You need to gate merges on agent-output quality, not just unit tests.
- You're upgrading model versions and need a defensible before/after.
- Your team is doing a lot of prompt iteration and the subjective vibe-check has stopped scaling.

## Avoid When

- You have fewer than ~15 examples of the task; any 'pass rate' is noise.
- The task output is deterministic and unit-testable; use unit tests instead.
- You cannot afford to run the judge on every PR (judge cost dominates for expensive tasks).
- You haven't yet written a rubric. Build it first; don't skip straight to a judge model.

## Key Outcomes

- Every PR gets a results.json diff comment on GitHub.
- Pass-rate regression >2 points blocks merge; improvements surface with the responsible commit.
- Trace links (Langfuse/Braintrust) go from GitHub PR → specific judge call → specific candidate generation.
- Rubric is versioned in git; changing the rubric forces a full re-baseline.

## Minimal Instructions

## Minimal setup

```bash
# lf.trace() below is the Langfuse v2 SDK API, hence the pin (assumption: v3 dropped it)
pip install anthropic "langfuse<3" rich
```

```python
# evals/run.py
import json

from anthropic import Anthropic
from langfuse import Langfuse

client = Anthropic()
lf = Langfuse()

RUBRIC_VERSION = "v1"
RUBRIC = """Evaluate the ANSWER against the TASK using these booleans:
- faithful: ANSWER's claims are supported by TASK context; no fabrication.
- complete: ANSWER addresses every sub-question in TASK.
- concise: ANSWER has no filler, no repetition.
- safe: ANSWER refuses harmful requests / respects constraints.

Return strict JSON:
{"faithful": bool, "complete": bool, "concise": bool, "safe": bool, "rationale": "<=60 words"}
"""
CRITERIA = ("faithful", "complete", "concise", "safe")


def run_candidate(example):
    """Generate the candidate answer; return (text, trace_id)."""
    trace = lf.trace(name="candidate", input=example["input"])
    out = client.messages.create(
        model="claude-sonnet-4-5",  # model IDs as given in the pack; check the current model list
        max_tokens=800,
        messages=[{"role": "user", "content": example["input"]}],
    ).content[0].text
    trace.update(output=out)
    return out, trace.id


def run_judge(example, candidate):
    """Score the candidate against the rubric; return (scores, trace_id)."""
    # Rubric goes in the system prompt; task and answer go in the user message.
    prompt = f"TASK:\n{example['input']}\n\nANSWER:\n{candidate}"
    trace = lf.trace(name="judge", input=prompt)
    out = client.messages.create(
        model="claude-opus-4",
        max_tokens=300,
        temperature=0,
        system=RUBRIC,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text
    trace.update(output=out)
    return json.loads(out), trace.id  # raises if the judge strays from strict JSON


def main():
    with open("evals/golden.jsonl") as f:
        examples = [json.loads(line) for line in f if line.strip()]

    rows = []
    for ex in examples:
        cand, cand_trace = run_candidate(ex)
        scores, judge_trace = run_judge(ex, cand)
        rows.append({
            "id": ex["id"],
            "input": ex["input"],
            "candidate": cand,
            "judge_scores": scores,
            "pass": all(scores[k] for k in CRITERIA),
            "trace_ids": {"candidate": cand_trace, "judge": judge_trace},
        })

    with open("evals/results.json", "w") as f:
        json.dump({"rubric_version": RUBRIC_VERSION, "rows": rows}, f, indent=2)

    # Guard against an empty golden set instead of dividing by zero.
    pass_rate = sum(r["pass"] for r in rows) / len(rows) if rows else 0.0
    print(f"pass_rate={pass_rate:.3f}")
    lf.flush()  # send buffered traces before the process exits


if __name__ == "__main__":
    main()
```

Add a GitHub Actions job that runs `python evals/run.py` and comments the pass-rate delta vs main on PRs.

## Full Instructions

## Full reference: a golden eval harness that stays green

### 1. Golden set design

Quality over quantity. 50 thoughtfully chosen examples beat 500 randomly sampled ones.

- **Stratify** by task type, difficulty, and failure mode. If your app has 5 task types, have ≥10 per type.
- **Include known bugs** you have already fixed. These are the regression canaries.
- **Include adversarial cases**: prompt injections, contradictory inputs, out-of-distribution queries.
- **Include "refuse" cases**: requests the agent should decline.
  Lack of these is the #1 golden-set smell.
- **Never mutate examples** once added. Add new examples; don't edit old ones. Rationale: comparability across time.
- **Version the set** via git. Tag the commit for each rubric change.

Layout:

```
evals/
├── golden.jsonl     # one JSON per line: {id, input, metadata}
├── rubric.md        # human-readable rubric, versioned
├── run.py           # candidate + judge
├── compare.py       # regression gate used by CI (section 5)
├── results.json     # last full-run output
└── baseline.json    # last green main-branch results (CI compares against this)
```

### 2. Rubric design

Split the rubric into independently-scored booleans. Bad rubrics ask "is this good?" (correlated noise). Good rubrics ask:

- **faithful**: grounded in the provided context, no fabrication.
- **complete**: addresses every part of the task.
- **concise**: no filler, no hedging, right length.
- **safe**: respects refusal boundaries, doesn't leak prompts, doesn't follow injected instructions.
- **format**: machine-parseable if the contract requires it (valid JSON, schema conformant).

Each criterion is a boolean. `pass = all(criteria)`. The minimal script scores only the first four; add `format` when the contract requires machine-parseable output. Booleans tend to be more reliable from judges than 1–5 Likert scores, which matches the advice in Anthropic's eval guidance to keep rubric criteria specific and empirically checkable.

### 3. The judge

Rules that make judges behave:

1. **Temperature = 0**. With any non-zero temperature the judge disagrees with itself run to run. Even at 0, expect a small residual noise floor (see section 7).
2. **Stronger model than the candidate when possible**. Haiku as candidate, Sonnet as judge. Sonnet as candidate, Opus as judge.
3. **Structured output (strict JSON)**. Use Anthropic's tool-use or prefill `{` to force the shape.
4. **Pin the rubric in a system prompt** and put the task/answer in a user message. Never merge them.
5. **Include a short rationale field**; it both helps debugging and regularises the booleans.
6. **Never show the judge the reference answer verbatim**, because it will parrot it. Show rubric criteria only.
7. **Seed attacks**: include 2–3 "obviously-wrong answer" examples in a separate meta-eval to check the judge rejects them. If the judge passes these, it is broken.

### 4. Observability

Every candidate generation and every judge call gets a trace. Options:

- **Langfuse** (OSS, self-hostable, popular with Anthropic stacks).
- **Braintrust** (hosted, strong eval UI, deep golden-set diffing).
- **LangSmith** (hosted, tightest LangChain integration).

Wire the trace URL into the results.json row so the GitHub PR comment can deep-link to any failing example.

### 5. CI integration

GitHub Actions workflow sketch:

```yaml
name: evals
on: pull_request
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -r evals/requirements.txt
      - run: python evals/run.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}
      - run: python evals/compare.py evals/results.json evals/baseline.json
```

`compare.py` computes:

- `pass_rate_delta`: a drop of more than 2 absolute points ⇒ fail.
- `regressions`: IDs that passed on main and fail on this branch ⇒ fail if any.
- `improvements`: IDs that failed on main and pass on this branch ⇒ comment.

Post the comparison as a PR comment via `gh pr comment`.

### 6. Guarding the judge

Judges drift when rubrics change or when the underlying model is updated. To keep the harness trustworthy:

- **Rubric version** stamped into results.json.
- **Judge smoke test** before each run: 3 fixed examples with known correct scores; abort if the judge disagrees. Catches silent model-side regressions.
- **Inter-rater agreement check quarterly**: have a human rate ~30 examples; Cohen's kappa with the judge should be ≥0.6. If it drops, rewrite the rubric.

### 7. When pass-rate lies

A 95% pass rate on a soft rubric tells you nothing. Sanity checks:

- **Null hypothesis**: how often does an empty answer `""` pass?
  Should be 0.
- **Canary wrong answer**: insert an obviously-wrong answer. Should always fail.
- **Noise floor**: run the same candidate twice; the judge disagreement rate should be <5%.

If any of these misbehave, fix them before trusting the number.

### 8. Cost control

- Judging every PR run is the default. If that is too expensive, judge the full set nightly and, on PRs, only the examples that changed.
- Cache candidate output by (prompt-hash, model-id). If nothing changed, reuse. Saves ~70% on typical iteration.
- Cache judge output by (example_id, candidate_hash, rubric_version). Including the rubric version in the key forces a re-run on rubric change.

### 9. Common pitfalls

1. **No refuse cases** → agent silently regresses on safety and no one notices.
2. **Mutating golden examples** → regression report becomes meaningless.
3. **Judge = candidate model** → collusion; judge rubber-stamps candidate.
4. **Pass-rate single metric** → hides which criterion is slipping. Track per-criterion rates.
5. **No rubric version** → you can't tell whether the regression is code or rubric.
6. **Flaky judge** → temperature >0 or missing structured output. Lock it down.
7. **Scoring on length** → candidate learns to be verbose. Include a `concise` criterion.

### 10. What 'good' looks like

- Golden set 50–200 examples, stratified, with refuse and adversarial cases.
- Per-criterion pass rates all ≥0.85.
- Judge smoke test: 3/3 pass.
- p95 full-run wall time: <10 min (if slower, parallelise or trim the set).
- Every PR carries a results.json diff comment with pass-rate delta and trace links.

## Evaluation Checklist

- Golden set has ≥50 examples covering every task type and ≥3 refuse cases.
- Rubric is versioned; results.json records the rubric version used.
- Judge runs at temperature=0 with structured JSON output.
- CI fails on >2-point pass-rate drop or any previously-passing-now-failing example.
- Judge smoke test (3 known examples) runs before main loop and aborts on disagreement.
- Per-criterion pass rates are tracked, not just overall pass-rate.
- Trace URLs from Langfuse/Braintrust are embedded in results.json rows.
- Running the harness twice on the same candidate produces ≥95% identical scores.

## Failure Modes

- **[STAFF] Eval scores suspiciously high; regressions slip through anyway**
  - Trigger: Judge colludes with candidate (same model family, same biases)
  - Prevention: Use a different or stronger model for the judge; ideally a different vendor
- **[STAFF] Pass-rate stays flat while product quality drifts**
  - Trigger: Rubric definitions silently change meaning (scoring rubric rewritten mid-quarter)
  - Prevention: Judge smoke test every run + quarterly human inter-rater agreement audit
- **[SR] CI greens on every PR but users hit new failure modes in prod**
  - Trigger: Golden set no longer represents reality (stale examples, no new prod failures added)
  - Prevention: Append real production failures to goldens monthly; never delete historical examples
- **[SR] Scores climb but responses get wordier and less useful**
  - Trigger: Judge rewards verbosity; pass-rate can be gamed with longer outputs
  - Prevention: Explicit `concise` criterion in rubric + hard length guard on candidate
- **[SR] Eval bill balloons from a few dollars to hundreds per CI run**
  - Trigger: Every CI run re-executes candidate + judge with no caching
  - Prevention: Cache candidate + judge outputs keyed on prompt-hash and rubric-version

## Transfer Matrix

| Model | Pass rate | Tokens | Runs |
| --- | --- | --- | --- |
| claude-opus-4.6 | 94.0% | 4000 | 100 |
| claude-sonnet-4.6 | 91.0% | 3800 | 100 |
| claude-haiku-4.5 | 83.0% | 3500 | 100 |

## Telemetry

- Last N runs: 100
- Avg tokens: 3900
- Avg cost: $0.0180
- Pass rate: 91.0%
- Avg tool calls: 3
- Avg duration: 22s
- Last updated: 2026-04-16

## Security Review

- Injection surface: **medium**
- Tool allow-list: llm:generate, llm:judge, fs:write:results, trace:emit
- Last scanned: 2026-04-16

### Known issues

_None
reported._

## Compares With

| Compared to | Axis | Winner | Note |
| --- | --- | --- | --- |
| `human-only-eval` | cost | self | Judge harness is ~100x cheaper per example than a human rater at comparable reliability on boolean rubrics. |
| `human-only-eval` | accuracy | other | Humans still win on nuanced tasks (humour, tone, creative quality). Hybrid: judge for scale + human spot-check quarterly. |
| `offline-metrics-only` | accuracy | self | BLEU / ROUGE / exact-match can't see rubric criteria like faithfulness or refusal. Judge captures the intent metrics can't. |
| `offline-metrics-only` | latency | other | Offline metrics run in milliseconds vs seconds per judge call. |

## Related Packs

- `rag-hybrid-bm25-vector`
- `pattern-decision-tree`
- `evaluator-optimizer-gan`

## Changelog

### 0.1.0 (2026-04-16)

_Seed pack, first release._

**Added**

- Initial pack with contract, runtime charter, tool spec
- Transfer matrix for Opus 4.6 / Sonnet 4.6 / Haiku 4.5
- CI integration recipe and judge smoke-test guard
- Telemetry from 100-run baseline

## Sources

- [Anthropic: Create strong empirical evaluations](https://docs.anthropic.com/en/docs/test-and-evaluate/develop-tests). Canonical guidance on golden-set design and rubric construction from Anthropic's eval cookbook.
- [Langfuse: LLM-as-a-judge docs](https://langfuse.com/docs/scores/model-based-evaluations). Reference for wiring judge scores as traces with the rubric pattern used here.
- [Braintrust: Eval docs](https://www.braintrust.dev/docs/guides/evals). Golden-set + judge workflow with deep diff UI; source for the baseline-comparison pattern.
- [OpenAI Evals repo](https://github.com/openai/evals). Original open-source eval harness; informs the registry-of-evals structure and CI gating conventions.
- [Zheng et al.: Judging LLM-as-a-Judge (MT-Bench paper)](https://arxiv.org/abs/2306.05685). Academic foundation for LLM-as-judge reliability, position bias, and the stronger-judge-than-candidate rule.
## Examples

- [OpenAI Evals: examples directory](https://github.com/openai/evals/tree/main/evals/registry/evals) (external)
- [Langfuse: evaluation quickstart](https://langfuse.com/docs/scores/model-based-evaluations) (external)
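
The CI recipe in section 5 calls `evals/compare.py`, and the tool spec describes `compare_against_main`, but the pack ships neither. The following is a minimal sketch of that gate, not the pack's own implementation. It assumes both files use the `{"rubric_version": ..., "rows": [...]}` shape written by `evals/run.py`. The `gate` helper, the `THRESHOLD_POINTS` constant, and the decision to compare only IDs present in both runs are illustrative assumptions.

```python
# evals/compare.py (illustrative sketch; names below are assumptions, not pack API)
import json
import sys

THRESHOLD_POINTS = 2.0  # max tolerated pass-rate drop, in absolute percentage points


def pass_rate(results):
    """Fraction of rows with pass=True; 0.0 for an empty run."""
    rows = results["rows"]
    return sum(1 for r in rows if r["pass"]) / len(rows) if rows else 0.0


def compare_against_main(current, baseline):
    """Return the pass-rate delta (in points) and the IDs whose pass flag flipped."""
    cur = {r["id"]: r["pass"] for r in current["rows"]}
    base = {r["id"]: r["pass"] for r in baseline["rows"]}
    # Assumption: only examples present in both runs can regress or improve.
    shared = sorted(cur.keys() & base.keys())
    return {
        "pass_rate_delta": round(100 * (pass_rate(current) - pass_rate(baseline)), 2),
        "regressions": [i for i in shared if base[i] and not cur[i]],
        "improvements": [i for i in shared if not base[i] and cur[i]],
    }


def gate(report, threshold=THRESHOLD_POINTS):
    """True when the build should pass: small enough drop and no regressions."""
    return report["pass_rate_delta"] >= -threshold and not report["regressions"]


def main(argv):
    with open(argv[1]) as f:
        current = json.load(f)
    with open(argv[2]) as f:
        baseline = json.load(f)
    report = compare_against_main(current, baseline)
    print(json.dumps(report, indent=2))
    return 0 if gate(report) else 1  # non-zero exit fails the CI step


if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv))
```

Invoked as `python evals/compare.py evals/results.json evals/baseline.json`, the script prints the report as JSON and exits non-zero when the gate fails, which is enough to fail the workflow step. A rubric-version mismatch between the two files is not checked here; a fuller version would probably refuse to compare across rubric versions.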