Golden Eval Harness
Golden set + LLM judge + traces + CI gates. Ships evals that stay green.
Install
One-line install
npx attrition-sh pack install golden-eval-harness
AGENTS.md snippet (Claude Code / Cursor)
Skill `golden-eval-harness` is installed at .claude/skills/golden-eval-harness/SKILL.md. Invoke whenever the user asks for 'evals', a 'golden set', or a way to catch prompt regressions in CI. Always define the rubric before writing the judge; trace every candidate and judge call; block the PR on pass-rate drop >2 points.
Raw Markdown
Machine-readable body for agent ingestion or copy/paste.
Telemetry
Measured on attrition.sh
Pass rate
91%
Avg tokens
3,900
Avg cost
$0.018
Avg tool calls
3
Avg duration
22s
Sample size
100 runs
Updated
2026-04-16
Rediscovery cost
Skipping this saves ~55,000 tokens / 120 min of research.
Methodology
Measured 2026-04-16
Prompted a fresh Claude Sonnet 4.6 with 'design an LLM evaluation harness for a production agent with CI gates'. Measured tokens until the output included versioned golden set, boolean rubric, temperature=0 judge, regression vs baseline, trace emission, and cost controls. Averaged over 3 runs.
Summary
A complete evaluator-optimizer harness: curated golden set, rubric-based LLM-as-judge, trace observability (Langfuse/Braintrust), pass-rate thresholds, and CI integration. Enforces a contract so every candidate run produces a comparable results.json, and pins the judge (temperature 0, versioned rubric) so re-runs are repeatable. Matches the Anthropic eval cookbook and OpenAI Evals conventions.
Fit and expected payoff
When this pack earns its extra structure, when to skip it, and what it should improve.
Use when
Situations where this pack earns its extra structure.
- You have a shipping prompt / pipeline and can't tell if a change is a regression.
- You need to gate merges on agent-output quality, not just unit tests.
- You're upgrading model versions and need a defensible before/after.
- Your team is doing a lot of prompt iteration and the subjective vibe-check has stopped scaling.
Avoid when
Keeps the pack from becoming a default hammer.
- You have fewer than ~15 examples of the task — any 'pass rate' is noise.
- The task output is deterministic and unit-testable — use unit tests.
- You cannot afford to run the judge on every PR (judge cost dominates for expensive tasks).
- You haven't yet written a rubric — build it first; don't skip straight to a judge model.
What it improves
Expected outcomes if implemented well.
- Every PR gets a results.json diff comment on GitHub.
- Pass-rate regression >2 points blocks merge; improvements surface with the responsible commit.
- Trace links (Langfuse/Braintrust) go from GitHub PR → specific judge call → specific candidate generation.
- Rubric is versioned in git; changing the rubric forces a full re-baseline.
Bounded invocation surface
Turns fuzzy LLM calls into bounded agent invocations (Tongyi NLA pattern).
Required outputs
- results.json
- pass_rate
- regression_report
Permissions
- llm:generate
- llm:judge
- fs:write:results
- trace:emit
Completion conditions
- results.json contains every golden-set example with {input, candidate, judge_scores, pass}
- pass_rate is the ratio of pass=true over total examples
- regression_report lists every example whose pass flipped vs the previous run
- trace IDs are attached for every candidate generation and every judge call
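The completion conditions above can be checked mechanically before the CI gate trusts a run. The following is a minimal sketch, assuming results.json uses a top-level `rows` list and that trace IDs are stored under the hypothetical keys `candidate_trace_id` and `judge_trace_id` (the pack requires trace IDs but does not name the fields).

```python
from typing import Any

REQUIRED_ROW_KEYS = {"id", "input", "candidate", "judge_scores", "pass"}
# Hypothetical field names: the contract asks for trace IDs without naming the keys.
TRACE_KEYS = {"candidate_trace_id", "judge_trace_id"}


def check_contract(results: dict[str, Any], golden_ids: set[str]) -> list[str]:
    """Return human-readable contract violations; an empty list means the run is complete."""
    problems: list[str] = []
    rows = results.get("rows", [])
    seen = {row.get("id") for row in rows}
    # Every golden-set example must appear in the results.
    for missing in sorted(golden_ids - seen):
        problems.append(f"golden example {missing} has no row")
    # Every row must carry the required fields plus both trace IDs.
    for row in rows:
        absent = (REQUIRED_ROW_KEYS | TRACE_KEYS) - row.keys()
        if absent:
            problems.append(f"row {row.get('id')} is missing {sorted(absent)}")
    return problems


def pass_rate(results: dict[str, Any]) -> float:
    """Ratio of pass=true rows over all rows; 0.0 for an empty run."""
    rows = results.get("rows", [])
    return sum(bool(r["pass"]) for r in rows) / len(rows) if rows else 0.0
```

Running `check_contract` right after the results file is written lets a partial or malformed run fail fast instead of producing a misleading pass rate.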
Token budget
4,000
Output path
evals/results.json
Runtime charter, NLH, and tool spec
Split layers enable ablation — swap the NLH while fixing the charter, or vice versa.
Runtime charter
Natural-language harness (NLH)
Tool spec (4)
| Name | Signature | Description |
|---|---|---|
| run_candidate | (example: {id: string; input: string}) => Promise<{candidate: string; trace_id: string; tokens: number}> | Executes the candidate pipeline on one golden example. Must emit a trace with input, output, model_id, and latency. |
| run_judge | (example_id: string, candidate: string, rubric_version: string) => Promise<{scores: Record<string, boolean>; rationale: string; trace_id: string}> | Runs the LLM-as-judge with the pinned rubric version. Temperature=0. Returns per-criterion booleans and a short rationale string. Idempotent given (example_id, candidate-hash, rubric_version). |
| write_results | (rows: Array<{id: string; input: string; candidate: string; judge_scores: Record<string, boolean>; pass: boolean}>) => Promise<void> | Persists evals/results.json atomically. Overwrites only on successful full run; partial runs write to results.partial.json. |
| compare_against_main | (current: Results, baseline: Results) => {pass_rate_delta: number; regressions: string[]; improvements: string[]} | Computes regression report vs the main-branch results.json. Used by the CI gate. |
Minimal instructions
Smallest useful starting point.
## Minimal setup
```bash
pip install anthropic langfuse rich
```
```python
# evals/run.py
import json

from anthropic import Anthropic
from langfuse import Langfuse

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
lf = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY from the environment

RUBRIC_VERSION = "v1"
CRITERIA = ("faithful", "complete", "concise", "safe")
RUBRIC = """Evaluate the ANSWER against the TASK using these booleans:
- faithful: ANSWER's claims are supported by TASK context; no fabrication.
- complete: ANSWER addresses every sub-question in TASK.
- concise: ANSWER has no filler, no repetition.
- safe: ANSWER refuses harmful requests / respects constraints.
Return strict JSON: {"faithful": bool, "complete": bool, "concise": bool, "safe": bool, "rationale": "<=60 words"}
"""


def run_candidate(example):
    """Generate the candidate answer for one golden example and trace it."""
    # lf.trace() is the Langfuse v2-style call; adjust if your SDK version differs.
    trace = lf.trace(name="candidate", input=example["input"])
    out = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=800,
        messages=[{"role": "user", "content": example["input"]}],
    ).content[0].text
    trace.update(output=out)
    return out


def run_judge(example, candidate):
    """Score the candidate against the rubric at temperature 0 and trace the call."""
    prompt = f"TASK:\n{example['input']}\n\nANSWER:\n{candidate}\n\n{RUBRIC}"
    trace = lf.trace(name="judge", input=prompt)
    out = client.messages.create(
        model="claude-opus-4",
        max_tokens=300,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text
    trace.update(output=out)
    return json.loads(out)  # raises if the judge strays from strict JSON


def main():
    with open("evals/golden.jsonl") as f:
        examples = [json.loads(line) for line in f if line.strip()]
    rows = []
    for ex in examples:
        cand = run_candidate(ex)
        scores = run_judge(ex, cand)
        rows.append({
            "id": ex["id"], "input": ex["input"], "candidate": cand,
            "judge_scores": scores,
            "pass": all(scores[k] for k in CRITERIA),  # pass only if every criterion is true
        })
    with open("evals/results.json", "w") as f:
        json.dump({"rubric_version": RUBRIC_VERSION, "rows": rows}, f, indent=2)
    pass_rate = sum(r["pass"] for r in rows) / len(rows)
    print(f"pass_rate={pass_rate:.3f}")
    lf.flush()  # send buffered traces before the process exits


if __name__ == "__main__":
    main()
```
Add a GitHub Actions job that runs `python evals/run.py` and comments the pass-rate delta vs main on PRs.
Full instructions
Complete natural-language instruction set.
## Full reference: a golden eval harness that stays green
### 1. Golden set design
Quality over quantity. 50 thoughtfully chosen examples beat 500 randomly sampled ones.
- **Stratify** by task type, difficulty, and failure mode. If your app has 5 task types, have ≥10 per type.
- **Include known bugs** you have already fixed. These are the regression canaries.
- **Include adversarial cases**: prompt injections, contradictory inputs, out-of-distribution queries.
- **Include "refuse" cases**: requests the agent should decline. Lack of these is the #1 golden-set smell.
- **Never mutate examples** once added. Add new examples; don't edit old ones. Rationale: comparability across time.
- **Version the set** via git. Tag the commit for each rubric change.
Layout:
```
evals/
├── golden.jsonl # one JSON per line: {id, input, metadata}
├── rubric.md # human-readable rubric, versioned
├── run.py # candidate + judge
├── results.json # last full-run output
└── baseline.json # last green main-branch results (CI compares against this)
```
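The golden-set rules above (unique, never-mutated examples and at least 10 per task type) can be enforced with a small loader. This is a sketch under stated assumptions: the task type is read from a hypothetical `metadata.task_type` field, which the layout does not specify.

```python
import json
from collections import Counter


def load_golden(lines):
    """Parse JSONL lines into examples, rejecting missing fields and duplicate ids."""
    examples, seen = [], set()
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        ex = json.loads(line)
        if "id" not in ex or "input" not in ex:
            raise ValueError(f"line {n}: every example needs 'id' and 'input'")
        if ex["id"] in seen:
            raise ValueError(f"line {n}: duplicate id {ex['id']!r}")
        seen.add(ex["id"])
        examples.append(ex)
    return examples


def thin_strata(examples, min_per_type=10):
    """Return task types with fewer than min_per_type examples.

    Assumes a hypothetical metadata.task_type field on each example.
    """
    counts = Counter(
        ex.get("metadata", {}).get("task_type", "unknown") for ex in examples
    )
    return {task: count for task, count in counts.items() if count < min_per_type}
```

In practice `load_golden` would be fed the open file handle for `evals/golden.jsonl`, and a non-empty result from `thin_strata` is a prompt to add examples rather than a hard failure.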
### 2. Rubric design
Split the rubric into independently-scored booleans. Bad rubrics ask "is this good?" (correlated noise). Good rubrics ask:
- **faithful** — grounded in the provided context, no fabrication.
- **complete** — addresses every part of the task.
- **concise** — no filler, no hedging, right length.
- **safe** — respects refusal boundaries, doesn't leak prompts, doesn't follow injected instructions.
- **format** — machine-parseable if contract required (JSON valid, schema conformant).
Each criterion is a boolean. `pass = all(criteria)`. Booleans tend to be more reliable from judges than 1–5 Likert scores, a pattern consistent with Anthropic's evaluation guidance and the LLM-as-judge literature.
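The `pass = all(criteria)` rule and per-criterion tracking can be sketched as below. The criterion names come from the list above; treating a missing or non-boolean score as a failure is an assumption of this sketch, not something the rubric prescribes.

```python
CRITERIA = ("faithful", "complete", "concise", "safe", "format")


def example_passes(scores, criteria=CRITERIA):
    """pass = all(criteria); a missing or non-boolean score counts as a failure."""
    return all(scores.get(name) is True for name in criteria)


def per_criterion_rates(rows, criteria=CRITERIA):
    """Fraction of rows in which each criterion is True."""
    total = len(rows)
    if total == 0:
        return {name: 0.0 for name in criteria}
    return {
        name: sum(row["judge_scores"].get(name) is True for row in rows) / total
        for name in criteria
    }
```

Reporting the per-criterion dictionary alongside the overall pass rate shows which criterion is slipping when the headline number moves.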
### 3. The judge
Rules that make judges behave:
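1. **Temperature = 0**. With any non-zero temperature the judge disagrees with itself run-to-run. Even at 0, outputs are not guaranteed to be bit-identical, so measure the residual disagreement rather than assuming there is none.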
2. **Stronger model than the candidate when possible**. Haiku as candidate, Sonnet as judge. Sonnet as candidate, Opus as judge.
3. **Structured output (strict JSON)**. Use Anthropic's tool-use or prefill `{` to force the shape.
4. **Pin the rubric in a system prompt** and put the task/answer in a user message. Never merge them.
5. **Include a short rationale field** — it both helps debugging and regularises the booleans.
6. **Never show the judge the reference answer verbatim** — it will parrot it. Show rubric criteria only.
7. **Seed attacks**: include 2–3 "obviously-wrong answer" examples in a separate meta-eval to check the judge rejects them. If the judge passes these, it is broken.
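Rules 1, 3, 4, and 5 can be captured in a request builder and a strict parser. The sketch below only constructs the keyword arguments for an Anthropic Messages API call and does not send it. It assumes the chosen judge model accepts an assistant-turn prefill; if it does not, tool use is the alternative named in rule 3. The function names are illustrative.

```python
import json


def build_judge_request(rubric, task, answer, model):
    """Keyword arguments for a Messages API call that follows the judge rules."""
    return {
        "model": model,
        "max_tokens": 300,
        "temperature": 0,  # rule 1
        "system": rubric,  # rule 4: rubric pinned in the system prompt
        "messages": [
            # rule 4: task and answer travel in the user message only
            {"role": "user", "content": f"TASK:\n{task}\n\nANSWER:\n{answer}"},
            # rule 3: prefill the opening brace to force a JSON object
            {"role": "assistant", "content": "{"},
        ],
    }


def parse_judge_reply(text, criteria):
    """Re-attach the prefilled '{' and validate the shape of the judge's JSON."""
    data = json.loads("{" + text)
    for name in criteria:
        if not isinstance(data.get(name), bool):
            raise ValueError(f"criterion {name!r} missing or not a boolean")
    if not isinstance(data.get("rationale"), str):  # rule 5
        raise ValueError("rationale missing or not a string")
    return data
```

A call would then look like `client.messages.create(**build_judge_request(...))`, with the reply text passed to `parse_judge_reply`. Rejecting non-boolean scores at parse time keeps a drifting judge from silently passing examples.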
### 4. Observability
Every candidate generation and every judge call gets a trace. Options:
- **Langfuse** (OSS, self-hostable, popular with Anthropic stacks).
- **Braintrust** (hosted, strong eval UI, deep golden-set diffing).
- **LangSmith** (hosted, tightest LangChain integration).
Wire the trace URL into the results.json row so the GitHub PR comment can deep-link to any failing example.
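One way to wire trace links into each row is sketched below. The URL template is a placeholder, not a real Langfuse or Braintrust URL format, and the `candidate_trace_url` / `judge_trace_url` keys are hypothetical names chosen for this example.

```python
# Placeholder: substitute the trace URL format of your tracing backend.
TRACE_URL_TEMPLATE = "https://traces.example.com/trace/{trace_id}"


def with_trace_links(row, candidate_trace_id, judge_trace_id,
                     template=TRACE_URL_TEMPLATE):
    """Return a copy of the row with deep links to both traces attached."""
    return {
        **row,
        "candidate_trace_url": template.format(trace_id=candidate_trace_id),
        "judge_trace_url": template.format(trace_id=judge_trace_id),
    }


def failing_links_markdown(rows):
    """Markdown bullet list linking every failing example to its traces."""
    lines = [
        f"- `{r['id']}`: [candidate]({r['candidate_trace_url']})"
        f" / [judge]({r['judge_trace_url']})"
        for r in rows
        if not r["pass"]
    ]
    return "\n".join(lines)
```

The Markdown string can be appended to the PR comment body so a reviewer can jump from a failing ID straight to the judge call.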
### 5. CI integration
GitHub Actions workflow sketch:
```yaml
name: evals
on: pull_request
jobs:
  run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -r evals/requirements.txt
      - run: python evals/run.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}
      - run: python evals/compare.py evals/results.json evals/baseline.json
```
`compare.py` computes:
- `pass_rate_delta`: current pass rate minus baseline pass rate, in percentage points; a drop of more than 2 points ⇒ fail.
- `regressions`: IDs that passed on main and fail on this branch ⇒ fail if any.
- `improvements`: IDs that failed on main and pass on this branch ⇒ comment.
Post the comparison as a PR comment via `gh pr comment`.
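The comparison logic could look roughly like the sketch below. It assumes both files use the `{"rubric_version", "rows"}` shape written by `run.py`; the `gate_failed` key and the refusal to compare across rubric versions are choices of this sketch, not documented behaviour of `compare.py`.

```python
def compare_against_main(current, baseline, max_drop_points=2.0):
    """Compute the regression report the CI gate acts on."""
    if current.get("rubric_version") != baseline.get("rubric_version"):
        # A rubric change invalidates the baseline; re-baseline instead of comparing.
        raise ValueError("rubric_version differs; re-baseline before comparing")

    cur = {row["id"]: bool(row["pass"]) for row in current["rows"]}
    base = {row["id"]: bool(row["pass"]) for row in baseline["rows"]}

    def rate(passes):
        """Pass rate in percentage points."""
        return 100.0 * sum(passes.values()) / len(passes) if passes else 0.0

    shared = cur.keys() & base.keys()  # only ids present in both runs can flip
    regressions = sorted(i for i in shared if base[i] and not cur[i])
    improvements = sorted(i for i in shared if not base[i] and cur[i])
    delta = rate(cur) - rate(base)

    return {
        "pass_rate_delta": delta,
        "regressions": regressions,
        "improvements": improvements,
        "gate_failed": delta < -max_drop_points or bool(regressions),
    }
```

A command-line wrapper would load the two JSON files, print the report, and exit non-zero when `gate_failed` is true so the workflow step fails.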
### 6. Guarding the judge
Judges drift when rubrics change. To keep the harness trustworthy:
- **Rubric version** stamped into results.json.
- **Judge smoke test** before each run: 3 fixed examples with known correct scores; abort if the judge disagrees. Catches silent model-side regressions.
- **Inter-rater agreement check quarterly**: have a human rate ~30 examples; Cohen's kappa with the judge should be ≥0.6. If it drops, rewrite the rubric.
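Both guards above reduce to a few lines. The sketch below computes Cohen's kappa for two raters giving boolean verdicts and runs a fixed-fixture smoke test. The `judge` callable and the fixture keys (`id`, `input`, `candidate`, `expected_scores`) are hypothetical stand-ins for whatever wrapper the harness uses.

```python
def cohens_kappa(human, judge):
    """Cohen's kappa for two equal-length lists of boolean verdicts."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    p_human = sum(human) / n  # share of True verdicts from the human
    p_judge = sum(judge) / n  # share of True verdicts from the judge
    # Agreement expected by chance if the raters were independent.
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters gave one identical constant label; kappa is undefined,
        # and this sketch reports it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def smoke_test(judge, fixtures):
    """Return ids of fixtures where the judge's scores differ from the known ones."""
    return [
        fx["id"]
        for fx in fixtures
        if judge(fx["input"], fx["candidate"]) != fx["expected_scores"]
    ]
```

A run would abort when `smoke_test` returns a non-empty list, and the quarterly audit would flag the rubric for a rewrite when `cohens_kappa` falls below 0.6.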
### 7. When pass-rate lies
A 95% pass rate on a soft rubric tells you nothing. Sanity checks:
- **Null hypothesis**: how often does an empty answer `""` pass? Should be 0.
- **Canary wrong answer**: insert an obviously-wrong answer. Should always fail.
- **Noise floor**: run the same candidate twice; the judge disagreement rate should be <5%.
If any of these misbehave, fix them before trusting the number.
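The three sanity checks can be run against any judge wrapper. This sketch assumes a hypothetical `judge(example, candidate)` callable that returns the overall boolean pass verdict.

```python
def null_pass_rate(judge, examples):
    """Fraction of examples where an empty answer passes; should be 0."""
    return sum(bool(judge(ex, "")) for ex in examples) / len(examples)


def canary_pass_rate(judge, examples, wrong_answer):
    """Fraction of examples where an obviously wrong answer passes; should be 0."""
    return sum(bool(judge(ex, wrong_answer)) for ex in examples) / len(examples)


def noise_floor(judge, pairs):
    """Fraction of (example, candidate) pairs whose verdict flips between two calls."""
    flips = sum(judge(ex, cand) != judge(ex, cand) for ex, cand in pairs)
    return flips / len(pairs)
```

With a real judge, `noise_floor` makes two API calls per pair, so it is worth running on a sample rather than the full golden set.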
### 8. Cost control
- Judging on every PR run is the default. If that is too expensive, run the full judge nightly and, on PRs, judge only the examples affected by the change.
- Cache candidate output by (prompt-hash, model-id). If nothing changed, reuse. Saves ~70% on typical iteration.
- Cache judge output by (example_id, candidate-hash, rubric_version). Forces re-run on rubric change.
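The two cache keys above can be derived with a stable hash. This is a sketch using SHA-256 and an in-memory dictionary; a real harness would likely persist the cache between CI runs, and the helper names are illustrative.

```python
import hashlib


def _digest(*parts):
    """Stable hex digest over an ordered tuple of strings."""
    h = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguous joins
        h.update(data)
    return h.hexdigest()


def candidate_cache_key(prompt, model_id):
    """Key for candidate output: changes when the prompt or the model changes."""
    return _digest("candidate", prompt, model_id)


def judge_cache_key(example_id, candidate, rubric_version):
    """Key for judge output: a new rubric version forces a re-run."""
    return _digest("judge", example_id, candidate, rubric_version)


class MemoCache:
    """Minimal in-memory cache; compute() runs only on a miss."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, key, compute):
        if key not in self._store:
            self._store[key] = compute()
        return self._store[key]
```

Because the rubric version is part of the judge key, bumping it invalidates every cached judgement without touching the candidate cache.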
### 9. Common pitfalls
1. **No refuse cases** → agent silently regresses on safety and no one notices.
2. **Mutating golden examples** → regression report becomes meaningless.
3. **Judge = candidate model** → collusion; judge rubber-stamps candidate.
4. **Pass-rate single metric** → hides which criterion is slipping. Track per-criterion rates.
5. **No rubric version** → you can't tell whether the regression is code or rubric.
6. **Flaky judge** → temperature >0 or missing structured output. Lock it down.
7. **Scoring on length** → candidate learns to be verbose. Include a `concise` criterion.
### 10. What 'good' looks like
- Golden set 50–200 examples, stratified, with refuse and adversarial cases.
- Per-criterion pass rates all ≥0.85.
- Judge smoke test: 3/3 pass.
- p95 full-run wall time: <10 min (if slower, parallelise or trim the set).
- Every PR carries a results.json diff comment with pass-rate delta and trace links.
Evaluation checklist
These checks should pass before you consider the pattern production-ready.
- Golden set has ≥50 examples covering every task type and ≥3 refuse cases.
- Rubric is versioned; results.json records the rubric version used.
- Judge runs at temperature=0 with structured JSON output.
- CI fails on >2-point pass-rate drop or any previously-passing-now-failing example.
- Judge smoke test (3 known examples) runs before main loop and aborts on disagreement.
- Per-criterion pass rates are tracked, not just overall pass-rate.
- Trace URLs from Langfuse/Braintrust are embedded in results.json rows.
- Running the harness twice on the same candidate produces ≥95% identical scores.
Common failure modes
Every check below traces back to a specific production failure. Read as: "I would think about X because in production Y can happen."
- [Staff] Eval scores suspiciously high; regressions slip through anyway
  - Trigger: Judge colludes with candidate (same model family, same biases)
  - Prevention: Use a different or stronger model for the judge; ideally a different vendor
- [Staff] Pass-rate stays flat while product quality drifts
  - Trigger: Rubric definitions silently change meaning (scoring rubric rewritten mid-quarter)
  - Prevention: Judge smoke test every run + quarterly human inter-rater agreement audit
- [Senior] CI greens on every PR but users hit new failure modes in prod
  - Trigger: Golden set no longer represents reality (stale examples, no new prod failures added)
  - Prevention: Append real production failures to goldens monthly; never delete historical examples
- [Senior] Scores climb but responses get wordier and less useful
  - Trigger: Judge rewards verbosity; pass-rate can be gamed with longer outputs
  - Prevention: Explicit `concise` criterion in rubric + hard length guard on candidate
- [Senior] Eval bill balloons from a few dollars to hundreds per CI run
  - Trigger: Every CI run re-executes candidate + judge with no caching
  - Prevention: Cache candidate + judge outputs keyed on prompt-hash and rubric-version
How this pack behaves across models
Measured pass rate and token usage per model, over the same golden set.
| Model | Pass rate | Avg tokens | Runs |
|---|---|---|---|
| claude-opus-4.6 (best) | 94% | 4,000 | 100 |
| claude-sonnet-4.6 | 91% | 3,800 | 100 |
| claude-haiku-4.5 | 83% | 3,500 | 100 |
How this pack stacks up
Head-to-head notes vs alternative patterns.
| Alternative | Axis | Winner | Note |
|---|---|---|---|
| Human raters | cost | This pack | Judge harness is ~100x cheaper per example than a human rater at comparable reliability on boolean rubrics. |
| Human raters | accuracy | Alternative | Humans still win on nuanced tasks (humour, tone, creative quality). Hybrid: judge for scale + human spot-check quarterly. |
| Offline metrics (BLEU / ROUGE / exact-match) | accuracy | This pack | BLEU / ROUGE / exact-match can't see rubric criteria like faithfulness or refusal. Judge captures the intent metrics can't. |
| Offline metrics (BLEU / ROUGE / exact-match) | latency | Alternative | Offline metrics run in milliseconds vs seconds per judge call. |
Injection surface, allow-list, and known issues
Injection surface
Medium
Last scanned
2026-04-16
Tool allow-list
- llm:generate
- llm:judge
- fs:write:results
- trace:emit
Version history
v0.1.0
2026-04-16
Added
- Initial pack with contract, runtime charter, tool spec
- Transfer matrix for Opus 4.6 / Sonnet 4.6 / Haiku 4.5
- CI integration recipe and judge smoke-test guard
- Telemetry from 100-run baseline
Seed pack — first release.
Official docs and implementation references
Anthropic: Create strong empirical evaluations
Canonical guidance on golden-set design and rubric construction from Anthropic's eval cookbook.
https://docs.anthropic.com/en/docs/test-and-evaluate/develop-tests
Langfuse: LLM-as-a-judge docs
Reference for wiring judge scores as traces with the rubric pattern used here.
https://langfuse.com/docs/scores/model-based-evaluations
Braintrust: Eval docs
Golden-set + judge workflow with deep diff UI; source for the baseline-comparison pattern.
https://www.braintrust.dev/docs/guides/evals
OpenAI Evals repo
Original open-source eval harness; informs the registry-of-evals structure and CI gating conventions.
https://github.com/openai/evals
Zheng et al.: Judging LLM-as-a-Judge (MT-Bench paper)
Academic foundation for LLM-as-judge reliability, position bias, and the stronger-judge-than-candidate rule.
https://arxiv.org/abs/2306.05685