Comparing golden-eval-harness ↔ rag-hybrid-bm25-vector

Pack comparison

Two packs, side-by-side. Merged comparisons, shared shape, and diff highlights in one view.

A, Community, eval, evaluator-optimizer, v0.1.0, Production-ready

Golden Eval Harness

golden-eval-harness

Golden set + LLM judge + traces + CI gates. Ships evals that stay green.

Install

npx attrition-sh pack install golden-eval-harness

Token budget

4,000

Pass rate

91%

Avg tokens

3,900

Publisher

Agent Workspace

Compatibility

claude-code, cursor, python-3.11, node-20, github-actions

Contract summary

3 required outputs, 4 permissions, 4 completion conditions.

out: evals/results.json
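
As a rough illustration of what a "CI gate" over a golden set can look like, here is a minimal sketch. The layout of evals/results.json is not documented on this page, so the schema below (a JSON array of objects with a boolean `passed` field), the helper names, and the 0.90 threshold are all assumptions made for the example, not the pack's actual contract.

```python
import json
from pathlib import Path


def load_results(path: str = "evals/results.json") -> list[dict]:
    """Read judge results from disk.

    Assumed (hypothetical) layout: a JSON array of objects such as
    {"id": "case-1", "passed": true}.
    """
    return json.loads(Path(path).read_text(encoding="utf-8"))


def pass_rate(results: list[dict]) -> float:
    """Fraction of golden-set cases the judge marked as passed."""
    if not results:
        return 0.0
    passed = sum(1 for r in results if r.get("passed") is True)
    return passed / len(results)


def gate(results: list[dict], threshold: float = 0.90) -> bool:
    """Return True when the pass rate meets the threshold.

    The 0.90 default is an illustrative value, not one declared by the pack.
    """
    return pass_rate(results) >= threshold
```

In a CI job, a script would typically call `gate(load_results())` and exit with a non-zero status when it returns False, so that a regression on the golden set fails the build.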

B, Community, rag, hybrid, v0.1.0, Recommended

Hybrid RAG: BM25 + Vector + Rerank

rag-hybrid-bm25-vector

Sparse + dense + cross-encoder. The 2025 retrieval default.

Install

npx attrition-sh pack install rag-hybrid-bm25-vector

Token budget

8,000

Pass rate

—

Avg tokens

—

Publisher

Agent Workspace

Compatibility

claude-code, cursor, python-3.11, node-20

Contract summary

2 required outputs, 3 permissions, 4 completion conditions.

out: outputs/retrieval.json
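
To make the "sparse + dense + cross-encoder" pipeline concrete, here is a minimal sketch of one common way to combine the stages. This page does not say how the pack merges BM25 and vector results, so reciprocal rank fusion (RRF) is used here as an assumed stand-in. The function names, the `k=60` constant, and the `rerank_score` callback (standing in for a cross-encoder) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids with reciprocal rank fusion.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1). Ties are broken by document id so the output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_retrieve(
    bm25_ranked: Sequence[str],
    vector_ranked: Sequence[str],
    rerank_score: Callable[[str], float],
    top_n: int = 3,
) -> list[str]:
    """Fuse sparse and dense rankings, then rerank a small candidate pool.

    rerank_score is a placeholder for a cross-encoder scoring one document
    against the query; only 2 * top_n fused candidates are reranked.
    """
    candidates = rrf_fuse([bm25_ranked, vector_ranked])[: top_n * 2]
    return sorted(candidates, key=rerank_score, reverse=True)[:top_n]
```

Reranking only a small fused pool is the usual design choice here, since cross-encoder scoring is the expensive step and the first two stages are cheap enough to over-fetch.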

Shared shape

What both packs have in common

Overlap across canonical pattern, compatibility, tags, and required packs.

Compatibility

claude-code, cursor, python-3.11, node-20

Merged comparisons

Head-to-head claims from both packs

Each row is attributed to the pack that authored it. The winner column is normalised to this compare view (A / B / Tie).

Source	Alternative	Axis	Winner	Note
A	human-only-eval	cost	A	Judge harness is ~100x cheaper per example than a human rater at comparable reliability on boolean rubrics.
A	human-only-eval	accuracy	Alternative	Humans still win on nuanced tasks (humour, tone, creative quality). Hybrid: judge for scale + human spot-check quarterly.
A	offline-metrics-only	accuracy	A	BLEU / ROUGE / exact-match can't see rubric criteria like faithfulness or refusal. Judge captures the intent metrics can't.
A	offline-metrics-only	latency	Alternative	Offline metrics run in milliseconds vs seconds per judge call.
B	pure-vector-rag	accuracy	B	Hybrid+rerank beats pure vector by ~10–20% NDCG@10 on technical corpora with proper nouns and code symbols.
B	pure-vector-rag	latency	Alternative	Pure vector is ~50ms faster — skip hybrid if latency budget is <80ms and accuracy is already acceptable.
B	pure-bm25	accuracy	B	BM25 alone loses on paraphrase and conceptual queries. Hybrid adds ~15 points recall on natural-language question sets.
B	pure-bm25	cost	Alternative	BM25-only has no embedding or rerank cost. Hybrid adds ~$0.001/query — negligible for most products but not all.
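
Several rows above quote NDCG@10. For readers who want to check such a claim on their own corpus, here is a minimal sketch of the standard metric. It assumes graded relevance labels for the returned documents in rank order, and it simplifies by computing the ideal ordering from the returned list only (that is, it assumes every relevant document was retrieved). The function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain: sum of rel_i / log2(i + 1) over the top k."""
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """DCG normalised by the DCG of the best possible ordering.

    Simplification: the ideal ordering is built from the same list, so
    relevant documents that were never retrieved are not penalised.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Averaging `ndcg_at_k` over a query set for each retriever and comparing the two means is one way to test a relative-improvement claim like the one in the table.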

Diff highlights

What each pack brings that the other doesn't

Unique coverage and any measurable gap between the two.

Unique to A — Golden Eval Harness

Comparisons not in B

human-only-eval (cost, accuracy), offline-metrics-only (accuracy, latency)

Compatibility A-only

github-actions

Tags A-only

eval, llm-as-judge, golden-set, ci, observability, evaluator-optimizer

Unique to B — Hybrid RAG: BM25 + Vector + Rerank

Comparisons not in A

pure-vector-rag (accuracy, latency), pure-bm25 (accuracy, cost)

Compatibility B-only

(none)

Tags B-only

rag, retrieval, bm25, vector-search, reranking, hybrid, citations

Token budget gap

Difference: 4,000 tokens (A declares 4,000, B declares 8,000). A (Golden Eval Harness) has the smaller declared token budget.
