Pack comparison
Two packs, side-by-side. Merged comparisons, shared shape, and diff highlights in one view.
golden-eval-harness
Golden set + LLM judge + traces + CI gates. Ships evals that stay green.
Install
npx attrition-sh pack install golden-eval-harness
Token budget
4,000
Pass rate
91%
Avg tokens
3,900
Publisher
Agent Workspace
Compatibility
Contract summary
3 required outputs, 4 permissions, 4 completion conditions.
out: evals/results.json
rag-hybrid-bm25-vector
Sparse + dense + cross-encoder. The 2025 retrieval default.
Install
npx attrition-sh pack install rag-hybrid-bm25-vector
Token budget
8,000
Pass rate
—
Avg tokens
—
Publisher
Agent Workspace
Compatibility
Contract summary
2 required outputs, 3 permissions, 4 completion conditions.
out: outputs/retrieval.json
Shared shape
What both packs have in common
Overlap across canonical pattern, compatibility, tags, and required packs.
Compatibility
Merged comparisons
Head-to-head claims from both packs
Each row is attributed to the pack that authored it. The Winner column is normalised to this compare view: A or B for the pack, Alternative for the approach it is compared against, or Tie.
| Source | Alternative | Axis | Winner | Note |
|---|---|---|---|---|
| A | human-only-eval | cost | A | Judge harness is ~100x cheaper per example than a human rater at comparable reliability on boolean rubrics. |
| A | human-only-eval | accuracy | Alternative | Humans still win on nuanced tasks (humour, tone, creative quality). Hybrid: judge for scale + human spot-check quarterly. |
| A | offline-metrics-only | accuracy | A | BLEU / ROUGE / exact-match can't see rubric criteria like faithfulness or refusal. Judge captures the intent metrics can't. |
| A | offline-metrics-only | latency | Alternative | Offline metrics run in milliseconds vs seconds per judge call. |
| B | pure-vector-rag | accuracy | B | Hybrid+rerank beats pure vector by ~10–20% NDCG@10 on technical corpora with proper nouns and code symbols. |
| B | pure-vector-rag | latency | Alternative | Pure vector is ~50ms faster — skip hybrid if latency budget is <80ms and accuracy is already acceptable. |
| B | pure-bm25 | accuracy | B | BM25 alone loses on paraphrase and conceptual queries. Hybrid adds ~15 points recall on natural-language question sets. |
| B | pure-bm25 | cost | Alternative | BM25-only has no embedding or rerank cost. Hybrid adds ~$0.001/query — negligible for most products but not all. |
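The winner normalisation described above can be sketched roughly as follows. This is a minimal illustration, not the site's actual implementation: the raw verdict labels (`self`, `other`, `tie`) and the `normalise_winner` helper are hypothetical names assumed for the example. Only the output labels (A, B, Alternative, Tie) come from this page.

```python
def normalise_winner(source: str, raw_verdict: str) -> str:
    """Map a pack-relative verdict onto the labels used in the compare view.

    Assumption: each pack records its own claims as "self", "other", or "tie".
    These raw labels are hypothetical; only the output labels appear on the page.
    """
    if source not in ("A", "B"):
        raise ValueError(f"unknown source: {source!r}")
    if raw_verdict == "self":
        return source          # the authoring pack wins, so report its letter
    if raw_verdict == "other":
        return "Alternative"   # the compared-against approach wins
    if raw_verdict == "tie":
        return "Tie"
    raise ValueError(f"unknown verdict: {raw_verdict!r}")


# A few rows from the table, with assumed pack-relative verdicts.
rows = [
    ("A", "human-only-eval", "cost", "self"),
    ("A", "human-only-eval", "accuracy", "other"),
    ("B", "pure-bm25", "accuracy", "self"),
]
normalised = [
    (src, alt, axis, normalise_winner(src, verdict))
    for src, alt, axis, verdict in rows
]
```

Keeping the mapping in one small function means a claim authored by either pack is labelled the same way regardless of which side of the view it appears on.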
Diff highlights
What each pack brings that the other doesn't
Unique coverage and any measurable gap between the two.
Unique to A — Golden Eval Harness
Comparisons not in B
Compatibility A-only
Tags A-only
Unique to B — Hybrid RAG: BM25 + Vector + Rerank
Comparisons not in A
Compatibility B-only
(none)
Tags B-only
Token budget gap
Difference: 4,000 tokens. A (Golden Eval Harness) is the cheaper declared contract.
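The gap is plain arithmetic on the two declared budgets (8,000 minus 4,000). A minimal sketch, assuming a hypothetical `budget_gap` helper that is not part of the attrition-sh tooling:

```python
def budget_gap(budget_a: int, budget_b: int) -> tuple[int, str]:
    """Return the absolute token gap and which side declares the smaller budget."""
    diff = abs(budget_a - budget_b)
    if budget_a < budget_b:
        cheaper = "A"
    elif budget_b < budget_a:
        cheaper = "B"
    else:
        cheaper = "Tie"
    return diff, cheaper


# Declared budgets from the two pack cards: A = 4,000 and B = 8,000 tokens.
gap, cheaper = budget_gap(4_000, 8_000)
print(f"Difference: {gap:,} tokens. {cheaper} is the cheaper declared contract.")
```

Note that this compares declared budgets only. The measured average (3,900 tokens for A) is a separate figure, and B has no measured average yet.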