Agent Workspace

Pack comparison

Two packs, side-by-side. Merged comparisons, shared shape, and diff highlights in one view.

A · Community · eval · evaluator-optimizer · v0.1.0 · Production-ready
Golden Eval Harness

golden-eval-harness

Golden set + LLM judge + traces + CI gates. Ships evals that stay green.

npx attrition-sh pack install golden-eval-harness

Token budget

4,000

Pass rate

91%

Avg tokens

3,900

Publisher

Agent Workspace

claude-code, cursor, python-3.11, node-20, github-actions

3 required outputs, 4 permissions, 4 completion conditions.

out: evals/results.json
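
The card mentions CI gates and an `evals/results.json` output but does not show the gate itself. The sketch below is one way such a gate might look. It assumes a hypothetical results schema (a JSON array of records with a boolean `passed` field) and a hypothetical 90% threshold; neither is taken from the pack.

```python
import json
from pathlib import Path


def pass_rate(results: list[dict]) -> float:
    """Fraction of golden examples marked as passed (0.0 for an empty run)."""
    if not results:
        return 0.0
    passed = sum(1 for record in results if record.get("passed") is True)
    return passed / len(results)


def gate_exit_code(results: list[dict], threshold: float = 0.90) -> int:
    """Return 0 when the pass rate meets the threshold, 1 otherwise."""
    return 0 if pass_rate(results) >= threshold else 1


def run_gate(path: str = "evals/results.json", threshold: float = 0.90) -> int:
    # Assumed schema: [{"id": "...", "passed": true}, ...]; adjust to the real file.
    results = json.loads(Path(path).read_text())
    print(f"pass rate: {pass_rate(results):.0%} (threshold {threshold:.0%})")
    return gate_exit_code(results, threshold)
```

In a CI step, calling `sys.exit(run_gate())` would fail the job whenever the pass rate drops below the threshold.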

B · Community · rag · hybrid · v0.1.0 · Recommended
Hybrid RAG: BM25 + Vector + Rerank

rag-hybrid-bm25-vector

Sparse + dense + cross-encoder. The 2025 retrieval default.

npx attrition-sh pack install rag-hybrid-bm25-vector

Token budget

8,000

Pass rate

Avg tokens

Publisher

Agent Workspace

claude-code, cursor, python-3.11, node-20

2 required outputs, 3 permissions, 4 completion conditions.

out: outputs/retrieval.json
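
The card names sparse, dense, and cross-encoder stages but does not say how the sparse and dense result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below. This is an assumption for illustration, not necessarily what the pack does; the document ids are made up, `k = 60` is the conventional RRF constant, and the cross-encoder rerank stage is left out.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc ids; each doc scores the sum of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["doc_a", "doc_b", "doc_c"]    # hypothetical sparse (BM25) results
vector_hits = ["doc_b", "doc_d", "doc_a"]  # hypothetical dense (vector) results
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists rise to the top, which is the behaviour a hybrid retriever relies on before handing candidates to a reranker.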

What both packs have in common

Overlap across canonical pattern, compatibility, tags, and required packs.

claude-code, cursor, python-3.11, node-20
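
The overlap above, and the A-only and B-only compatibility lists further down, can be reproduced with plain set operations on the compatibility entries from the two cards. The variable names are illustrative only.

```python
# Compatibility entries as listed on each pack card.
compat_a = {"claude-code", "cursor", "python-3.11", "node-20", "github-actions"}
compat_b = {"claude-code", "cursor", "python-3.11", "node-20"}

shared = sorted(compat_a & compat_b)  # entries both packs declare
a_only = sorted(compat_a - compat_b)  # entries only pack A declares
b_only = sorted(compat_b - compat_a)  # entries only pack B declares

print(shared)  # ['claude-code', 'cursor', 'node-20', 'python-3.11']
print(a_only)  # ['github-actions']
print(b_only)  # []
```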

Head-to-head claims from both packs

Each row is attributed to the pack that authored it. The winner column is normalised to this compare view (A / B / Tie).

Source | Alternative | Axis | Winner | Note
A | human-only-eval | cost | A | Judge harness is ~100x cheaper per example than a human rater at comparable reliability on boolean rubrics.
A | human-only-eval | accuracy | Alternative | Humans still win on nuanced tasks (humour, tone, creative quality). Hybrid: judge for scale + human spot-check quarterly.
A | offline-metrics-only | accuracy | A | BLEU / ROUGE / exact-match can't see rubric criteria like faithfulness or refusal. Judge captures the intent metrics can't.
A | offline-metrics-only | latency | Alternative | Offline metrics run in milliseconds vs seconds per judge call.
B | pure-vector-rag | accuracy | B | Hybrid+rerank beats pure vector by ~10–20% NDCG@10 on technical corpora with proper nouns and code symbols.
B | pure-vector-rag | latency | Alternative | Pure vector is ~50ms faster; skip hybrid if latency budget is <80ms and accuracy is already acceptable.
B | pure-bm25 | accuracy | B | BM25 alone loses on paraphrase and conceptual queries. Hybrid adds ~15 points recall on natural-language question sets.
B | pure-bm25 | cost | Alternative | BM25-only has no embedding or rerank cost. Hybrid adds ~$0.001/query, negligible for most products but not all.
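
Pack B's accuracy claim is stated in NDCG@10. For readers unfamiliar with the metric, here is a minimal sketch using the linear-gain formulation. The source does not say which variant was used, so treat this as an assumption. It also assumes the ideal ranking is built from the same retrieved list, which only holds when every relevant document was retrieved.

```python
import math


def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG normalised by the DCG of the best possible ordering of the same list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels in ranked order (1 = relevant, 0 = not); values are made up.
print(ndcg_at_k([1, 1, 0]))  # 1.0, already in ideal order
```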

What each pack brings that the other doesn't

Unique coverage and any measurable gap between the two.

Comparisons not in B

human-only-eval, human-only-eval, offline-metrics-only, offline-metrics-only

Compatibility A-only

github-actions

Tags A-only

eval, llm-as-judge, golden-set, ci, observability, evaluator-optimizer

Comparisons not in A

pure-vector-rag, pure-vector-rag, pure-bm25, pure-bm25

Compatibility B-only

(none)

Tags B-only

rag, retrieval, bm25, vector-search, reranking, hybrid, citations

Token budget difference: 4,000 tokens (A declares 4,000, B declares 8,000). A (Golden Eval Harness) is the cheaper declared contract.