Communityraghybridv0.1.0Recommended

Hybrid RAG: BM25 + Vector + Rerank

Sparse + dense + cross-encoder. The 2025 retrieval default.

Agent WorkspaceSigned (unverified)Verified publisher·Updated 2026-04-16·~0 installs this month · saved ~0 tokens

Install

One-line install

npx attrition-sh pack install rag-hybrid-bm25-vector

AGENTS.md snippet (Claude Code / Cursor)

Skill `rag-hybrid-bm25-vector` is installed at .claude/skills/rag-hybrid-bm25-vector/SKILL.md. Invoke whenever the user needs retrieval with both keyword precision and semantic recall, or when pure-vector RAG is hallucinating on proper nouns / code symbols / abbreviations. Always include the rerank stage for corpora >10k docs; emit citations alongside retrieved passages.

Raw Markdown

Machine-readable body for agent ingestion or copy/paste.

Download as .md

Telemetry

Not yet measured

Rediscovery cost

Skipping this saves ~42,000 tokens / 90 min of research.

Methodology

Measured 2026-04-16

Prompted a fresh Claude Sonnet 4.6 with 'design a production hybrid retrieval system beating pure vector'. Measured tokens until the output included RRF (with k=60), a specific rerank model, chunking strategy, citation contract, and latency budget. Averaged over 3 runs. Does not include the separate cost of discovering the right rerank model card.

Summary

Production-grade hybrid retrieval: BM25 for lexical recall, dense embeddings for semantic recall, Reciprocal Rank Fusion to merge, and a BGE / Cohere cross-encoder rerank stage for precision. Emits cited retrieved_docs ready to pass to a grounded-answer generator. Matches the retrieval stack Anthropic, Pinecone, and Weaviate recommend for >100k-doc corpora.

Fit and expected payoff

When this pack earns its extra structure, when to skip it, and what it should improve.

Use when

Situations where this pack earns its extra structure.

Corpus contains proper nouns, SKUs, error codes, or legal/medical terms — pure vector loses these.
You need >0.85 NDCG@10 on a domain-specific golden set.
Query distribution is mixed (keyword lookups + natural-language questions).
You can afford 150–400ms retrieval latency budget.

Avoid when

Keeps the pack from becoming a default hammer.

Corpus is <5k docs — a single BM25 or vector stage is usually enough.
Latency budget is <50ms (rerank adds ~80–200ms).
You have no labeled eval set yet — ship pure-vector first, measure, then upgrade.
Query distribution is 100% conversational paraphrase (dense-only may suffice).

What it improves

Expected outcomes if implemented well.

Recall@50 improves 8–20% vs pure vector on typical technical corpora.
Precision@5 improves 10–30% after cross-encoder rerank.
Citations surface per passage — no hallucinated provenance.
Single retrieval.json contract — the generator module can be swapped without touching retrieval.

Bounded invocation surface

Turns fuzzy LLM calls into bounded agent invocations (Tongyi NLA pattern).

Required outputs

retrieved_docs
citations

Permissions

search:bm25
search:vector
rerank:cross-encoder

Completion conditions

retrieved_docs has length between 3 and 10
each retrieved_doc has id, text, source, rerank_score
citations array maps each retrieved_doc to a canonical source URL or doc_id
total retrieved_doc text tokens <= 6000

Token budget

8,000

Output path

outputs/retrieval.json

Minimal instructions

Smallest useful starting point.

## Minimal setup (Python)

```bash
pip install rank-bm25 sentence-transformers qdrant-client FlagEmbedding
```

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from FlagEmbedding import FlagReranker

# --- index time ---
docs = [...]  # list[dict{id, text, source}]
tokenised = [d["text"].lower().split() for d in docs]
bm25 = BM25Okapi(tokenised)
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
dense = embedder.encode([d["text"] for d in docs], normalize_embeddings=True)
# persist dense to qdrant / pgvector

reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    scores = {}
    for r in rankings:
        for rank, doc_id in enumerate(r):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return scores

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    # 1. BM25 top 50
    bm25_ranked = sorted(
        enumerate(bm25.get_scores(query.lower().split())),
        key=lambda x: -x[1],
    )[:50]
    bm25_ids = [docs[i]["id"] for i, _ in bm25_ranked]

    # 2. Dense top 50
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    dense_ranked = sorted(
        enumerate(dense @ q_vec),
        key=lambda x: -x[1],
    )[:50]
    dense_ids = [docs[i]["id"] for i, _ in dense_ranked]

    # 3. RRF merge → take top 20
    fused = sorted(rrf([bm25_ids, dense_ids]).items(), key=lambda x: -x[1])[:20]
    by_id = {d["id"]: d for d in docs}
    candidates = [by_id[doc_id] for doc_id, _ in fused]

    # 4. Rerank
    pairs = [[query, c["text"]] for c in candidates]
    scores = reranker.compute_score(pairs, normalize=True)
    for c, s in zip(candidates, scores):
        c["rerank_score"] = float(s)
    candidates.sort(key=lambda c: -c["rerank_score"])
    return candidates[:top_k]
```

Emit `{retrieved_docs: [...], citations: [{doc_id, source}]}` per the contract.

Full instructions

Complete natural-language instruction set.

Show full instructions

## Full reference: hybrid retrieval stack

### 1. Why hybrid

Pure vector search fails on three query classes:

- **Exact tokens** — SKUs, error codes, product names, typos. Dense embeddings smear these into neighbourhood.
- **Negation and numbers** — "not 2FA", "v18", "80%". Embeddings ignore these routinely.
- **Rare terms** — proper nouns appearing <5 times in the corpus. No semantic gradient to lean on.

Pure BM25 fails on paraphrase and conceptual questions. Combining both with RRF fixes both failure modes at near-zero code cost. Cross-encoder reranking then pushes precision on the merged shortlist.

This is the stack Microsoft's "Azure AI Search" docs, Pinecone's hybrid-search guide, Weaviate's hybrid module, and Anthropic's 2024 contextual-retrieval post all converge on.

### 2. Pipeline

```
query ──► BM25 top-50 ──┐
                        ├──► RRF merge top-20 ──► cross-encoder rerank ──► top-5 ──► LLM
query ──► Dense top-50 ──┘
```

Knobs (with sane defaults):

| Knob | Default | Notes |
|---|---|---|
| Sparse candidates | 50 | Raise to 100 for technical corpora (code, errors). |
| Dense candidates | 50 | Match sparse. |
| RRF k | 60 | The original paper's value. Rarely move. |
| Merged pool | 20 | Input to reranker; 20–40 is the sweet spot. |
| Final top-k | 5 | 3–8 depending on generator token budget. |

### 3. BM25 index

Preprocessing matters more than the scoring formula:

- Lowercase, Unicode-normalise.
- Tokenise on whitespace + punctuation — but keep identifiers (`foo_bar`, `Foo.Bar`) intact for code corpora.
- Remove stopwords only for prose; keep them for code.
- Stem optionally (Porter or Snowball) for English prose.
- Store per-doc length; BM25 needs it.

Libraries: `rank_bm25` (Python, in-memory, fine to 1M docs), `Elasticsearch`/`OpenSearch` (scales), `tantivy` (Rust, fast).

### 4. Dense index

- **Embedding model**: `BAAI/bge-large-en-v1.5` is the 2025 default open-source model. `text-embedding-3-large` (OpenAI) or `voyage-3` (Voyage) if you use closed APIs. All three are cosine-similarity-normalised.
- **Store**: Qdrant, pgvector, Weaviate, Pinecone. Use HNSW with `M=32, ef_construction=200` as a reasonable default for <10M vectors.
- **Chunking**: 300–500 tokens per chunk with 20% overlap. Split on semantic boundaries (Markdown headings, paragraphs) not raw char count.

### 5. Reciprocal Rank Fusion (RRF)

```python
def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return scores
```

RRF is score-scale-free — BM25 scores (0–50) and cosine similarities (0–1) don't need normalisation. This is why it beats weighted linear combinations in practice.

### 6. Cross-encoder rerank

Bi-encoder retrievers (dense) embed query and doc independently. Cross-encoders concatenate `[CLS] query [SEP] doc` and run a single forward pass per pair — far more accurate, much slower. Only use them on the merged shortlist (~20 pairs).

Options:

- **BGE-reranker-large**: open source, ~560M params, ~80ms/pair on GPU, free.
- **Cohere Rerank v3**: hosted API, ~40ms/pair p95, $1/1k searches.
- **Voyage rerank-2**: hosted, competitive with Cohere.
- **ColBERT / ColBERTv2**: late-interaction, faster than cross-encoder at scale but indexing is heavier.

Budget: a 20-pair rerank on BGE-large-fp16 is ~1.6s on CPU, ~160ms on a T4 GPU. Batch the pairs.

### 7. Citations contract

Every retrieved doc must carry:

```json
{
  "id": "doc_42#chunk_3",
  "text": "…",
  "source": "https://docs.example.com/auth/oauth",
  "rerank_score": 0.91,
  "bm25_rank": 4,
  "dense_rank": 2
}
```

The generator renders citations as footnote-style superscripts. Never let the generator invent URLs — citations come only from `retrieved_docs[].source`.

### 8. Evaluation

You need a golden set of (query, ideal_doc_ids) pairs. Track:

- **Recall@50** of the merged pool — did the right doc make it to rerank?
- **NDCG@10** after rerank — is rank quality good?
- **Faithfulness** (LLM-as-judge) — does the generated answer actually use the retrieved docs?
- **Latency p95** at each stage.

Budget 50–200 labelled queries minimum. Reuse the `golden-eval-harness` pack to wire this into CI.

### 9. Common pitfalls

1. **Tokenising code like prose** → camelCase and snake_case split into pieces BM25 can't match. Use a code-aware tokeniser.
2. **Normalising RRF scores to top=1** → breaks the score-scale-free property. Leave them.
3. **Reranking the full corpus** → cost explodes. Rerank only the merged shortlist.
4. **Chunking too small** (<150 tokens) → dense retrieval loses context; reranker starves.
5. **Returning raw chunk text to the LLM** → no citation possible. Always carry `source` through.
6. **Reranker on an old query/passage pair format** → BGE expects `[query, passage]`; Cohere expects `documents[]`. Read the model card.

### 10. Cost envelope (per query, indicative)

| Stage | Cost | Latency |
|---|---|---|
| BM25 (50 hits, in-memory) | ~$0 | 2–10 ms |
| Dense (50 hits, Qdrant) | ~$0 | 15–40 ms |
| RRF merge | ~$0 | <1 ms |
| Rerank 20 pairs (Cohere) | ~$0.0010 | 40–80 ms |
| **Total** | **~$0.001** | **~100 ms** |

### 11. When to stop

If your eval shows Recall@50 >0.98 *and* post-rerank NDCG@10 plateaus, you are retrieval-bound no more — spend the next cycle on the generator (prompting, grounding checks) or on chunking quality, not on more fusion.

Evaluation checklist

These checks should pass before you consider the pattern production-ready.

Recall@50 on golden set improves vs pure vector baseline (≥5 points).
Rerank increases NDCG@10 vs raw RRF ordering (≥0.05 absolute).
Every retrieved_doc carries id, text, source, rerank_score.
Citations array length equals retrieved_docs length; no URL is synthesized.
p95 latency budget (e.g. 300ms) met under sustained load, measured with 100 parallel queries.
Failure when reranker times out: fall back to RRF-only ordering and log.
Token budget (6000 text tokens across retrieved_docs) enforced by truncation, not by omission of relevant hits.

Common failure modes

Every check below traces back to a specific production failure. Read as: "I would think about X because in production Y can happen."

Senior
Users search for content they just added; dense retrieval misses it
Trigger
Dense index stale after corpus update; partial re-index left inconsistency
Prevention
Atomic re-index to a new collection + version tag; swap only after full build
Senior
Reranker scores drop 20% overnight without a code change
Trigger
Reranker model updated; query/passage format spec drifted
Prevention
Pin the reranker model card as a test fixture; fail CI on format drift
Mid
BM25 can't find documents referenced by exact identifier
Trigger
Default prose tokeniser splits identifiers and code tokens wrongly
Prevention
Swap in a code-aware tokeniser when indexing code corpora
Staff
Citations link to the wrong paragraph after a chunker change
Trigger
Chunk IDs are sequential indices; any re-chunk invalidates all prior citations
Prevention
Chunk IDs must be content-hashes, not positional
Senior
Recall varies wildly between query types; each engineer tunes differently
Trigger
Reciprocal-Rank-Fusion k parameter being hand-tuned per query class
Prevention
Pin k=60 (literature default); only move it with golden-set A/B evidence
See also
golden-eval-harness

How this pack stacks up

Head-to-head notes vs alternative patterns.

Alternative	Axis	Winner	Note
pure-vector-rag Compare →	accuracy	This pack	Hybrid+rerank beats pure vector by ~10–20% NDCG@10 on technical corpora with proper nouns and code symbols.
pure-vector-rag Compare →	latency	Alternative	Pure vector is ~50ms faster — skip hybrid if latency budget is <80ms and accuracy is already acceptable.
pure-bm25 Compare →	accuracy	This pack	BM25 alone loses on paraphrase and conceptual queries. Hybrid adds ~15 points recall on natural-language question sets.
pure-bm25 Compare →	cost	Alternative	BM25-only has no embedding or rerank cost. Hybrid adds ~$0.001/query — negligible for most products but not all.

How this pack connects

golden-eval-harness pattern-decision-tree

Injection surface, allow-list, and known issues

Injection surface

Medium

Last scanned

2026-04-16

Tool allow-list

search:bm25search:vectorrerank:cross-encoder

Known issues

Prompt injection via retrieved passage content is a live risk — generator must treat retrieved text as untrusted and refuse to follow embedded instructions. Use a system-prompt guard.

Version history

v0.1.0
2026-04-16
Added
- Initial pack with BM25 + dense + RRF + BGE-rerank pipeline
- Contract with retrieved_docs and citations outputs
- Cost envelope and latency budget table
- Evaluation guidance keyed to golden-eval-harness pack
Seed pack — first release.

Official docs and implementation references

Anthropic — Contextual Retrieval

Contextual-retrieval post that formalises BM25 + embedding + rerank as the 2024+ default stack.

https://www.anthropic.com/news/contextual-retrieval

Microsoft — Hybrid search in Azure AI Search

Microsoft's official guidance on combining BM25 + vector with RRF; matches the pipeline in this pack.

https://learn.microsoft.com/en-us/azure/search/hybrid-search-overview

Pinecone — Hybrid search guide

Vendor-agnostic explanation of sparse+dense fusion, including alpha-weighted variants.

https://docs.pinecone.io/guides/search/hybrid-search

Weaviate — Hybrid search docs

Reference for the alpha parameter and server-side RRF implementation.

https://weaviate.io/developers/weaviate/search/hybrid

BGE Reranker model card

Primary source on the recommended open-source cross-encoder; includes input format requirements.

https://huggingface.co/BAAI/bge-reranker-large

ColBERTv2 paper

Late-interaction alternative to cross-encoder rerank when you need throughput at scale.

https://arxiv.org/abs/2112.01488

Reference implementations

Qdrant hybrid search example Pinecone hybrid search notebook