---
slug: "rag-hybrid-bm25-vector"
name: "Hybrid RAG: BM25 + Vector + Rerank"
packType: "rag"
canonicalPattern: "hybrid"
version: "0.1.0"
trust: "Community"
publisher: "Agent Workspace"
updatedAt: "2026-04-16"
---

# Hybrid RAG: BM25 + Vector + Rerank

> Sparse + dense + cross-encoder. The 2025 retrieval default.

## Summary

Production-grade hybrid retrieval: BM25 for lexical recall, dense embeddings for semantic recall, Reciprocal Rank Fusion to merge, and a BGE / Cohere cross-encoder rerank stage for precision. Emits cited retrieved_docs ready to pass to a grounded-answer generator. Matches the retrieval stack Anthropic, Pinecone, and Weaviate recommend for >100k-doc corpora.

## Install

```sh
npx attrition-sh pack install rag-hybrid-bm25-vector
```

### Claude Code / AGENTS.md snippet

```md
Skill `rag-hybrid-bm25-vector` is installed at .claude/skills/rag-hybrid-bm25-vector/SKILL.md. Invoke whenever the user needs retrieval with both keyword precision and semantic recall, or when pure-vector RAG is hallucinating on proper nouns / code symbols / abbreviations. Always include the rerank stage for corpora >10k docs; emit citations alongside retrieved passages.
```

## Contract

```json
{
  "requiredOutputs": [
    "retrieved_docs",
    "citations"
  ],
  "tokenBudget": 8000,
  "permissions": [
    "search:bm25",
    "search:vector",
    "rerank:cross-encoder"
  ],
  "completionConditions": [
    "retrieved_docs has length between 3 and 10",
    "each retrieved_doc has id, text, source, rerank_score",
    "citations array maps each retrieved_doc to a canonical source URL or doc_id",
    "total retrieved_doc text tokens <= 6000"
  ],
  "outputPath": "outputs/retrieval.json"
}
```

## Layers

_No three-layer split defined for this pack type._

## Use When

- Corpus contains proper nouns, SKUs, error codes, or legal/medical terms — pure vector loses these.
- You need >0.85 NDCG@10 on a domain-specific golden set.
- Query distribution is mixed (keyword lookups + natural-language questions).
- You can afford 150–400ms retrieval latency budget.

## Avoid When

- Corpus is <5k docs — a single BM25 or vector stage is usually enough.
- Latency budget is <50ms (rerank adds ~80–200ms).
- You have no labeled eval set yet — ship pure-vector first, measure, then upgrade.
- Query distribution is 100% conversational paraphrase (dense-only may suffice).

## Key Outcomes

- Recall@50 improves 8–20% vs pure vector on typical technical corpora.
- Precision@5 improves 10–30% after cross-encoder rerank.
- Citations surface per passage — no hallucinated provenance.
- Single retrieval.json contract — the generator module can be swapped without touching retrieval.

## Minimal Instructions

## Minimal setup (Python)

```bash
pip install rank-bm25 sentence-transformers qdrant-client FlagEmbedding
```

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from FlagEmbedding import FlagReranker

# --- index time ---
docs = [...]  # list[dict{id, text, source}]
tokenised = [d["text"].lower().split() for d in docs]
bm25 = BM25Okapi(tokenised)
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
dense = embedder.encode([d["text"] for d in docs], normalize_embeddings=True)
# persist dense to qdrant / pgvector

reranker = FlagReranker("BAAI/bge-reranker-large", use_fp16=True)

def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    scores = {}
    for r in rankings:
        for rank, doc_id in enumerate(r):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return scores

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    # 1. BM25 top 50
    bm25_ranked = sorted(
        enumerate(bm25.get_scores(query.lower().split())),
        key=lambda x: -x[1],
    )[:50]
    bm25_ids = [docs[i]["id"] for i, _ in bm25_ranked]

    # 2. Dense top 50
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    dense_ranked = sorted(
        enumerate(dense @ q_vec),
        key=lambda x: -x[1],
    )[:50]
    dense_ids = [docs[i]["id"] for i, _ in dense_ranked]

    # 3. RRF merge → take top 20
    fused = sorted(rrf([bm25_ids, dense_ids]).items(), key=lambda x: -x[1])[:20]
    by_id = {d["id"]: d for d in docs}
    candidates = [by_id[doc_id] for doc_id, _ in fused]

    # 4. Rerank
    pairs = [[query, c["text"]] for c in candidates]
    scores = reranker.compute_score(pairs, normalize=True)
    for c, s in zip(candidates, scores):
        c["rerank_score"] = float(s)
    candidates.sort(key=lambda c: -c["rerank_score"])
    return candidates[:top_k]
```

Emit `{retrieved_docs: [...], citations: [{doc_id, source}]}` per the contract.

## Full Instructions

## Full reference: hybrid retrieval stack

### 1. Why hybrid

Pure vector search fails on three query classes:

- **Exact tokens** — SKUs, error codes, product names, typos. Dense embeddings smear these into neighbourhood.
- **Negation and numbers** — "not 2FA", "v18", "80%". Embeddings ignore these routinely.
- **Rare terms** — proper nouns appearing <5 times in the corpus. No semantic gradient to lean on.

Pure BM25 fails on paraphrase and conceptual questions. Combining both with RRF fixes both failure modes at near-zero code cost. Cross-encoder reranking then pushes precision on the merged shortlist.

This is the stack Microsoft's "Azure AI Search" docs, Pinecone's hybrid-search guide, Weaviate's hybrid module, and Anthropic's 2024 contextual-retrieval post all converge on.

### 2. Pipeline

```
query ──► BM25 top-50 ──┐
                        ├──► RRF merge top-20 ──► cross-encoder rerank ──► top-5 ──► LLM
query ──► Dense top-50 ──┘
```

Knobs (with sane defaults):

| Knob | Default | Notes |
|---|---|---|
| Sparse candidates | 50 | Raise to 100 for technical corpora (code, errors). |
| Dense candidates | 50 | Match sparse. |
| RRF k | 60 | The original paper's value. Rarely move. |
| Merged pool | 20 | Input to reranker; 20–40 is the sweet spot. |
| Final top-k | 5 | 3–8 depending on generator token budget. |

### 3. BM25 index

Preprocessing matters more than the scoring formula:

- Lowercase, Unicode-normalise.
- Tokenise on whitespace + punctuation — but keep identifiers (`foo_bar`, `Foo.Bar`) intact for code corpora.
- Remove stopwords only for prose; keep them for code.
- Stem optionally (Porter or Snowball) for English prose.
- Store per-doc length; BM25 needs it.

Libraries: `rank_bm25` (Python, in-memory, fine to 1M docs), `Elasticsearch`/`OpenSearch` (scales), `tantivy` (Rust, fast).

### 4. Dense index

- **Embedding model**: `BAAI/bge-large-en-v1.5` is the 2025 default open-source model. `text-embedding-3-large` (OpenAI) or `voyage-3` (Voyage) if you use closed APIs. All three are cosine-similarity-normalised.
- **Store**: Qdrant, pgvector, Weaviate, Pinecone. Use HNSW with `M=32, ef_construction=200` as a reasonable default for <10M vectors.
- **Chunking**: 300–500 tokens per chunk with 20% overlap. Split on semantic boundaries (Markdown headings, paragraphs) not raw char count.

### 5. Reciprocal Rank Fusion (RRF)

```python
def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return scores
```

RRF is score-scale-free — BM25 scores (0–50) and cosine similarities (0–1) don't need normalisation. This is why it beats weighted linear combinations in practice.

### 6. Cross-encoder rerank

Bi-encoder retrievers (dense) embed query and doc independently. Cross-encoders concatenate `[CLS] query [SEP] doc` and run a single forward pass per pair — far more accurate, much slower. Only use them on the merged shortlist (~20 pairs).

Options:

- **BGE-reranker-large**: open source, ~560M params, ~80ms/pair on GPU, free.
- **Cohere Rerank v3**: hosted API, ~40ms/pair p95, $1/1k searches.
- **Voyage rerank-2**: hosted, competitive with Cohere.
- **ColBERT / ColBERTv2**: late-interaction, faster than cross-encoder at scale but indexing is heavier.

Budget: a 20-pair rerank on BGE-large-fp16 is ~1.6s on CPU, ~160ms on a T4 GPU. Batch the pairs.

### 7. Citations contract

Every retrieved doc must carry:

```json
{
  "id": "doc_42#chunk_3",
  "text": "…",
  "source": "https://docs.example.com/auth/oauth",
  "rerank_score": 0.91,
  "bm25_rank": 4,
  "dense_rank": 2
}
```

The generator renders citations as footnote-style superscripts. Never let the generator invent URLs — citations come only from `retrieved_docs[].source`.

### 8. Evaluation

You need a golden set of (query, ideal_doc_ids) pairs. Track:

- **Recall@50** of the merged pool — did the right doc make it to rerank?
- **NDCG@10** after rerank — is rank quality good?
- **Faithfulness** (LLM-as-judge) — does the generated answer actually use the retrieved docs?
- **Latency p95** at each stage.

Budget 50–200 labelled queries minimum. Reuse the `golden-eval-harness` pack to wire this into CI.

### 9. Common pitfalls

1. **Tokenising code like prose** → camelCase and snake_case split into pieces BM25 can't match. Use a code-aware tokeniser.
2. **Normalising RRF scores to top=1** → breaks the score-scale-free property. Leave them.
3. **Reranking the full corpus** → cost explodes. Rerank only the merged shortlist.
4. **Chunking too small** (<150 tokens) → dense retrieval loses context; reranker starves.
5. **Returning raw chunk text to the LLM** → no citation possible. Always carry `source` through.
6. **Reranker on an old query/passage pair format** → BGE expects `[query, passage]`; Cohere expects `documents[]`. Read the model card.

### 10. Cost envelope (per query, indicative)

| Stage | Cost | Latency |
|---|---|---|
| BM25 (50 hits, in-memory) | ~$0 | 2–10 ms |
| Dense (50 hits, Qdrant) | ~$0 | 15–40 ms |
| RRF merge | ~$0 | <1 ms |
| Rerank 20 pairs (Cohere) | ~$0.0010 | 40–80 ms |
| **Total** | **~$0.001** | **~100 ms** |

### 11. When to stop

If your eval shows Recall@50 >0.98 *and* post-rerank NDCG@10 plateaus, you are retrieval-bound no more — spend the next cycle on the generator (prompting, grounding checks) or on chunking quality, not on more fusion.

## Evaluation Checklist

- Recall@50 on golden set improves vs pure vector baseline (≥5 points).
- Rerank increases NDCG@10 vs raw RRF ordering (≥0.05 absolute).
- Every retrieved_doc carries id, text, source, rerank_score.
- Citations array length equals retrieved_docs length; no URL is synthesized.
- p95 latency budget (e.g. 300ms) met under sustained load, measured with 100 parallel queries.
- Failure when reranker times out: fall back to RRF-only ordering and log.
- Token budget (6000 text tokens across retrieved_docs) enforced by truncation, not by omission of relevant hits.

## Failure Modes

- **[SR] Users search for content they just added; dense retrieval misses it**
  - Trigger: Dense index stale after corpus update; partial re-index left inconsistency
  - Prevention: Atomic re-index to a new collection + version tag; swap only after full build
- **[SR] Reranker scores drop 20% overnight without a code change**
  - Trigger: Reranker model updated; query/passage format spec drifted
  - Prevention: Pin the reranker model card as a test fixture; fail CI on format drift
- **[MID] BM25 can't find documents referenced by exact identifier**
  - Trigger: Default prose tokeniser splits identifiers and code tokens wrongly
  - Prevention: Swap in a code-aware tokeniser when indexing code corpora
- **[STAFF] Citations link to the wrong paragraph after a chunker change**
  - Trigger: Chunk IDs are sequential indices; any re-chunk invalidates all prior citations
  - Prevention: Chunk IDs must be content-hashes, not positional
- **[SR] Recall varies wildly between query types; each engineer tunes differently**
  - Trigger: Reciprocal-Rank-Fusion k parameter being hand-tuned per query class
  - Prevention: Pin k=60 (literature default); only move it with golden-set A/B evidence (see: golden-eval-harness)

## Transfer Matrix

_No measured cross-model transfer data._

## Telemetry

_No telemetry recorded._

## Security Review

- Injection surface: **medium**
- Tool allow-list: search:bm25, search:vector, rerank:cross-encoder
- Last scanned: 2026-04-16

### Known issues
- Prompt injection via retrieved passage content is a live risk — generator must treat retrieved text as untrusted and refuse to follow embedded instructions. Use a system-prompt guard.

## Compares With

| Compared to | Axis | Winner | Note |
| --- | --- | --- | --- |
| `pure-vector-rag` | accuracy | self | Hybrid+rerank beats pure vector by ~10–20% NDCG@10 on technical corpora with proper nouns and code symbols. |
| `pure-vector-rag` | latency | other | Pure vector is ~50ms faster — skip hybrid if latency budget is <80ms and accuracy is already acceptable. |
| `pure-bm25` | accuracy | self | BM25 alone loses on paraphrase and conceptual queries. Hybrid adds ~15 points recall on natural-language question sets. |
| `pure-bm25` | cost | other | BM25-only has no embedding or rerank cost. Hybrid adds ~$0.001/query — negligible for most products but not all. |

## Related Packs

- `golden-eval-harness`
- `pattern-decision-tree`

## Changelog

### 0.1.0 — 2026-04-16
_Seed pack — first release._

**Added**
- Initial pack with BM25 + dense + RRF + BGE-rerank pipeline
- Contract with retrieved_docs and citations outputs
- Cost envelope and latency budget table
- Evaluation guidance keyed to golden-eval-harness pack

## Sources

- [Anthropic — Contextual Retrieval](https://www.anthropic.com/news/contextual-retrieval) — Contextual-retrieval post that formalises BM25 + embedding + rerank as the 2024+ default stack.
- [Microsoft — Hybrid search in Azure AI Search](https://learn.microsoft.com/en-us/azure/search/hybrid-search-overview) — Microsoft's official guidance on combining BM25 + vector with RRF; matches the pipeline in this pack.
- [Pinecone — Hybrid search guide](https://docs.pinecone.io/guides/search/hybrid-search) — Vendor-agnostic explanation of sparse+dense fusion, including alpha-weighted variants.
- [Weaviate — Hybrid search docs](https://weaviate.io/developers/weaviate/search/hybrid) — Reference for the alpha parameter and server-side RRF implementation.
- [BGE Reranker model card](https://huggingface.co/BAAI/bge-reranker-large) — Primary source on the recommended open-source cross-encoder; includes input format requirements.
- [ColBERTv2 paper](https://arxiv.org/abs/2112.01488) — Late-interaction alternative to cross-encoder rerank when you need throughput at scale.

## Examples

- [Qdrant hybrid search example](https://qdrant.tech/documentation/tutorials/hybrid-search/) (external)
- [Pinecone hybrid search notebook](https://docs.pinecone.io/guides/search/hybrid-search) (external)
