How the Pipeline Works
Five sequential stages transform a natural language query into ranked results. Each stage solves a specific retrieval problem that the previous stages cannot solve.
1. Query Analyzer
claude-haiku-4-5 · < 500ms
Objective: Transform an unstructured natural language query into a structured representation that downstream stages can use for precise retrieval and filtering.
The problem it solves
BM25 treats "cozy ramen locals love not too loud" as seven separate keywords. It has no concept that "not too loud" is a negative constraint, or that "locals love" is a sentiment signal.
Why this approach
The space of food queries is effectively infinite. "Something warm for a rainy night", "anniversary dinner under $80" — queries like these require genuine language understanding. Rule-based parsing can't handle this diversity.
How it works
- Raw query is wrapped in a structured prompt with output schema definition
- Claude Haiku generates JSON with rewritten_query (expanded keywords for BM25), hyde_document (ideal answer for dense retrieval), intent, search target, filters, and negatives
- The HyDE document is a hypothetical perfect restaurant description — when embedded, it lands near real relevant restaurants in vector space
- If the API call fails, a fallback passes the raw query through (BM25 still works)
2. Hybrid Retrieval
BGE-M3 + Elastic · < 200ms
Objective: Find the top-100 candidate documents from 13,591 restaurants using three complementary retrieval methods, then fuse them into one ranked list.
The problem it solves
BM25 only matches exact keywords ("ramen" but not "noodle soup"). Dense-only retrieval catches semantics but can miss specific terms. Neither alone is sufficient.
Why this approach
Each method has different blind spots. BM25 misses "cozy" → "intimate". Dense misses exact dish names. RRF combines them without weight tuning — documents that rank well in multiple methods are almost certainly relevant.
How it works
- BM25 path: rewritten_query → Elastic multi_match on name, categories, embedding_text, reviews (top 100)
- Dense path: HyDE document → BGE-M3 embedding → Elastic kNN cosine search on dense_vector field (top 100)
- Sparse path: rewritten_query → BGE-M3 sparse encoding → Elastic rank_features on sparse_vector (top 100)
- All three run in parallel (~200ms total)
- RRF fusion: score = Σ 1/(k + rank_i) — documents appearing in multiple lists get higher scores
- Output: top-50 candidates ordered by fused score
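The RRF fusion step is simple enough to show in full. This sketch assumes each retrieval path returns an ordered list of document IDs; `k = 60` is the conventional default from the original RRF formulation, and the pipeline's actual constant is an assumption here.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists: list[list[str]], k: int = 60, top_n: int = 50) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_i).

    Documents that appear high in several lists accumulate more score,
    so cross-method agreement wins without any weight tuning.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]
```

Because only ranks matter, RRF is indifferent to the incompatible score scales of BM25, cosine similarity, and sparse matching.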
3. Cross-Encoder Reranking
bge-reranker-v2-m3 · < 300ms
Objective: Re-score the top-50 candidates using a model that reads the query and document together, catching fine-grained relevance signals that bi-encoders miss.
The problem it solves
Bi-encoder retrieval encodes query and documents separately — they can't attend to word-level interactions. "Spicy Thai" might match a Thai restaurant that reviews describe as "not spicy at all."
Why this approach
Cross-encoders are more accurate than bi-encoders because they see both texts simultaneously. Too slow for initial retrieval (O(n) for all docs) but perfect for reranking a small candidate set.
How it works
- For each candidate, build a text pair: [query, "Restaurant Name. Categories. Embedding text. Stars: 4.5. Price: $$"]
- Pass all 50 pairs through bge-reranker-v2-m3 (568M params)
- The model uses cross-attention: every query word attends to every document word (and vice versa)
- Output: one float score per pair — higher means more relevant
- Sort by score, take top 20
- This is the compute bottleneck: 50 transformer forward passes on CPU (~10-20s cold, ~300ms warm)
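The pair-building and top-k selection above can be sketched like this. The candidate field names (`name`, `categories`, `stars`, `price`, `embedding_text`) are assumptions based on the format described above, and the model call is abstracted behind `score_fn` — in the real pipeline that would be a call into the bge-reranker-v2-m3 cross-encoder.

```python
def build_pairs(query: str, candidates: list[dict]) -> list[list[str]]:
    """Format each candidate as the document side of a (query, doc) pair."""
    pairs = []
    for c in candidates:
        doc = (f"{c['name']}. {', '.join(c['categories'])}. "
               f"{c.get('embedding_text', '')} "
               f"Stars: {c['stars']}. Price: {c.get('price', 'n/a')}")
        pairs.append([query, doc])
    return pairs


def rerank(query: str, candidates: list[dict], score_fn, top_k: int = 20) -> list[dict]:
    """Score all (query, doc) pairs with the cross-encoder, keep the top_k.

    `score_fn` takes a list of pairs and returns one float per pair,
    higher meaning more relevant.
    """
    scores = score_fn(build_pairs(query, candidates))
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```

Keeping the scorer injectable also makes the bottleneck easy to benchmark: swap in a stub to measure everything except the 50 forward passes.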
4. LLM Listwise Reranking
claude-haiku-4-5 · < 2000ms
Objective: Holistically reorder the final candidates considering the full query intent — including negative constraints, occasion, and atmosphere — and provide human-readable reasoning for each top result.
The problem it solves
Cross-encoders score each pair independently. They cannot compare candidates against each other or reason about constraints like "not too loud" across the full set.
Why this approach
Only an LLM can reason: "this restaurant's reviews mention quiet atmosphere — better match for the not-loud constraint than that one, which reviews describe as lively." Pairwise models cannot do comparative reasoning.
How it works
- Format top-20 candidates into a numbered list with name, categories, price, stars, and review excerpt
- Send to Claude Haiku with a system prompt defining ranking criteria (relevance > atmosphere > dietary > price > sentiment)
- The LLM returns JSON with a reordered ranking and natural-language reasoning for the top 3
- Attach match_reason to each result — this becomes the "why this result" text shown to users
- If the LLM returns invalid JSON, fall back to the cross-encoder ordering
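The validate-and-fall-back logic above might look like the following sketch. The payload schema (a `ranking` list of 1-based candidate indices plus a `reasons` map) is an illustrative assumption, not the pipeline's actual contract.

```python
import json


def apply_llm_ranking(candidates: list[dict], llm_json: str) -> list[dict]:
    """Reorder candidates per the LLM's ranking; fall back on invalid output.

    Expects a payload like {"ranking": [3, 1, 2, ...], "reasons": {"3": "..."}}
    where indices are 1-based positions in the candidate list. Any parse or
    validation failure keeps the cross-encoder ordering untouched.
    """
    try:
        payload = json.loads(llm_json)
        order = payload["ranking"]
        # Every index must appear exactly once, else the ranking is unusable.
        if sorted(order) != list(range(1, len(candidates) + 1)):
            raise ValueError("incomplete ranking")
    except (json.JSONDecodeError, KeyError, ValueError, TypeError):
        return candidates  # fall back to the cross-encoder ordering
    reordered = [dict(candidates[i - 1]) for i in order]
    for rank, result in enumerate(reordered, start=1):
        reason = payload.get("reasons", {}).get(str(order[rank - 1]))
        if reason:
            result["match_reason"] = reason  # the "why this result" text
    return reordered
```

Validating that the ranking is a true permutation matters: an LLM that drops or duplicates a candidate would silently lose results without this check.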
Technical Decisions
Why RRF over learned fusion
Parameter-free — no weights to tune. RRF delivers a strong baseline (+63% NDCG over BM25) without the risk of overfitting to a small eval set.
Why CPU over GPU inference
Cloud Run CPU instances scale to zero (no idle cost). GPU would reduce latency but costs ~10x more at portfolio traffic levels. CPU-only keeps the demo free.
Why Elastic Serverless
Scale-to-zero pricing suits portfolio budgets. Supports BM25, dense kNN, and sparse rank_features in one index — no separate vector DB needed.
Why Claude Haiku over Sonnet
The API key only has Haiku access — and honestly, Haiku provides sufficient quality for query analysis and adds qualitative reasoning (match_reason) with only a marginal NDCG difference.
Models Used
| Model | Type | Params | Input | Output | Hosted On |
|---|---|---|---|---|---|
| BAAI/bge-m3 | Bi-encoder | 568M | Text (512 tokens) | 1024-dim dense + sparse | Cloud Run (CPU) |
| BAAI/bge-reranker-v2-m3 | Cross-encoder | 568M | (query, doc) pair | Relevance score | Cloud Run (CPU) |
| Claude Haiku 4.5 | LLM | Unknown (API) | Prompt + query | Structured JSON | Anthropic API |
| Claude Haiku 4.5 | LLM | Unknown (API) | Prompt + candidates | Ranked list | Anthropic API |
How do we know it works?
See the full ablation study, per-query-type breakdown, failure analysis, and custom reranker comparison.
See it in action
Click a query to search — then open the inspector panel to see how each stage changes the ranking.