How the Pipeline Works

Four sequential stages transform a natural language query into ranked results. Each stage solves a specific retrieval problem that the previous stages cannot.

User Query

Stage 1 · Query Analyzer

claude-haiku-4-5 · < 500ms

Objective: Transform an unstructured natural language query into a structured representation that downstream stages can use for precise retrieval and filtering.

The problem it solves

BM25 treats "cozy ramen locals love not too loud" as seven separate keywords. It has no concept that "not too loud" is a negative constraint, or that "locals love" is a sentiment signal.

Why this approach

The space of food queries is effectively infinite. "Something warm for a rainy night" and "anniversary dinner under $80" both require genuine language understanding. Rule-based parsing can't handle this diversity.

How it works

  1. Raw query is wrapped in a structured prompt with output schema definition
  2. Claude Haiku generates JSON with rewritten_query (expanded keywords for BM25), hyde_document (ideal answer for dense retrieval), intent, search target, filters, and negatives
  3. The HyDE document is a hypothetical perfect restaurant description — when embedded, it lands near real relevant restaurants in vector space
  4. If the API call fails, a fallback passes the raw query through (BM25 still works)
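The analyze-and-fallback flow above can be sketched as follows. This is a minimal illustration, not the actual implementation: `llm_call` stands in for the Claude Haiku API call, and the exact prompt wording and default values are assumptions; only the output keys come from the example below.

```python
import json

def analyze_query(raw_query: str, llm_call) -> dict:
    """Wrap the raw query in a structured prompt, parse the JSON reply,
    and fall back to the raw query on any failure."""
    prompt = (
        "Return only JSON with keys: rewritten_query, hyde_document, "
        "intent, categories, negative_constraints.\n"
        f"Query: {raw_query}"
    )
    fallback = {
        "rewritten_query": raw_query,   # BM25 still works on the raw text
        "hyde_document": raw_query,
        "intent": "discovery",
        "categories": [],
        "negative_constraints": [],
    }
    try:
        parsed = json.loads(llm_call(prompt))
        return {**fallback, **parsed}   # keep defaults for any missing keys
    except Exception:                   # API error or invalid JSON
        return fallback
```

Because the fallback preserves `rewritten_query = raw_query`, a failed LLM call degrades the pipeline to plain keyword search instead of breaking it.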

Example input / output

Input:
"cozy ramen locals love not too loud"
Output:
intent: discovery
rewritten_query: "cozy ramen noodle soup warm atmosphere local favorite"
hyde_document: "A beloved neighborhood ramen shop with rich, complex broth simmered for hours. Warm lighting, communal seating, and a relaxed vibe..."
categories: ["Ramen", "Japanese"]
negative_constraints: ["loud atmosphere"]
Cost: ~$0.0001/query · Latency: < 500ms
Stage 2 · Hybrid Retrieval

BGE-M3 + Elastic · < 200ms

Objective: Retrieve the top-100 candidates per method from 13,591 restaurants using three complementary retrieval methods, then fuse them into a single top-50 ranked list.

The problem it solves

BM25 only matches exact keywords ("ramen" but not "noodle soup"). Dense-only retrieval catches semantics but can miss specific terms. Neither alone is sufficient.

Why this approach

Each method has different blind spots. BM25 misses "cozy" → "intimate". Dense misses exact dish names. RRF combines them without weight tuning — documents that rank well in multiple methods are almost certainly relevant.

How it works

  1. BM25 path: rewritten_query → Elastic multi_match on name, categories, embedding_text, reviews (top 100)
  2. Dense path: HyDE document → BGE-M3 embedding → Elastic kNN cosine search on dense_vector field (top 100)
  3. Sparse path: rewritten_query → BGE-M3 sparse encoding → Elastic rank_features on sparse_vector (top 100)
  4. All three run in parallel (~200ms total)
  5. RRF fusion: score = Σ 1/(k + rank_i) — documents appearing in multiple lists get higher scores
  6. Output: top-50 candidates ordered by fused score
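The RRF fusion in step 5 is small enough to show in full. This is a generic sketch: `k = 60` is the common default for RRF, not a value stated here.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_n=50):
    """Reciprocal Rank Fusion: score(d) = sum_i 1 / (k + rank_i).
    Documents ranked high in several lists accumulate the most score."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort document ids by fused score, descending; keep the top_n
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Note there are no weights to learn: a document that appears in all three lists beats one that tops a single list, which is exactly the behavior the fusion step relies on.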

Example input / output

Input:
From Stage 1: rewritten_query + HyDE document embedding (1024-dim) + sparse vector
Output:
Top-50 candidates with RRF scores
Each doc has: restaurant_id, name, categories, stars, embedding_text, etc.
Cost: $0 (self-hosted) · Latency: < 200ms
BM25 + Dense + Sparse run in parallel, fused with RRF
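The three request bodies can be sketched as Elasticsearch query fragments. The field names (`name`, `categories`, `embedding_text`, `reviews`, `dense_vector`, `sparse_vector`) come from this page; the surrounding query structure, the placeholder input values, and `num_candidates` are assumptions about a typical Elastic setup.

```python
# Placeholder inputs (illustrative values only)
rewritten_query = "cozy ramen noodle soup warm atmosphere local favorite"
hyde_embedding = [0.0] * 1024                  # BGE-M3 dense embedding of the HyDE doc
sparse_weights = {"ramen": 1.2, "cozy": 0.8}   # BGE-M3 sparse token weights

# Path 1: BM25 keyword match over the text fields
bm25_body = {
    "size": 100,
    "query": {"multi_match": {
        "query": rewritten_query,
        "fields": ["name", "categories", "embedding_text", "reviews"],
    }},
}

# Path 2: kNN over the dense vector of the HyDE document
dense_body = {
    "knn": {"field": "dense_vector", "query_vector": hyde_embedding,
            "k": 100, "num_candidates": 500},
}

# Path 3: one rank_feature clause per non-zero sparse token weight
sparse_body = {
    "size": 100,
    "query": {"bool": {"should": [
        {"rank_feature": {"field": f"sparse_vector.{token}", "boost": weight}}
        for token, weight in sparse_weights.items()
    ]}},
}
```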
Stage 3 · Cross-Encoder Reranking

bge-reranker-v2-m3 · < 300ms

Objective: Re-score the top-50 candidates using a model that reads the query and document together, catching fine-grained relevance signals that bi-encoders miss.

The problem it solves

Bi-encoder retrieval encodes query and documents separately — they can't attend to word-level interactions. "Spicy Thai" might match a Thai restaurant that reviews describe as "not spicy at all."

Why this approach

Cross-encoders are more accurate than bi-encoders because they attend to both texts simultaneously. They are too slow for initial retrieval (one forward pass per document across the whole corpus) but ideal for reranking a small candidate set.

How it works

  1. For each candidate, build a text pair: [query, "Restaurant Name. Categories. Embedding text. Stars: 4.5. Price: $$"]
  2. Pass all 50 pairs through bge-reranker-v2-m3 (568M params)
  3. The model uses cross-attention: every query word attends to every document word (and vice versa)
  4. Output: one float score per pair — higher means more relevant
  5. Sort by score, take top 20
  6. This is the compute bottleneck: 50 transformer forward passes on CPU (~10-20s cold, ~300ms warm)
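The rerank step reduces to scoring pairs and sorting. In the sketch below `score_fn` is a pluggable stand-in for the cross-encoder (in practice something like `sentence_transformers.CrossEncoder("BAAI/bge-reranker-v2-m3").predict`), and the `doc_text` key is a hypothetical name for the formatted document string.

```python
def rerank(query: str, candidates: list, score_fn, top_k: int = 20) -> list:
    """Score (query, doc_text) pairs with a cross-encoder and keep top_k.
    `score_fn` maps a list of text pairs to a list of float scores."""
    pairs = [(query, c["doc_text"]) for c in candidates]
    scores = score_fn(pairs)                 # one forward pass per pair
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```

The separation of scoring from sorting also makes the bottleneck explicit: all the latency lives in `score_fn`, which is why warming the model matters so much here.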

Example input / output

Input:
50 pairs of (query, document_text)
Output:
Each pair gets a relevance score (float)
Re-sorted: top-20 candidates by cross-encoder score
Cost: $0 (self-hosted CPU) · Latency: < 300ms
Stage 4 · LLM Listwise Reranking

claude-haiku-4-5 · < 2000ms

Objective: Holistically reorder the final candidates considering the full query intent — including negative constraints, occasion, and atmosphere — and provide human-readable reasoning for each top result.

The problem it solves

Cross-encoders score each pair independently. They cannot compare candidates against each other or reason about constraints like "not too loud" across the full set.

Why this approach

Only an LLM can reason: "this restaurant's reviews mention quiet atmosphere — better match for the not-loud constraint than that one, which reviews describe as lively." Pairwise models cannot do comparative reasoning.

How it works

  1. Format top-20 candidates into a numbered list with name, categories, price, stars, and review excerpt
  2. Send to Claude Haiku with a system prompt defining ranking criteria (relevance > atmosphere > dietary > price > sentiment)
  3. The LLM returns a JSON with reordered ranking and natural-language reasoning for top 3
  4. Attach match_reason to each result — this becomes the "why this result" text shown to users
  5. If the LLM returns invalid JSON, fall back to the cross-encoder ordering
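Step 5's validation-with-fallback can be sketched as follows, assuming the reply JSON carries a 1-based `ranking` list (field name taken from the example output on this page); the permutation check is an added safeguard, not something the source specifies.

```python
import json

def apply_llm_ranking(candidates: list, llm_reply: str) -> list:
    """Reorder candidates by the LLM's 1-based `ranking` list; fall back to
    the incoming cross-encoder order when the reply is invalid JSON or the
    ranking is not a clean permutation of candidate positions."""
    try:
        ranking = json.loads(llm_reply)["ranking"]
        if sorted(ranking) != list(range(1, len(candidates) + 1)):
            return candidates           # malformed permutation: keep old order
        return [candidates[i - 1] for i in ranking]
    except Exception:                   # invalid JSON, missing key, wrong type
        return candidates
```

Because the fallback returns the cross-encoder ordering unchanged, an LLM failure at this stage costs only the reasoning text, never the results themselves.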

Example input / output

Input:
Original query + 20 candidates formatted as: "[1] Ramen Bar — Japanese, Ramen — $$ — 4.5★ — 'Rich tonkotsu broth...'"
Output:
ranking: [3, 1, 7, 2, ...] — reordered positions
top3_reasoning: [{ rank: 1, name: "Warm Bowl Ramen", reason: "Rich broth and intimate seating directly match the cozy atmosphere request" }]
Cost: ~$0.003/query · Latency: < 2000ms
Final Results (top-10)

Technical Decisions

Why RRF over learned fusion

Parameter-free — no weights to tune. RRF delivers a strong baseline (+63% NDCG over BM25) without the risk of overfitting to a small eval set.

Why CPU over GPU inference

Cloud Run CPU instances scale to zero (no idle cost). GPU would reduce latency but costs ~10x more at portfolio traffic levels. CPU-only keeps the demo free.

Why Elastic Serverless

Scale-to-zero pricing suits portfolio budgets. Supports BM25, dense kNN, and sparse rank_features in one index — no separate vector DB needed.

Why Claude Haiku over Sonnet

The API key only has Haiku access, and that constraint is stated honestly. In practice Haiku provides sufficient quality for query analysis and adds qualitative reasoning (match_reason) with only a marginal NDCG difference.

Models Used

Model                   | Type          | Params        | Input               | Output                  | Hosted On
BAAI/bge-m3             | Bi-encoder    | 568M          | Text (512 tokens)   | 1024-dim dense + sparse | Cloud Run (CPU)
BAAI/bge-reranker-v2-m3 | Cross-encoder | 568M          | (query, doc) pair   | Relevance score         | Cloud Run (CPU)
Claude Haiku 4.5        | LLM           | Unknown (API) | Prompt + query      | Structured JSON         | Anthropic API
Claude Haiku 4.5        | LLM           | Unknown (API) | Prompt + candidates | Ranked list             | Anthropic API

How do we know it works?

See the full ablation study, per-query-type breakdown, failure analysis, and custom reranker comparison.

Tech Stack

Search index: Elastic Cloud Serverless
Dense + sparse: BAAI/bge-m3
Cross-encoder: bge-reranker-v2-m3
LLM: Claude Haiku 4.5
Model serving: GCP Cloud Run (CPU)
API: FastAPI
UI: Next.js 15
Data: Yelp Open Dataset
Dataset: 13K restaurants, 53K dishes