How the Pipeline Works

Four sequential stages transform a natural language query into ranked results. Each stage solves a specific retrieval problem that the previous stages cannot.

User Query

Stage 1 · Query Analyzer

claude-haiku-4-5 · < 500ms

Objective: Transform an unstructured natural language query into a structured representation that downstream stages can use for precise retrieval and filtering.

The problem it solves

BM25 treats "cozy ramen locals love not too loud" as seven separate keywords. It has no concept that "not too loud" is a negative constraint, or that "locals love" is a sentiment signal.

Why this approach

The space of food queries is effectively infinite. "Something warm for a rainy night" and "anniversary dinner under $80" both require genuine language understanding. Rule-based parsing can't handle this diversity.

How it works

  1. Raw query is wrapped in a structured prompt with output schema definition
  2. Claude Haiku generates JSON with rewritten_query (expanded keywords for BM25), hyde_document (ideal answer for dense retrieval), intent, search target, filters, and negatives
  3. The HyDE document is a hypothetical perfect restaurant description — when embedded, it lands near real relevant restaurants in vector space
  4. If the API call fails, a fallback passes the raw query through (BM25 still works)
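The analyze-and-fallback flow above can be sketched as follows. This is a minimal illustration, not the actual implementation: `llm_call` stands in for the Claude Haiku API call, and the exact prompt wording and default values are assumptions; only the output keys come from the example below.

```python
import json

def analyze_query(raw_query: str, llm_call) -> dict:
    """Wrap the raw query in a structured prompt, parse the JSON reply,
    and fall back to the raw query on any failure."""
    prompt = (
        "Return only JSON with keys: rewritten_query, hyde_document, "
        "intent, categories, negative_constraints.\n"
        f"Query: {raw_query}"
    )
    fallback = {
        "rewritten_query": raw_query,   # BM25 still works on the raw text
        "hyde_document": raw_query,
        "intent": "discovery",
        "categories": [],
        "negative_constraints": [],
    }
    try:
        parsed = json.loads(llm_call(prompt))
        return {**fallback, **parsed}   # keep defaults for any missing keys
    except Exception:                   # API error or invalid JSON
        return fallback
```

Because the fallback preserves `rewritten_query = raw_query`, a failed LLM call degrades the pipeline to plain keyword search instead of breaking it.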

Example input / output

Input:
"cozy ramen locals love not too loud"
Output:
intent: discovery
rewritten_query: "cozy ramen noodle soup warm atmosphere local favorite"
hyde_document: "A beloved neighborhood ramen shop with rich, complex broth simmered for hours. Warm lighting, communal seating, and a relaxed vibe..."
categories: ["Ramen", "Japanese"]
negative_constraints: ["loud atmosphere"]
Cost: ~$0.0001/query · Latency: < 500ms
Stage 2 · Hybrid Retrieval

BGE-M3 + Elastic · < 200ms

Objective: Retrieve the top-100 candidates per method from 13,591 restaurants using three complementary retrieval methods, then fuse them into a single top-50 ranked list.

The problem it solves

BM25 only matches exact keywords ("ramen" but not "noodle soup"). Dense-only retrieval catches semantics but can miss specific terms. Neither alone is sufficient.

Why this approach

Each method has different blind spots. BM25 misses "cozy" → "intimate". Dense misses exact dish names. RRF combines them without weight tuning — documents that rank well in multiple methods are almost certainly relevant.

How it works

  1. BM25 path: rewritten_query → Elastic multi_match on name, categories, embedding_text, reviews (top 100)
  2. Dense path: HyDE document → BGE-M3 embedding → Elastic kNN cosine search on dense_vector field (top 100)
  3. Sparse path: rewritten_query → BGE-M3 sparse encoding → Elastic rank_features on sparse_vector (top 100)
  4. All three run in parallel (~200ms total)
  5. RRF fusion: score = Σ 1/(k + rank_i) — documents appearing in multiple lists get higher scores
  6. Output: top-50 candidates ordered by fused score
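The RRF fusion in step 5 is small enough to show in full. This is a generic sketch: `k = 60` is the common default for RRF, not a value stated here.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_n=50):
    """Reciprocal Rank Fusion: score(d) = sum_i 1 / (k + rank_i).
    Documents ranked high in several lists accumulate the most score."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort document ids by fused score, descending; keep the top_n
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Note there are no weights to learn: a document that appears in all three lists beats one that tops a single list, which is exactly the behavior the fusion step relies on.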

Example input / output

Input:
From Stage 1: rewritten_query + HyDE document embedding (1024-dim) + sparse vector
Output:
Top-50 candidates with RRF scores
Each doc has: restaurant_id, name, categories, stars, embedding_text, etc.
Cost: $0 (self-hosted) · Latency: < 200ms
BM25 + Dense + Sparse run in parallel, fused with RRF
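The three request bodies can be sketched as Elasticsearch query fragments. The field names (`name`, `categories`, `embedding_text`, `reviews`, `dense_vector`, `sparse_vector`) come from this page; the surrounding query structure, the placeholder input values, and `num_candidates` are assumptions about a typical Elastic setup.

```python
# Placeholder inputs (illustrative values only)
rewritten_query = "cozy ramen noodle soup warm atmosphere local favorite"
hyde_embedding = [0.0] * 1024                  # BGE-M3 dense embedding of the HyDE doc
sparse_weights = {"ramen": 1.2, "cozy": 0.8}   # BGE-M3 sparse token weights

# Path 1: BM25 keyword match over the text fields
bm25_body = {
    "size": 100,
    "query": {"multi_match": {
        "query": rewritten_query,
        "fields": ["name", "categories", "embedding_text", "reviews"],
    }},
}

# Path 2: kNN over the dense vector of the HyDE document
dense_body = {
    "knn": {"field": "dense_vector", "query_vector": hyde_embedding,
            "k": 100, "num_candidates": 500},
}

# Path 3: one rank_feature clause per non-zero sparse token weight
sparse_body = {
    "size": 100,
    "query": {"bool": {"should": [
        {"rank_feature": {"field": f"sparse_vector.{token}", "boost": weight}}
        for token, weight in sparse_weights.items()
    ]}},
}
```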
Stage 3 · Cross-Encoder Reranking

bge-reranker-v2-m3 · < 300ms

Objective: Re-score the top-50 candidates using a model that reads the query and document together, catching fine-grained relevance signals that bi-encoders miss.

The problem it solves

Bi-encoder retrieval encodes query and documents separately — they can't attend to word-level interactions. "Spicy Thai" might match a Thai restaurant that reviews describe as "not spicy at all."

Why this approach

Cross-encoders are more accurate than bi-encoders because they attend to both texts simultaneously. They are too slow for initial retrieval (one forward pass per document across the whole corpus) but ideal for reranking a small candidate set.

How it works

  1. For each candidate, build a text pair: [query, "Restaurant Name. Categories. Embedding text. Stars: 4.5. Price: $$"]
  2. Pass all 50 pairs through bge-reranker-v2-m3 (568M params)
  3. The model uses cross-attention: every query word attends to every document word (and vice versa)
  4. Output: one float score per pair — higher means more relevant
  5. Sort by score, take top 20
  6. This is the compute bottleneck: 50 transformer forward passes on CPU (~10-20s cold, ~300ms warm)
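The rerank step reduces to scoring pairs and sorting. In the sketch below `score_fn` is a pluggable stand-in for the cross-encoder (in practice something like `sentence_transformers.CrossEncoder("BAAI/bge-reranker-v2-m3").predict`), and the `doc_text` key is a hypothetical name for the formatted document string.

```python
def rerank(query: str, candidates: list, score_fn, top_k: int = 20) -> list:
    """Score (query, doc_text) pairs with a cross-encoder and keep top_k.
    `score_fn` maps a list of text pairs to a list of float scores."""
    pairs = [(query, c["doc_text"]) for c in candidates]
    scores = score_fn(pairs)                 # one forward pass per pair
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```

The separation of scoring from sorting also makes the bottleneck explicit: all the latency lives in `score_fn`, which is why warming the model matters so much here.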

Example input / output

Input:
50 pairs of (query, document_text)
Output:
Each pair gets a relevance score (float)
Re-sorted: top-20 candidates by cross-encoder score
Cost: $0 (self-hosted CPU) · Latency: < 300ms
Stage 4 · LLM Listwise Reranking

claude-haiku-4-5 · < 2000ms

Objective: Holistically reorder the final candidates considering the full query intent — including negative constraints, occasion, and atmosphere — and provide human-readable reasoning for each top result.

The problem it solves

Cross-encoders score each pair independently. They cannot compare candidates against each other or reason about constraints like "not too loud" across the full set.

Why this approach

Only an LLM can reason: "this restaurant's reviews mention quiet atmosphere — better match for the not-loud constraint than that one, which reviews describe as lively." Pairwise models cannot do comparative reasoning.

How it works

  1. Format top-20 candidates into a numbered list with name, categories, price, stars, and review excerpt
  2. Send to Claude Haiku with a system prompt defining ranking criteria (relevance > atmosphere > dietary > price > sentiment)
  3. The LLM returns a JSON with reordered ranking and natural-language reasoning for top 3
  4. Attach match_reason to each result — this becomes the "why this result" text shown to users
  5. If the LLM returns invalid JSON, fall back to the cross-encoder ordering
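Step 5's validation-with-fallback can be sketched as follows, assuming the reply JSON carries a 1-based `ranking` list (field name taken from the example output on this page); the permutation check is an added safeguard, not something the source specifies.

```python
import json

def apply_llm_ranking(candidates: list, llm_reply: str) -> list:
    """Reorder candidates by the LLM's 1-based `ranking` list; fall back to
    the incoming cross-encoder order when the reply is invalid JSON or the
    ranking is not a clean permutation of candidate positions."""
    try:
        ranking = json.loads(llm_reply)["ranking"]
        if sorted(ranking) != list(range(1, len(candidates) + 1)):
            return candidates           # malformed permutation: keep old order
        return [candidates[i - 1] for i in ranking]
    except Exception:                   # invalid JSON, missing key, wrong type
        return candidates
```

Because the fallback returns the cross-encoder ordering unchanged, an LLM failure at this stage costs only the reasoning text, never the results themselves.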

Example input / output

Input:
Original query + 20 candidates formatted as: "[1] Ramen Bar — Japanese, Ramen — $$ — 4.5★ — 'Rich tonkotsu broth...'"
Output:
ranking: [3, 1, 7, 2, ...] — reordered positions
top3_reasoning: [{ rank: 1, name: "Warm Bowl Ramen", reason: "Rich broth and intimate seating directly match the cozy atmosphere request" }]
Cost: ~$0.003/query · Latency: < 2000ms
Final Results (top-10)

Technical Decisions

Why RRF over learned fusion

Parameter-free — no weights to tune. RRF delivers a strong baseline (+63% NDCG over BM25) without the risk of overfitting to a small eval set.

Why CPU over GPU inference

Cloud Run CPU instances scale to zero (no idle cost). GPU would reduce latency but costs ~10x more at portfolio traffic levels. CPU-only keeps the demo free.

Why Elastic Serverless

Scale-to-zero pricing suits portfolio budgets. Supports BM25, dense kNN, and sparse rank_features in one index — no separate vector DB needed.

Why Claude Haiku over Sonnet

The API key only has Haiku access, and that constraint is stated honestly. In practice Haiku provides sufficient quality for query analysis and adds qualitative reasoning (match_reason) with only a marginal NDCG difference.

Models Used

Model                   | Type          | Params        | Input               | Output                  | Hosted On
BAAI/bge-m3             | Bi-encoder    | 568M          | Text (512 tokens)   | 1024-dim dense + sparse | Cloud Run (CPU)
BAAI/bge-reranker-v2-m3 | Cross-encoder | 568M          | (query, doc) pair   | Relevance score         | Cloud Run (CPU)
Claude Haiku 4.5        | LLM           | Unknown (API) | Prompt + query      | Structured JSON         | Anthropic API
Claude Haiku 4.5        | LLM           | Unknown (API) | Prompt + candidates | Ranked list             | Anthropic API

How do we know it works?

See the full ablation study, per-query-type breakdown, failure analysis, and custom reranker comparison.

Tech Stack

Search index: Elastic Cloud Serverless
Dense + sparse: BAAI/bge-m3
Cross-encoder: bge-reranker-v2-m3
LLM: Claude Haiku 4.5
Model serving: GCP Cloud Run (CPU)
API: FastAPI
UI: Next.js 15
Data: Yelp Open Dataset
Dataset: 13K restaurants, 53K dishes