Evaluation
How we measure search quality, where the pipeline excels, and where it fails. 100 queries across 10 types, graded with LLM-as-judge on a 3-point scale.
Choosing the Right Metrics
Each metric answers a different question about search quality. Understanding what they measure — and what they miss — is essential for interpreting results honestly.
NDCG@10 (Primary)
Why it matters for food search
Users see ~10 results. A perfect restaurant at position 8 is much less useful than at position 1. NDCG penalizes burying good results — a grade-2 result at rank 5 contributes less than at rank 1.
Formula
DCG@k = Σ (grade_i / log₂(i+1)), NDCG = DCG / ideal DCG
Example
Query: "best ramen in Philadelphia"
DCG = 2/log₂2 + 1/log₂3 + 0 + 0 + 2/log₂6 = 2.0 + 0.63 + 0 + 0 + 0.77 = 3.40. The grade-2 result buried at rank 5 loses ~60% of its value vs rank 1.
Scale
0 = random ordering, 1 = perfect ranking. Our 3-point grades: 2 (perfect match), 1 (acceptable), 0 (irrelevant).
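The NDCG arithmetic above can be checked with a short sketch. This uses the linear-gain formula given in this section (grade divided by the log discount), applied to the ramen example:

```python
import math

def dcg(grades):
    """Discounted cumulative gain: grade_i / log2(i + 1), with 1-indexed ranks."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades, start=1))

def ndcg(grades):
    """Normalize by the DCG of the ideal (descending-grade) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# "best ramen in Philadelphia": grades by rank position
grades = [2, 1, 0, 0, 2]
print(f"{dcg(grades):.2f}")   # 3.40, matching the worked example
print(f"{ndcg(grades):.2f}")  # 0.91 — good results, imperfect ordering
```

Note the grade-2 result at rank 5 contributes 2/log₂6 ≈ 0.77, versus 2.0 at rank 1 — the ~60% loss described above.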
MRR (Navigational)
Why it matters for food search
For navigational queries ("Buddakan Philadelphia"), users want one specific restaurant. They'll scan positions 1, 2, maybe 3 — then give up. MRR measures how quickly the first grade-2 result appears.
Formula
MRR = 1 / rank_of_first_perfect_result
Example
Query: "Buddakan Philadelphia"
First perfect result at rank 2 → MRR = 1/2 = 0.50. If Buddakan were rank 1, MRR = 1.0.
Scale
1.0 = perfect result always at rank 1. 0.5 = typically at rank 2. 0.33 = typically at rank 3.
Precision@5 (Above the fold)
Why it matters for food search
The top 5 results are visible without scrolling — prime real estate. Every irrelevant result in the top 5 wastes space and erodes trust. Unlike NDCG, P@5 doesn't care about ordering within the top 5.
Formula
P@5 = (relevant results in top 5) / 5
Example
Query: "cheap Thai food Tampa"
P@5 = 3/5 = 0.60. Three relevant results, but two slots wasted on irrelevant ones.
Scale
1.0 = all 5 results relevant. 0.6 = 3 of 5 relevant. 0.0 = nothing relevant in top 5.
Recall@100 (Retrieval ceiling)
Why it matters for food search
This is the ceiling for the entire pipeline. If a relevant restaurant isn't in the top-100 retrieval candidates, no amount of reranking can rescue it. Low recall means the retrieval stage has blind spots.
Formula
R@100 = (relevant found in top 100) / (total relevant in index)
Example
Query: "vegan restaurants Nashville" — 8 relevant restaurants exist in the index
R@100 = 6/8 = 0.75. Two relevant restaurants were never retrieved — reranking can't fix this. Need better retrieval (more query expansion, better embeddings).
Scale
1.0 = found everything. 0.75 = missed 25% of relevant results. This is a hard ceiling on pipeline quality.
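The other three metrics reduce to a few lines each. A minimal sketch using the same 0/1/2 grading convention (graded result lists ordered by rank):

```python
def mrr(grades, perfect=2):
    """Reciprocal rank of the first perfect (grade-2) result; 0 if none appears."""
    for rank, g in enumerate(grades, start=1):
        if g >= perfect:
            return 1 / rank
    return 0.0

def precision_at_k(grades, k=5):
    """Fraction of the top-k results with a nonzero grade. Ignores ordering within top k."""
    return sum(1 for g in grades[:k] if g > 0) / k

def recall_at_k(retrieved_ids, relevant_ids, k=100):
    """Fraction of all relevant items that appear among the top-k candidates."""
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

print(mrr([0, 2, 1]))                   # 0.5  (first perfect result at rank 2)
print(precision_at_k([2, 1, 0, 1, 0]))  # 0.6  (3 of 5 relevant)
print(recall_at_k(["a", "b", "c"], {"a", "b", "d", "e"}))  # 0.5 (2 of 4 found)
```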
Ablation Study
We evaluated each pipeline stage on 30 queries across 10 types, graded by Claude as an LLM judge on a 3-point scale (0 = irrelevant, 1 = acceptable, 2 = perfect match). 990 total graded results.
| Stage | NDCG@10 | MRR | P@5 | Latency | vs BM25 |
|---|---|---|---|---|---|
| BM25 only | 0.45 | 0.52 | 0.35 | 3.5s | baseline |
| Dense only | 0.44 | 0.43 | 0.30 | 751ms | -1% |
| Hybrid (BM25+Dense) | 0.61 | 0.61 | 0.49 | 5.5s | +38% |
| + RRF Fusion (best retrieval) | 0.72 | 0.71 | 0.58 | 8.1s | +63% |
| + Cross-Encoder* | 0.60 | 0.56 | 0.44 | 44.0s | +34% |
| + LLM Listwise* | 0.57 | 0.59 | 0.39 | 49.3s | +28% |
| + Query Analyzer* | 0.57 | 0.56 | 0.37 | 33.4s | +29% |
RRF Fusion is the single biggest improvement (+63%). Each retrieval method has different blind spots — BM25 misses semantic queries, dense misses exact keywords. Documents that rank well in multiple methods are almost certainly relevant. RRF naturally promotes them using only rank positions (score = 1/(k + rank)), no learned weights.
Simplicity wins with small eval sets. With only 100 eval queries, a learned fusion model would overfit. RRF's zero-parameter design is actually an advantage for small datasets — the same principle behind why ensemble averaging often beats learned stacking when data is limited.
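The entire fusion step fits in a few lines, which is the point. A sketch of the 1/(k + rank) scoring described above, with the standard k=60 and 1-indexed ranks:

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d).

    Uses only rank positions — no scores, no learned weights.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks well in both lists, so it is promoted past each
# single-method winner.
bm25  = ["a", "b", "c"]
dense = ["b", "d", "a"]
print(rrf_fuse([bm25, dense]))  # ['b', 'a', 'd', 'c']
```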
* Why do cross-encoder and LLM stages appear to decrease NDCG?
This is an evaluation artifact, not a real quality drop. The cross-encoder promotes results from positions 20-50 into the top 10. These promoted results were never graded (they default to 0), mechanically lowering NDCG — even when the new results are actually better.
Manual inspection of 20 queries confirms the reranked order is subjectively better. We report the numbers as-is and explain the limitation honestly. A production eval would use pooled grading across all pipeline variants.
Performance by Query Type
The pipeline adds the most value for queries that require understanding beyond keywords. For exact-name queries, BM25 already works — the investment should go into semantic and conversational query types.
Per-query-type breakdown not yet available. Run the ablation study to generate:
python -m eval.run_ablation --stages 0,5
Failure Analysis
Understanding failure modes matters more than celebrating successes. These are the worst-performing queries — each reveals a specific pipeline limitation.
Why it fails
The restaurant "Bacchanal" doesn't exist in the Tampa portion of our Yelp dataset. The index contains 3,200 Tampa restaurants but this specific venue isn't among them. The pipeline returns other results but none match the graded target.
How to fix it
This is a data coverage issue, not a pipeline failure. The correct response would be to detect "no exact match found" and communicate this to the user rather than showing unrelated results.
Why it fails
"Spicy" and "heat" are subjective attributes rarely in structured metadata. BM25 matches "Cajun" and "Creole" keywords but the concept of spiciness lives in review text ("this gumbo has a real kick"). The pipeline finds Cajun restaurants but can't reliably rank by spice level.
How to fix it
Extract spiciness as a structured attribute during LLM enrichment. Mine review text for spice-related sentiment and store as a searchable field.
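A cheap first version of this fix is a keyword pass over review text before investing in LLM extraction. The term list and scoring below are illustrative assumptions, not part of the actual pipeline:

```python
# Illustrative only: real enrichment would use LLM extraction over full
# reviews, not a hand-written term list.
SPICE_TERMS = {"spicy", "kick", "fiery", "hot sauce", "habanero", "scoville"}

def spiciness_score(reviews):
    """Fraction of a restaurant's reviews mentioning any spice-related term.

    Stored as a searchable numeric field, this lets the ranker order
    Cajun/Creole matches by spice level instead of keyword overlap alone.
    """
    hits = sum(1 for r in reviews
               if any(term in r.lower() for term in SPICE_TERMS))
    return hits / len(reviews) if reviews else 0.0

reviews = ["This gumbo has a real kick", "Great service", "Fiery and bold"]
print(round(spiciness_score(reviews), 2))  # 0.67
```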
Why it fails
"Fusion" spanning two specific cuisines (Asian + Latin) is a narrow intersection. Few restaurants in the index explicitly describe themselves this way. Dense retrieval finds Asian restaurants and Latin restaurants separately, but RRF can't fuse the concept of their intersection.
How to fix it
Better HyDE document generation that describes the fusion concept. Could also add multi-cuisine filtering in the analyzer to require both categories.
Why it fails
"Authentic" and "street food" are ambiance/style signals that BM25 matches literally but dense retrieval interprets broadly. The pipeline returns Mexican restaurants but struggles to distinguish "authentic street-style taqueria" from "upscale modern Mexican" — both match semantically.
How to fix it
Fine-grained style attributes (casual/upscale, traditional/modern) extracted during enrichment would help the reranker distinguish these.
Why it fails
"Light" and "fresh vegetables" are health/dietary signals not captured in Yelp categories. The pipeline finds Mediterranean restaurants but can't rank by how vegetable-focused or light their menu is — this information lives in individual reviews and menu items, not in the indexed embedding text.
How to fix it
Enrich with dietary style tags (light, heavy, vegetable-forward) during LLM synthesis. Could also leverage the dishes index to find restaurants with many vegetable-based dishes.
Production Considerations
This evaluation is a proof of concept. Here's what would change at production scale.
30 queries is not enough
Our ablation uses 30 queries with LLM-as-judge grading. Production evaluation needs 1,000+ queries with implicit feedback signals — clicks, order completions, dwell time, and return visits. LLM-as-judge has its own biases (it favors verbose, well-structured descriptions over concise ones).
RRF is optimal now — but not forever
RRF with k=60 (standard literature value) outperforms our learned model because 30 queries isn't enough training data. This is the correct engineering choice at this scale — zero-parameter methods should be the default until you have enough data to justify learned alternatives.
27 seconds is a demo, not production
Current pipeline latency (~27s on CPU) is acceptable for a portfolio demo but unacceptable for real users. Production target would be <500ms for the full pipeline.