Evaluation

How we measure search quality, where the pipeline excels, and where it fails. 100 queries across 10 types, graded with LLM-as-judge on a 3-point scale.

Choosing the Right Metrics

Each metric answers a different question about search quality. Understanding what they measure — and what they miss — is essential for interpreting results honestly.

NDCG@10

Primary

Why it matters for food search

Users see ~10 results. A perfect restaurant at position 8 is much less useful than at position 1. NDCG penalizes burying good results — a grade-2 result at rank 5 contributes less than at rank 1.

Formula

DCG@k = Σ (grade_i / log₂(i+1)), NDCG = DCG / ideal DCG

Example

Query: "best ramen in Philadelphia"

1. Ramen Bar (grade 2)
2. Noodle House (grade 1)
3. Pizza Place (grade 0)
4. Sushi Spot (grade 0)
5. Pho King (grade 2)

DCG = 2/log₂2 + 1/log₂3 + 0 + 0 + 2/log₂6 = 2.0 + 0.63 + 0 + 0 + 0.77 = 3.40. The grade-2 result buried at rank 5 loses ~60% of its value vs rank 1.

Scale

0 = random ordering, 1 = perfect ranking. Our 3-point grades: 2 (perfect match), 1 (acceptable), 0 (irrelevant).
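The DCG arithmetic above can be reproduced with a short sketch (the helper names and the hard-coded grade list are illustrative, not part of the pipeline code):

```python
import math

def dcg(grades):
    """Discounted cumulative gain: sum of grade_i / log2(i + 1) over 1-indexed ranks."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades, start=1))

def ndcg(grades):
    """DCG normalized by the DCG of the ideal (descending-grade) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# Grades for "best ramen in Philadelphia", in retrieved order
grades = [2, 1, 0, 0, 2]
print(round(dcg(grades), 2))   # → 3.4
print(round(ndcg(grades), 3))  # → 0.905
```

Note that even with a grade-2 result buried at rank 5, NDCG stays high (~0.905) because the top rank is perfect; the metric punishes burying far more when rank 1 is wrong.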

MRR

Navigational

Why it matters for food search

For navigational queries ("Buddakan Philadelphia"), users want one specific restaurant. They'll scan positions 1, 2, maybe 3 — then give up. MRR measures how quickly the first grade-2 result appears.

Formula

MRR = 1 / rank_of_first_perfect_result

Example

Query: "Buddakan Philadelphia"

1. Asian Palace (grade 0)
2. Buddakan (grade 2)
3. Mei Mei (grade 0)

First perfect result at rank 2 → MRR = 1/2 = 0.50. If Buddakan were rank 1, MRR = 1.0.

Scale

1.0 = perfect result always at rank 1. 0.5 = typically at rank 2. 0.33 = typically at rank 3.

Precision@5

Above the fold

Why it matters for food search

The top 5 results are visible without scrolling — prime real estate. Every irrelevant result in the top 5 wastes space and erodes trust. Unlike NDCG, P@5 doesn't care about ordering within the top 5.

Formula

P@5 = (relevant results in top 5) / 5

Example

Query: "cheap Thai food Tampa"

1. Thai Palace ✓
2. Burger Joint ✗
3. Thai Garden ✓
4. Pad Thai Co ✓
5. Steak House ✗

P@5 = 3/5 = 0.60. Three relevant results, but two slots wasted on irrelevant ones.

Scale

1.0 = all 5 results relevant. 0.6 = 3 of 5 relevant. 0.0 = nothing relevant in top 5.

Recall@100

Retrieval ceiling

Why it matters for food search

This is the ceiling for the entire pipeline. If a relevant restaurant isn't in the top-100 retrieval candidates, no amount of reranking can rescue it. Low recall means the retrieval stage has blind spots.

Formula

R@100 = (relevant found in top 100) / (total relevant in index)

Example

Query: "vegan restaurants Nashville" — 8 relevant restaurants exist in the index

Retrieved 6 of 8 in the top 100; missed 2.

R@100 = 6/8 = 0.75. Two relevant restaurants were never retrieved — reranking can't fix this. Need better retrieval (more query expansion, better embeddings).

Scale

1.0 = found everything. 0.75 = missed 25% of relevant results. This is a hard ceiling on pipeline quality.
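The three worked examples above (MRR for "Buddakan Philadelphia", P@5 for "cheap Thai food Tampa", R@100 for "vegan restaurants Nashville") can be checked with minimal sketches; the helper names and document ids are illustrative, not the pipeline's actual code:

```python
def mrr(grades_per_query):
    """Mean reciprocal rank of the first grade-2 (perfect) result per query."""
    total = 0.0
    for grades in grades_per_query:
        for rank, g in enumerate(grades, start=1):
            if g == 2:
                total += 1.0 / rank
                break
    return total / len(grades_per_query)

def precision_at_k(grades, k=5):
    """Fraction of the top-k results graded relevant (grade >= 1)."""
    return sum(1 for g in grades[:k] if g >= 1) / k

def recall_at_k(retrieved_ids, relevant_ids, k=100):
    """Fraction of all relevant documents found among the top-k candidates."""
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

print(mrr([[0, 2, 0]]))                  # first perfect result at rank 2 → 0.5
print(precision_at_k([2, 0, 2, 1, 0]))   # 3 of 5 relevant → 0.6
print(recall_at_k(["r1", "r2", "r3", "r4", "r5", "r6"],
                  {"r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"}))  # 6 of 8 → 0.75
```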

Ablation Study

We evaluated each pipeline stage on 30 queries across 10 types, graded by Claude as an LLM judge on a 3-point scale (0 = irrelevant, 1 = acceptable, 2 = perfect match). 990 total graded results.

| Stage | NDCG@10 | MRR | P@5 | Latency | vs BM25 |
| --- | --- | --- | --- | --- | --- |
| BM25 only | 0.45 | 0.52 | 0.35 | 3.5s | baseline |
| Dense only | 0.44 | 0.43 | 0.30 | 751ms | -1% |
| Hybrid (BM25+Dense) | 0.61 | 0.61 | 0.49 | 5.5s | +38% |
| + RRF Fusion (best retrieval) | 0.72 | 0.71 | 0.58 | 8.1s | +63% |
| + Cross-Encoder* | 0.60 | 0.56 | 0.44 | 44.0s | +34% |
| + LLM Listwise* | 0.57 | 0.59 | 0.39 | 49.3s | +28% |
| + Query Analyzer* | 0.57 | 0.56 | 0.37 | 33.4s | +29% |

RRF Fusion is the single biggest improvement (+63%). Each retrieval method has different blind spots — BM25 misses semantic queries, dense misses exact keywords. Documents that rank well in multiple methods are almost certainly relevant. RRF naturally promotes them using only rank positions (score = 1/(k + rank)), no learned weights.
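The rank-only scoring described above can be sketched in a few lines, using the standard k = 60 (the document ids here are illustrative, not real index entries):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank_d).

    Uses only rank positions — no scores from the underlying retrievers,
    no learned weights. Documents ranked well by multiple methods rise.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["ramen_bar", "pho_king", "pizza_place"]
dense_ranking = ["ramen_bar", "sushi_spot", "pho_king"]
# ramen_bar tops both lists, so it dominates; pho_king appears in both
# and outranks documents found by only one method.
print(rrf_fuse([bm25_ranking, dense_ranking]))
# → ['ramen_bar', 'pho_king', 'sushi_spot', 'pizza_place']
```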

Simplicity wins with small eval sets. With only 100 eval queries, a learned fusion model would overfit. RRF's zero-parameter design is actually an advantage for small datasets — the same principle behind why ensemble averaging often beats learned stacking when data is limited.

* Why do cross-encoder and LLM stages appear to decrease NDCG?

This is an evaluation artifact, not a real quality drop. The cross-encoder promotes results from positions 20-50 into the top 10. These promoted results were never graded (they default to 0), mechanically lowering NDCG — even when the new results are actually better.

Manual inspection of 20 queries confirms the reranked order is subjectively better. We report the numbers as-is and explain the limitation honestly. A production eval would use pooled grading across all pipeline variants.

Performance by Query Type

The pipeline adds the most value for queries that require understanding beyond keywords. For exact-name queries, BM25 already works — the investment should go into semantic and conversational.

Per-query-type breakdown not yet available. Run the ablation study to generate:

python -m eval.run_ablation --stages 0,5

Failure Analysis

Understanding failure modes matters more than celebrating successes. These are the worst-performing queries — each reveals a specific pipeline limitation.

Bacchanal Tampa Florida (exact name)
NDCG 0.00

Why it fails

The restaurant "Bacchanal" doesn't exist in the Tampa portion of our Yelp dataset. The index contains 3,200 Tampa restaurants but this specific venue isn't among them. The pipeline returns other results but none match the graded target.

How to fix it

This is a data coverage issue, not a pipeline failure. The correct response would be to detect "no exact match found" and communicate this to the user rather than showing unrelated results.

spicy food New Orleans Cajun Creole heat (attribute filter)
NDCG 0.25

Why it fails

"Spicy" and "heat" are subjective attributes rarely in structured metadata. BM25 matches "Cajun" and "Creole" keywords but the concept of spiciness lives in review text ("this gumbo has a real kick"). The pipeline finds Cajun restaurants but can't reliably rank by spice level.

How to fix it

Extract spiciness as a structured attribute during LLM enrichment. Mine review text for spice-related sentiment and store as a searchable field.

trendy fusion cuisine blending Asian and Latin flavors (cuisine semantic)
NDCG 0.25

Why it fails

"Fusion" spanning two specific cuisines (Asian + Latin) is a narrow intersection. Few restaurants in the index explicitly describe themselves this way. Dense retrieval finds Asian restaurants and Latin restaurants separately, but RRF only merges rankings — it cannot construct the intersection of the two concepts.

How to fix it

Better HyDE document generation that describes the fusion concept. Could also add multi-cuisine filtering in the analyzer to require both categories.

authentic Mexican street food with bold flavors and fresh toppings (cuisine semantic)
NDCG 0.26

Why it fails

"Authentic" and "street food" are ambiance/style signals that BM25 matches literally but dense retrieval interprets broadly. The pipeline returns Mexican restaurants but struggles to distinguish "authentic street-style taqueria" from "upscale modern Mexican" — both match semantically.

How to fix it

Fine-grained style attributes (casual/upscale, traditional/modern) extracted during enrichment would help the reranker distinguish these.

light Mediterranean flavors with lots of fresh vegetables (cuisine semantic)
NDCG 0.26

Why it fails

"Light" and "fresh vegetables" are health/dietary signals not captured in Yelp categories. The pipeline finds Mediterranean restaurants but can't rank by how vegetable-focused or light their menu is — this information lives in individual reviews and menu items, not in the indexed embedding text.

How to fix it

Enrich with dietary style tags (light, heavy, vegetable-forward) during LLM synthesis. Could also leverage the dishes index to find restaurants with many vegetable-based dishes.

Production Considerations

This evaluation is a proof of concept. Here's what would change at production scale.

Scale

30 queries is not enough

Our ablation uses 30 queries with LLM-as-judge grading. Production evaluation needs 1,000+ queries with implicit feedback signals — clicks, order completions, dwell time, and return visits. LLM-as-judge has its own biases (it favors verbose, well-structured descriptions over concise ones).

What we'd build: Click-through rate logging on search results, A/B test framework comparing pipeline variants on live traffic, and a grading pipeline that combines implicit signals (clicks at position k) with periodic human annotation.

Retrieval

RRF is optimal now — but not forever

RRF with k=60 (standard literature value) outperforms our learned model because 30 queries isn't enough training data. This is the correct engineering choice at this scale — zero-parameter methods should be the default until you have enough data to justify learned alternatives.

At scale (1,000+ queries with click logs): A small neural fusion layer trained on engagement signals would likely outperform fixed RRF. The fusion weights could be query-type-dependent — navigational queries might weight BM25 higher, while semantic queries weight dense retrieval higher.

Latency

27 seconds is a demo, not production

Current pipeline latency (~27s on CPU) is acceptable for a portfolio demo but unacceptable for real users. Production target would be <500ms for the full pipeline.

Three levers to get there:
1. Reduce rerank candidates (50 → 20): 3% quality loss, 60% latency reduction. Already documented in our experiments.
2. GPU inference: the cross-encoder on a T4 GPU drops from 800ms to ~50ms. Cloud Run GPU or a dedicated inference endpoint.
3. Skip the cross-encoder entirely: RRF alone achieves 0.72 NDCG. Go directly from RRF → LLM listwise, or replace both with the trained LambdaMART (~5ms) once sufficient training data exists.