Evaluation
How we measure search quality, where the pipeline excels, and where it fails. 100 queries across 10 types, graded with LLM-as-judge on a 3-point scale.
Choosing the Right Metrics
Each metric answers a different question about search quality. Understanding what they measure — and what they miss — is essential for interpreting results honestly.
NDCG@10 (Primary)
Why it matters for food search
Users see ~10 results. A perfect restaurant at position 8 is much less useful than at position 1. NDCG penalizes burying good results — a grade-2 result at rank 5 contributes less than at rank 1.
Formula
DCG@k = Σ (grade_i / log₂(i+1)), NDCG = DCG / ideal DCG
Example
Query: "best ramen in Philadelphia"
DCG = 2/log₂2 + 1/log₂3 + 0 + 0 + 2/log₂6 = 2.0 + 0.63 + 0 + 0 + 0.77 = 3.40. The grade-2 result buried at rank 5 loses ~60% of its value vs rank 1.
Scale
0 = random ordering, 1 = perfect ranking. Our 3-point grades: 2 (perfect match), 1 (acceptable), 0 (irrelevant).
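The NDCG arithmetic above can be checked with a short sketch. This uses the linear-gain formula given in this section (grade divided by the log discount), applied to the ramen example:

```python
import math

def dcg(grades):
    """Discounted cumulative gain: grade_i / log2(i + 1), with 1-indexed ranks."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(grades, start=1))

def ndcg(grades):
    """Normalize by the DCG of the ideal (descending-grade) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

# "best ramen in Philadelphia": grades by rank position
grades = [2, 1, 0, 0, 2]
print(f"{dcg(grades):.2f}")   # 3.40, matching the worked example
print(f"{ndcg(grades):.2f}")  # 0.91 — good results, imperfect ordering
```

Note the grade-2 result at rank 5 contributes 2/log₂6 ≈ 0.77, versus 2.0 at rank 1 — the ~60% loss described above.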
MRR (Navigational)
Why it matters for food search
For navigational queries ("Buddakan Philadelphia"), users want one specific restaurant. They'll scan positions 1, 2, maybe 3 — then give up. MRR measures how quickly the first grade-2 result appears.
Formula
MRR = 1 / rank_of_first_perfect_result
Example
Query: "Buddakan Philadelphia"
First perfect result at rank 2 → MRR = 1/2 = 0.50. If Buddakan were rank 1, MRR = 1.0.
Scale
1.0 = perfect result always at rank 1. 0.5 = typically at rank 2. 0.33 = typically at rank 3.
Precision@5 (Above the fold)
Why it matters for food search
The top 5 results are visible without scrolling — prime real estate. Every irrelevant result in the top 5 wastes space and erodes trust. Unlike NDCG, P@5 doesn't care about ordering within the top 5.
Formula
P@5 = (relevant results in top 5) / 5
Example
Query: "cheap Thai food Tampa"
P@5 = 3/5 = 0.60. Three relevant results, but two slots wasted on irrelevant ones.
Scale
1.0 = all 5 results relevant. 0.6 = 3 of 5 relevant. 0.0 = nothing relevant in top 5.
Recall@100 (Retrieval ceiling)
Why it matters for food search
This is the ceiling for the entire pipeline. If a relevant restaurant isn't in the top-100 retrieval candidates, no amount of reranking can rescue it. Low recall means the retrieval stage has blind spots.
Formula
R@100 = (relevant found in top 100) / (total relevant in index)
Example
Query: "vegan restaurants Nashville" — 8 relevant restaurants exist in the index
R@100 = 6/8 = 0.75. Two relevant restaurants were never retrieved — reranking can't fix this. Need better retrieval (more query expansion, better embeddings).
Scale
1.0 = found everything. 0.75 = missed 25% of relevant results. This is a hard ceiling on pipeline quality.
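The other three metrics reduce to a few lines each. A minimal sketch using the same 0/1/2 grading convention (graded result lists ordered by rank):

```python
def mrr(grades, perfect=2):
    """Reciprocal rank of the first perfect (grade-2) result; 0 if none appears."""
    for rank, g in enumerate(grades, start=1):
        if g >= perfect:
            return 1 / rank
    return 0.0

def precision_at_k(grades, k=5):
    """Fraction of the top-k results with a nonzero grade. Ignores ordering within top k."""
    return sum(1 for g in grades[:k] if g > 0) / k

def recall_at_k(retrieved_ids, relevant_ids, k=100):
    """Fraction of all relevant items that appear among the top-k candidates."""
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

print(mrr([0, 2, 1]))                   # 0.5  (first perfect result at rank 2)
print(precision_at_k([2, 1, 0, 1, 0]))  # 0.6  (3 of 5 relevant)
print(recall_at_k(["a", "b", "c"], {"a", "b", "d", "e"}))  # 0.5 (2 of 4 found)
```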
Ablation Study
We evaluated each pipeline stage on 30 queries across 10 types, graded by Claude as an LLM judge on a 3-point scale (0 = irrelevant, 1 = acceptable, 2 = perfect match). 990 total graded results.
| Stage | NDCG@10 | MRR | P@5 | Latency | vs BM25 |
|---|---|---|---|---|---|
| BM25 only | 0.45 | 0.52 | 0.35 | 3.5s | baseline |
| Dense only | 0.44 | 0.43 | 0.30 | 751ms | -1% |
| Hybrid (BM25+Dense) | 0.61 | 0.61 | 0.49 | 5.5s | +38% |
| + RRF Fusion (best retrieval) | 0.72 | 0.71 | 0.58 | 8.1s | +63% |
| + Cross-Encoder* | 0.60 | 0.56 | 0.44 | 44.0s | +34% |
| + LLM Listwise* | 0.57 | 0.59 | 0.39 | 49.3s | +28% |
| + Query Analyzer* | 0.57 | 0.56 | 0.37 | 33.4s | +29% |
RRF Fusion is the single biggest improvement (+63%). Each retrieval method has different blind spots — BM25 misses semantic queries, dense misses exact keywords. Documents that rank well in multiple methods are almost certainly relevant. RRF naturally promotes them using only rank positions (score = 1/(k + rank)), no learned weights.
Simplicity wins with small eval sets. With only 100 eval queries, a learned fusion model would overfit. RRF's zero-parameter design is actually an advantage for small datasets — the same principle behind why ensemble averaging often beats learned stacking when data is limited.
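The entire fusion step fits in a few lines, which is the point. A sketch of the 1/(k + rank) scoring described above, with the standard k=60 and 1-indexed ranks:

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank_d).

    Uses only rank positions — no scores, no learned weights.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks well in both lists, so it is promoted past each
# single-method winner.
bm25  = ["a", "b", "c"]
dense = ["b", "d", "a"]
print(rrf_fuse([bm25, dense]))  # ['b', 'a', 'd', 'c']
```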
* Why do cross-encoder and LLM stages appear to decrease NDCG?
This is an evaluation artifact, not a real quality drop. The cross-encoder promotes results from positions 20-50 into the top 10. These promoted results were never graded (they default to 0), mechanically lowering NDCG — even when the new results are actually better.
Manual inspection of 20 queries confirms the reranked order is subjectively better. We report the numbers as-is and explain the limitation honestly. A production eval would use pooled grading across all pipeline variants.
Performance by Query Type
The pipeline adds the most value for queries that require understanding beyond keywords. For exact-name queries, BM25 already works — the investment should go into semantic and conversational query types.
Per-query-type breakdown not yet available. Run the ablation study to generate:
python -m eval.run_ablation --stages 0,5
Failure Analysis
Understanding failure modes matters more than celebrating successes. These are the worst-performing queries — each reveals a specific pipeline limitation.
Why it fails
The restaurant "Bacchanal" doesn't exist in the Tampa portion of our Yelp dataset. The index contains 3,200 Tampa restaurants but this specific venue isn't among them. The pipeline returns other results but none match the graded target.
How to fix it
This is a data coverage issue, not a pipeline failure. The correct response would be to detect "no exact match found" and communicate this to the user rather than showing unrelated results.
Why it fails
"Spicy" and "heat" are subjective attributes rarely in structured metadata. BM25 matches "Cajun" and "Creole" keywords but the concept of spiciness lives in review text ("this gumbo has a real kick"). The pipeline finds Cajun restaurants but can't reliably rank by spice level.
How to fix it
Extract spiciness as a structured attribute during LLM enrichment. Mine review text for spice-related sentiment and store as a searchable field.
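A cheap first version of this fix is a keyword pass over review text before investing in LLM extraction. The term list and scoring below are illustrative assumptions, not part of the actual pipeline:

```python
# Illustrative only: real enrichment would use LLM extraction over full
# reviews, not a hand-written term list.
SPICE_TERMS = {"spicy", "kick", "fiery", "hot sauce", "habanero", "scoville"}

def spiciness_score(reviews):
    """Fraction of a restaurant's reviews mentioning any spice-related term.

    Stored as a searchable numeric field, this lets the ranker order
    Cajun/Creole matches by spice level instead of keyword overlap alone.
    """
    hits = sum(1 for r in reviews
               if any(term in r.lower() for term in SPICE_TERMS))
    return hits / len(reviews) if reviews else 0.0

reviews = ["This gumbo has a real kick", "Great service", "Fiery and bold"]
print(round(spiciness_score(reviews), 2))  # 0.67
```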
Why it fails
"Fusion" spanning two specific cuisines (Asian + Latin) is a narrow intersection. Few restaurants in the index explicitly describe themselves this way. Dense retrieval finds Asian restaurants and Latin restaurants separately, but RRF can't fuse the concept of their intersection.
How to fix it
Better HyDE document generation that describes the fusion concept. Could also add multi-cuisine filtering in the analyzer to require both categories.
Why it fails
"Authentic" and "street food" are ambiance/style signals that BM25 matches literally but dense retrieval interprets broadly. The pipeline returns Mexican restaurants but struggles to distinguish "authentic street-style taqueria" from "upscale modern Mexican" — both match semantically.
How to fix it
Fine-grained style attributes (casual/upscale, traditional/modern) extracted during enrichment would help the reranker distinguish these.
Why it fails
"Light" and "fresh vegetables" are health/dietary signals not captured in Yelp categories. The pipeline finds Mediterranean restaurants but can't rank by how vegetable-focused or light their menu is — this information lives in individual reviews and menu items, not in the indexed embedding text.
How to fix it
Enrich with dietary style tags (light, heavy, vegetable-forward) during LLM synthesis. Could also leverage the dishes index to find restaurants with many vegetable-based dishes.
Production Considerations
This evaluation is a proof of concept. Here's what would change at production scale.
30 queries is not enough
Our ablation uses 30 queries with LLM-as-judge grading. Production evaluation needs 1,000+ queries with implicit feedback signals — clicks, order completions, dwell time, and return visits. LLM-as-judge has its own biases (it favors verbose, well-structured descriptions over concise ones).
RRF is optimal now — but not forever
RRF with k=60 (standard literature value) outperforms our learned model because 30 queries isn't enough training data. This is the correct engineering choice at this scale — zero-parameter methods should be the default until you have enough data to justify learned alternatives.
27 seconds is a demo, not production
Current pipeline latency (~27s on CPU) is acceptable for a portfolio demo but unacceptable for real users. Production target would be <500ms for the full pipeline.