
Experiments

Two experiments that close the MLOps loop — from detecting a problem to training a custom model and deciding whether to deploy it.

Why Build a Custom Reranker?

The off-the-shelf cross-encoder (bge-reranker-v2-m3) is a 568M-parameter transformer that scores text relevance. It's excellent at what it does — but it has fundamental limitations for production search.

Text only

Knows nothing about star ratings, review counts, price levels, or whether a restaurant is paying for promotion.

Single objective

Optimizes only for text relevance. Can't balance relevance vs revenue, promotion campaigns, or profit margins.

Slow inference

~800ms for 50 candidates — a transformer forward pass per pair. LambdaMART scores a feature vector in ~5ms.

Multi-Objective Ranking in Real Platforms

In production food delivery, the ranking function isn't just relevance. Real platforms like Uber Eats, DoorDash, and Grubhub blend multiple objectives:

final_score = w1 × text_relevance + w2 × purchase_probability + w3 × profit_margin + w4 × boost_participation + w5 × delivery_time_score + w6 × seller_quality
1. Promoted listings — restaurants pay for visibility. A burger place paying for promotion should appear higher, but only if it's actually relevant to “best burgers near me.”
2. Profit margin — a $50 order from a high-commission restaurant is worth more than a $10 order. The platform optimizes for revenue per impression.
3. User satisfaction — but pushing irrelevant promoted results erodes trust. Users leave. The ranking must enforce a relevance floor.
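To make the blend concrete, here is a minimal sketch of the weighted-sum formula. The weights and signal names are invented for illustration; real platforms learn them from click and purchase data rather than hand-tuning.

```python
# Hypothetical weights, for illustration only; in practice these are
# learned from user behavior data, not set by hand.
WEIGHTS = {
    "text_relevance": 0.45,
    "purchase_probability": 0.20,
    "profit_margin": 0.15,
    "boost_participation": 0.10,
    "delivery_time_score": 0.05,
    "seller_quality": 0.05,
}

def blended_score(signals: dict) -> float:
    """Weighted sum of per-document signals; missing signals count as 0."""
    return sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())
```

A document with high relevance but no purchase or margin signal still ranks on its relevance weight alone, which is why the relevance term usually carries the largest weight.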

Understanding LambdaMART

LambdaMART is a Learning-to-Rank algorithm built on gradient boosted decision trees. It learns to order documents by relevance using a feature vector — not raw text.

LambdaMART

Learning to Rank

What it is

LambdaMART combines two ideas: Lambda (a gradient formulation that directly optimizes ranking metrics like NDCG) and MART (Multiple Additive Regression Trees — boosted decision trees). Each tree corrects the errors of the previous ones. The “lambda” trick computes gradients based on how swapping two documents would change NDCG, rather than optimizing a simpler pointwise loss.
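To make the lambda trick concrete, here is a small NDCG sketch: `delta_ndcg` computes how much the metric would change if two documents swapped positions, which is the quantity that scales each pairwise gradient. Function names are mine, not from any LambdaMART implementation.

```python
import math

def dcg(grades):
    """Discounted cumulative gain for a ranked list of relevance grades."""
    return sum((2**g - 1) / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(grades, reverse=True))
    return dcg(grades) / ideal if ideal > 0 else 0.0

def delta_ndcg(grades, i, j):
    """|NDCG change| from swapping positions i and j: the quantity
    that scales LambdaMART's pairwise gradients (the "lambdas")."""
    swapped = list(grades)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(ndcg(swapped) - ndcg(grades))
```

Because the log-discount shrinks with depth, a swap near the top of the list produces a much larger lambda than the same swap near the bottom, so the model focuses its corrections where they matter to the user.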

Why it's used for search ranking

Unlike neural models that need GPUs and process raw text, LambdaMART works on pre-computed feature vectors and runs on CPU in milliseconds. It's the industry standard for the final ranking stage at companies like Microsoft (Bing), Yahoo, and Airbnb — wherever you need fast, interpretable, multi-signal ranking.

How training works

// Training data: groups of (query, document) pairs with relevance grades
Query "best ramen Philadelphia":
Doc A: [bm25=0.82, dense=0.71, ce=3.2, stars=4.5, ...] grade=2 (perfect)
Doc B: [bm25=0.45, dense=0.63, ce=1.8, stars=3.0, ...] grade=1 (acceptable)
Doc C: [bm25=0.91, dense=0.22, ce=0.5, stars=2.5, ...] grade=0 (irrelevant)
// The model learns: "when ce is high AND stars are high → rank higher"
// Each tree learns to fix the mistakes of the previous trees

Key configuration

LightGBM LambdaRank configuration:
objective: "lambdarank" // optimize for ranking, not regression
metric: "ndcg" // evaluate using NDCG@5 and NDCG@10
num_leaves: 31 // tree complexity (more = more expressive)
learning_rate: 0.1 // step size per tree
feature_fraction: 0.8 // random 80% of features per tree (regularization)
early_stopping: 50 // stop if validation NDCG doesn't improve for 50 rounds
// Query-level train/test split (70/30) — critical to prevent data leakage
// Documents from the same query must stay in the same split
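The query-level split is the easiest part to get wrong, so here is a stdlib-only sketch. The params dict mirrors the configuration above using LightGBM's documented parameter names (you would pass it to `lightgbm.train`; early stopping is supplied separately, e.g. as a callback). The row layout `(query_id, features, grade)` is an assumption for illustration.

```python
import random

# LightGBM LambdaRank parameters mirroring the configuration above.
PARAMS = {
    "objective": "lambdarank",
    "metric": "ndcg",
    "ndcg_eval_at": [5, 10],
    "num_leaves": 31,
    "learning_rate": 0.1,
    "feature_fraction": 0.8,
}

def query_level_split(rows, train_frac=0.7, seed=42):
    """Split (query_id, features, grade) rows so that every document
    from a given query lands in exactly one split. Splitting at the
    row level instead would leak query information into the test set."""
    qids = sorted({qid for qid, _, _ in rows})
    random.Random(seed).shuffle(qids)
    cut = int(len(qids) * train_frac)
    train_qids = set(qids[:cut])
    train = [r for r in rows if r[0] in train_qids]
    test = [r for r in rows if r[0] not in train_qids]
    return train, test
```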

Input vs output

Input: feature vector [26 floats] per (query, document) pair
Output: relevance score (float) — higher = rank higher
Inference: ~5ms for 50 candidates (vs ~800ms for cross-encoder)

Training Pipeline

Step 1: Feature Extraction

For each of the 100 evaluation queries, run all pipeline stages and capture intermediate scores. Join with document metadata and relevance grades to build the feature matrix.

// For each query, capture scores at every pipeline stage:
BM25 retrieval (top 100) → bm25_score per doc
Dense retrieval (top 100) → dense_score per doc
Hybrid + RRF (top 100) → rrf_score per doc
Cross-encoder (top 50) → cross_encoder_score per doc
Elastic metadata → stars, popularity, price, num_categories
Analyzer output → intent one-hot, target one-hot
Boost logic → is_boosted, boost_score
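The join step can be sketched as merging per-stage score dicts into one record per (query, doc). Names are hypothetical; the key detail is defaulting to 0.0 for documents a stage never saw, e.g. a doc retrieved by BM25 but cut before the cross-encoder.

```python
def join_stage_scores(query_id, stages, metadata):
    """Merge per-stage {doc_id: score} dicts into one feature record
    per (query, doc). A doc absent from a stage gets score 0.0."""
    doc_ids = set()
    for scores in stages.values():
        doc_ids.update(scores)
    rows = []
    for doc_id in sorted(doc_ids):
        row = {"query_id": query_id, "doc_id": doc_id}
        for stage, scores in stages.items():
            row[stage] = scores.get(doc_id, 0.0)
        row.update(metadata.get(doc_id, {}))  # stars, popularity, price, ...
        rows.append(row)
    return rows
```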

Step 2: Feature Vector

26 features per (query, document) pair

Retrieval: BM25, dense, sparse, RRF scores
Reranker: cross-encoder score
Metadata: stars, popularity, price
Query: intent type, search target
Business: boost flag, boost score
Label: relevance grade (0/1/2)
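Flattening a joined record into the model's input looks roughly like this. The sketch is abbreviated (the real pipeline emits 26 features) and the intent categories are hypothetical; only the shape of the transformation matters here.

```python
INTENTS = ["dish", "cuisine", "restaurant"]  # hypothetical category set

def to_feature_vector(row):
    """Flatten one (query, doc) record into a flat list of floats.
    Abbreviated here; the real pipeline emits 26 features."""
    one_hot = [1.0 if row["intent"] == i else 0.0 for i in INTENTS]
    return [
        row["bm25"], row["dense"], row["rrf"],          # retrieval scores
        row["cross_encoder"],                           # reranker score
        row["stars"], row["popularity"], row["price"],  # metadata
        *one_hot,                                       # query intent
        float(row["is_boosted"]), row["boost_score"],   # business signals
    ]
```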

Boost Simulation

Simulating sponsored listings: We don't have real advertisers. We mark restaurants with popularity > 0.4 AND stars ≥ 4.0 as “boosted” — a proxy for restaurants that would pay for promotion. This gives ~46% boosted results — a realistic distribution without pretending we have real ad data.
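The proxy rule is a one-liner; the thresholds below are the ones stated above.

```python
def is_boosted(popularity: float, stars: float) -> bool:
    """Proxy for a sponsored listing: a restaurant that is both popular
    and well-rated (popularity > 0.4 AND stars >= 4.0)."""
    return popularity > 0.4 and stars >= 4.0
```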

Results


Relevance vs Revenue Tradeoff

When boosted items enter the ranking, we need to balance two objectives. Sweeping the boost weight (beta) from 0 to 1 reveals the tradeoff curve.
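One way to generate the curve, sketched with made-up candidate scores: for each beta, re-rank by `relevance + beta * boost` and measure how much of the top-k is boosted. A full analysis would also track NDCG per beta; this sketch shows only the revenue side of the tradeoff.

```python
def sweep_beta(candidates, betas, top_k=5):
    """For each beta, re-rank by relevance + beta * boost and report
    the share of boosted items in the top-k. Candidates are
    (relevance, boost) pairs; boost is 0.0 for organic items."""
    curve = []
    for beta in betas:
        ranked = sorted(candidates,
                        key=lambda c: c[0] + beta * c[1], reverse=True)
        boosted_share = sum(1 for _, b in ranked[:top_k] if b > 0) / top_k
        curve.append((beta, boosted_share))
    return curve
```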


Three Practical Approaches

Approach 1

Relevance Floor

A boosted item only gets its boost applied if it passes a minimum relevance score first. Irrelevant items can't buy their way to the top.

if relevance_score >= threshold:
    final_score = relevance_score + beta * boost_score
else:
    final_score = relevance_score  # boost withheld; excluded from boosted slots

When to use: Default approach. Simplest, protects user experience.

Approach 2

Slot Separation

Reserve top 2 positions for boosted results (must pass relevance floor). Positions 3-10 are pure organic ranking. Users know positions 1-2 may be sponsored.

Positions 1-2: Sponsored (relevance floor + labeled)
Position 3-10: Organic (pure relevance ranking)

When to use: When you want clear ad/organic separation.
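A sketch of the slot-separated layout, with hypothetical names and thresholds: boosted items compete for the reserved slots only if they pass the relevance floor, and everything else is ranked purely organically.

```python
def slot_separated_ranking(candidates, floor=0.5, ad_slots=2, page_size=10):
    """Reserve the first `ad_slots` positions for boosted items that
    pass the relevance floor; fill the rest by pure organic ranking.
    Candidates are (doc_id, relevance, is_boosted) tuples."""
    eligible_ads = sorted(
        (c for c in candidates if c[2] and c[1] >= floor),
        key=lambda c: c[1], reverse=True)[:ad_slots]
    taken = {c[0] for c in eligible_ads}
    organic = sorted(
        (c for c in candidates if c[0] not in taken),
        key=lambda c: c[1], reverse=True)
    return eligible_ads + organic[:page_size - len(eligible_ads)]
```

If fewer boosted items pass the floor than there are reserved slots, the unused slots simply fall back to organic results rather than admitting irrelevant ads.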

Approach 3

Multi-Objective Blended Score

Blend all objectives into a single score. Most sophisticated — requires click/purchase signals. LambdaMART is a step toward this: it learns the weights from data instead of setting them manually.

When to use: Large-scale platforms with user behavior data (Alibaba, Uber Eats).

Boost Design Rules

1. Always enforce a relevance floor. Never let irrelevant items buy top positions.
2. Label boosted items transparently. User trust and regulatory compliance.
3. Track NDCG separately for organic vs boosted. Detect if boosting hurts UX.
4. A/B test before tuning weights. Ensure revenue gain doesn't erode satisfaction.
5. Cap boost slots per page. Limit sponsored to 2-3 positions max.