Experiments
Two experiments that close the MLOps loop — from detecting a problem to training a custom model and deciding whether to deploy it.
Why Build a Custom Reranker?
The off-the-shelf cross-encoder (bge-reranker-v2-m3) is a 568M-parameter transformer that scores text relevance. It's excellent at what it does — but it has fundamental limitations for production search.
Feature blindness: Knows nothing about star ratings, review counts, price levels, or whether a restaurant is paying for promotion.
Single objective: Optimizes only for text relevance. Can't balance relevance against revenue, promotion campaigns, or profit margins.
Latency: ~800ms for 50 candidates — a transformer forward pass per pair. LambdaMART scores a feature vector in ~5ms.
Multi-Objective Ranking in Real Platforms
In production food delivery, the ranking function isn't just relevance. Real platforms like Uber Eats, DoorDash, and Grubhub blend multiple objectives:
final_score = w1 × text_relevance
            + w2 × purchase_probability
            + w3 × profit_margin
            + w4 × boost_participation
            + w5 × delivery_time_score
            + w6 × seller_quality

Understanding LambdaMART
LambdaMART is a Learning-to-Rank algorithm built on gradient boosted decision trees. It learns to order documents by relevance using a feature vector — not raw text.
What it is
LambdaMART combines two ideas: Lambda (a gradient formulation that directly optimizes ranking metrics like NDCG) and MART (Multiple Additive Regression Trees — boosted decision trees). Each tree corrects the errors of the previous ones. The “lambda” trick computes gradients based on how swapping two documents would change NDCG, rather than optimizing a simpler pointwise loss.
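The lambda trick above can be illustrated with a small sketch (illustrative only, not the full LambdaMART gradient): compute how much swapping two documents in a ranked list would change NDCG — that delta scales the pairwise gradient.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of relevance grades."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def swap_delta_ndcg(relevances, i, j):
    """|delta-NDCG| from swapping positions i and j -- the quantity that
    scales the pairwise gradient ("lambda") in LambdaMART."""
    ideal = dcg(sorted(relevances, reverse=True))
    if ideal == 0:
        return 0.0
    swapped = list(relevances)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(relevances)) / ideal

# Swapping a highly relevant doc at the bottom with an irrelevant doc
# at the top yields a large delta -> a large corrective gradient.
grades = [0, 1, 3]  # the best doc (grade 3) is ranked last
delta = swap_delta_ndcg(grades, 0, 2)
```

Pairs whose swap would barely move NDCG get tiny gradients; pairs that would fix a badly misplaced document get large ones, which is why the method optimizes the ranking metric directly.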
Why it's used for search ranking
Unlike neural models that need GPUs and process raw text, LambdaMART works on pre-computed feature vectors and runs on CPU in milliseconds. It's the industry standard for the final ranking stage at companies like Microsoft (Bing), Yahoo, and Airbnb — wherever you need fast, interpretable, multi-signal ranking.
Training Pipeline
Step 1: Feature Extraction
For each of the 100 evaluation queries, run all pipeline stages and capture intermediate scores. Join with document metadata and relevance grades to build the feature matrix.
Step 2: Feature Vector
26 features per (query, document) pair
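A minimal sketch of the join described above. The document doesn't list the 26 features, so the names below (bm25_score, vector_score, rerank_score, stars, and so on) are stand-ins, not the actual feature list.

```python
def build_feature_vector(stage_scores: dict, metadata: dict) -> list:
    """Join per-stage retrieval scores with document metadata into one
    flat feature vector for a (query, document) pair. Feature names
    here are illustrative placeholders."""
    return [
        stage_scores.get("bm25_score", 0.0),    # lexical retrieval stage
        stage_scores.get("vector_score", 0.0),  # dense retrieval stage
        stage_scores.get("rerank_score", 0.0),  # cross-encoder stage
        metadata.get("stars", 0.0),
        metadata.get("review_count", 0.0),
        metadata.get("price_level", 0.0),
        float(metadata.get("is_boosted", False)),
    ]

vec = build_feature_vector(
    {"bm25_score": 12.3, "vector_score": 0.81, "rerank_score": 0.92},
    {"stars": 4.5, "review_count": 230, "price_level": 2, "is_boosted": True},
)
```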
Boost Simulation
Simulating sponsored listings: We don't have real advertisers. We mark restaurants with popularity > 0.4 AND stars ≥ 4.0 as “boosted” — a proxy for restaurants that would pay for promotion. This gives ~46% boosted results — a realistic distribution without pretending we have real ad data.
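The boost rule stated above is simple enough to sketch directly; the sample restaurants are made up for illustration.

```python
def is_boosted(popularity: float, stars: float) -> bool:
    """Proxy for sponsored listings: popular, well-rated restaurants
    are the ones that would plausibly pay for promotion."""
    return popularity > 0.4 and stars >= 4.0

restaurants = [
    {"name": "A", "popularity": 0.9, "stars": 4.5},
    {"name": "B", "popularity": 0.3, "stars": 4.8},  # not popular enough
    {"name": "C", "popularity": 0.7, "stars": 3.9},  # rating too low
]
boosted = [r["name"] for r in restaurants
           if is_boosted(r["popularity"], r["stars"])]
```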
Results
Relevance vs Revenue Tradeoff
When boosted items enter the ranking, we need to balance two objectives. Sweeping the boost weight (beta) from 0 to 1 reveals the tradeoff curve.
Three Practical Approaches
Relevance Floor
A boosted item only gets its boost applied if it passes a minimum relevance score first. Irrelevant items can't buy their way to the top.
if relevance >= relevance_floor:
    final_score = relevance + beta * boost
else:
    exclude from boosted slots
When to use: Default approach. Simplest, protects user experience.
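The relevance-floor rule as a runnable sketch; the beta and floor values are illustrative, not the ones used in the experiments.

```python
def score_with_floor(relevance: float, boost: float,
                     beta: float = 0.3, floor: float = 0.5) -> float:
    """Apply the paid boost only to items that clear a minimum
    relevance score. Beta and floor are illustrative defaults."""
    if relevance >= floor:
        return relevance + beta * boost
    return relevance  # irrelevant items can't buy their way up

high = score_with_floor(relevance=0.8, boost=1.0)  # floor passed -> boosted
low = score_with_floor(relevance=0.2, boost=1.0)   # floor failed -> unchanged
```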
Slot Separation
Reserve top 2 positions for boosted results (must pass relevance floor). Positions 3-10 are pure organic ranking. Users know positions 1-2 may be sponsored.
Positions 1-2: Sponsored (boosted items that pass the relevance floor)
Positions 3-10: Organic (pure relevance ranking)
When to use: When you want clear ad/organic separation.
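A sketch of slot separation under the scheme above; the slot count, floor value, and field names are illustrative.

```python
def slot_separated_ranking(results, floor=0.5, sponsored_slots=2,
                           page_size=10):
    """Reserve the top slots for boosted items that clear the relevance
    floor; fill the remaining positions by pure organic ranking."""
    by_relevance = sorted(results, key=lambda r: r["relevance"], reverse=True)
    sponsored = [r for r in by_relevance
                 if r["boosted"] and r["relevance"] >= floor][:sponsored_slots]
    organic = [r for r in by_relevance if r not in sponsored]
    return (sponsored + organic)[:page_size]

page = slot_separated_ranking([
    {"name": "organic-best", "relevance": 0.95, "boosted": False},
    {"name": "ad-good",      "relevance": 0.70, "boosted": True},
    {"name": "ad-bad",       "relevance": 0.10, "boosted": True},
    {"name": "organic-ok",   "relevance": 0.60, "boosted": False},
])
```

Note that the irrelevant ad ("ad-bad") fails the floor, so it falls back into the organic ordering instead of taking a sponsored slot.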
Multi-Objective Blended Score
Blend all objectives into a single score. Most sophisticated — requires click/purchase signals. LambdaMART is a step toward this: it learns the weights from data instead of setting them manually.
When to use: Large-scale platforms with user behavior data (Alibaba, Uber Eats).
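The blended score corresponds to the weighted-sum formula given earlier. A minimal sketch, with placeholder weights — in the multi-objective approach these would be learned from click/purchase data rather than hand-set:

```python
# Placeholder weights for the six objectives from the formula above;
# LambdaMART would learn these from behavior data instead.
WEIGHTS = {
    "text_relevance": 0.5,
    "purchase_probability": 0.2,
    "profit_margin": 0.1,
    "boost_participation": 0.1,
    "delivery_time_score": 0.05,
    "seller_quality": 0.05,
}

def blended_score(signals: dict) -> float:
    """final_score = sum of w_i * signal_i over all objectives."""
    return sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())

score = blended_score({
    "text_relevance": 0.9,
    "purchase_probability": 0.5,
    "profit_margin": 0.3,
    "boost_participation": 1.0,
    "delivery_time_score": 0.8,
    "seller_quality": 0.7,
})
```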