Data Foundation · Taxonomy Design · AI Enablement · Menu Quality · Pathao Food · 2024–25

Pathao Food
Dish Data Cleaning

Building the structured data layer that powers discovery, personalisation, and AI at Pathao Food - from noisy, inconsistent menu data to a clean, machine-readable taxonomy of cuisines, dishes, and intent signals at scale.

100K+
Menu items cleaned and classified
2 layers
Cuisine + Dish taxonomy enforced at scale
6+
Downstream AI and product features unlocked
~0%
Pre-existing structured taxonomy before this work
1st
Data foundation layer for AI in Pathao Food
01 - Context & Background

Pathao Food had a restaurant graph. It didn't have a food graph.

By 2024, Pathao Food had thousands of restaurant partners across Dhaka, Chittagong, and Kathmandu. Each restaurant had uploaded their own menu - item names, prices, descriptions, and photos. The operational infrastructure was solid. The data underneath it was not.

Every menu item had been entered by individual restaurant owners or onboarding ops teams, each with their own conventions, languages, and levels of care. The result was a database of over 100,000 menu items with no shared schema. There was no standard way to answer the most basic question a food discovery product needs to answer: what kind of food is this?

This wasn't just an ops cleanliness problem. It was a structural ceiling on every AI and personalisation initiative Pathao Food wanted to build. You cannot recommend "Bengali cuisine" to a user if the system doesn't know which items are Bengali. You cannot surface "biryani near you" if the system can't distinguish the dish biryani from the restaurant that happens to have "biryani" in its name. The data had to be fixed before any of the roadmap above it could work.

The state of the data
No shared taxonomy
Menu items existed as free-text strings. No cuisine field. No dish category. No normalised naming. "Chicken Biryani", "chkn biryani", "Biriyani (chicken)" - all the same dish, zero way to know that programmatically.
The product ceiling
Discovery was keyword-only
Search and collections worked on exact string matching. "Biryani" returned items containing the word. It couldn't return semantically related items, couldn't surface by cuisine intent, and couldn't personalise by food preference.
The AI blocker
No features to train on
Every ML model Pathao Food wanted to build - recommendation, cuisine affinity, dietary preference detection - required structured labels as features. Without them, there was nothing for a model to learn from.
Strategic context: This was the explicit prerequisite for Pathao Food's AI roadmap. No structured food taxonomy = no cuisine affinity model = no personalised discovery = no AI-powered recommendation engine. Dish cleaning was not a data ops task. It was the foundation layer for every intelligent product capability on the roadmap.

02 - Problem Deep Dive

What the data actually looked like

Core problem
"A food delivery platform with 100,000+ menu items and no structured answer to the question: what cuisine is this, and what dish is this?" Without those two labels, every intelligent feature - search ranking, cuisine filters, personalisation, dietary flags - was either impossible or unreliable.

Five categories of data rot

Problem type 1
Name inconsistency
"Chicken Biryani", "Biriyani Chicken", "chkn briyani", "Special Chicken Biriyani" - four names for one dish. No canonical form, no deduplication signal, no way to group them.
Problem type 2
Cuisine-as-dish confusion
Restaurants tagged items as "Chinese", "Thai", "Mughlai" - which are cuisines, not dishes. The dish field was empty or incorrect. The platform couldn't distinguish what kind of food an item was vs. where it came from.
Problem type 3
Missing classification
Thousands of items had no cuisine or dish label at all. The onboarding flow didn't enforce classification. Restaurants skipped optional fields entirely, and there was no validation on submission.
Problem type 4
Banglish and script mixing
"Khichuri", "khichuri", "Khichri", "Khichudi" - the same Bengali dish in different romanisation schemes. "ভাত" alongside "Rice". Mixed-language entries that no keyword matcher could reliably normalise.
Problem type 5
Fusion ambiguity
"Tandoori Chicken Pizza" - is this Indian? Italian? Both? Items at the intersection of two cuisines had no clear classification path, and existing taxonomies provided no guidance for these cases.

Why this broke product features downstream

Before - what the data looked like
Search for "biryani" returns 400+ results with no ranking signal - keyword match only, relevance undefined
Cuisine filter "Bengali" returns nothing - no item is labelled as Bengali cuisine in the database
Personalisation engine has no food-type features - it can only learn from what restaurant a user ordered from, not what they actually ate
"Thai Soup" classified as cuisine: Thai, dish: empty - the dish field that recommendation models need doesn't exist
New restaurant onboarding creates unclassified items by default - data debt compounds with every new partner
After - what the data enables
Search for "biryani" returns results ranked by dish-type match, cuisine affinity, and user preference history
Cuisine filter "Bengali" surfaces all items with cuisine: Bengali - Tehari, Ilish, Khichuri, Fuchka, cross-restaurant
Personalisation model can learn cuisine affinity and dish-type preference - "this user orders Bengali dishes at lunch"
"Thai Soup" correctly labelled cuisine: Thai, dish: Soup - both fields populated and machine-readable
Onboarding now enforces classification - new restaurant items enter the system clean from day one

03 - Taxonomy Design

The two-layer model: Cuisine and Dish

The first design decision was the most important: every menu item needs exactly two labels - a Cuisine label and a Dish label. These are not interchangeable. They answer different questions and serve different product purposes. Getting this distinction right was the foundational intellectual work of the entire project.

Cuisine - the cooking tradition or regional origin of the food. It tells you where the food style comes from. A cuisine can contain hundreds of dishes. Examples: Bengali, Thai, Chinese, Mughlai, Continental, Fusion.

Dish - the specific food item being prepared and served. It tells you what the food is. A dish belongs to one or more cuisines. Examples: Biryani, Soup, Shawarma, Fried Rice, Naan, Risotto.
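As a rough illustration of the two-label model, the sketch below stores both labels on one record and rejects values outside a controlled vocabulary. The class name `ItemLabel`, the field names, and the small `CUISINES` and `DISHES` sets are assumptions made for this example; they are not the production schema.

```python
from dataclasses import dataclass

# Hypothetical subsets of the taxonomy, taken from the examples above.
CUISINES = {"Bengali", "Thai", "Chinese", "Mughlai", "Continental", "Fusion"}
DISHES = {"Biryani", "Soup", "Shawarma", "Fried Rice", "Naan", "Risotto"}


@dataclass(frozen=True)
class ItemLabel:
    """The two mandatory labels attached to one menu item."""
    raw_name: str  # name exactly as the restaurant entered it
    cuisine: str   # where the food style comes from
    dish: str      # what the food is

    def __post_init__(self):
        # Both labels are required and must come from the taxonomy.
        if self.cuisine not in CUISINES:
            raise ValueError(f"unknown cuisine: {self.cuisine!r}")
        if self.dish not in DISHES:
            raise ValueError(f"unknown dish: {self.dish!r}")


label = ItemLabel(raw_name="Thai Soup", cuisine="Thai", dish="Soup")
```

Keeping the two fields separate is what prevents a cuisine word such as "Thai" from ending up in the dish slot.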

The taxonomy structure

Pathao Food item taxonomy - two mandatory labels per item

Bengali → Tehari, Ilish Bhaji, Fuchka, Khichuri, Panta Ilish
Thai → Soup, Curry, Fried Rice, Noodles
Mughlai → Biryani, Rezala, Korma, Naan
Fusion → Tandoori Pizza, Korean BBQ Burger, Desi Pasta

Cuisine taxonomy - defined categories

Regional / Heritage
Origin-based cuisines
Bengali, Indian, Mughlai, Thai, Chinese, Japanese, Korean, Italian, Middle Eastern, Mexican, Continental. Each maps to a geographic or cultural cooking tradition.
Style-based
Cooking-style cuisines
Fast Food, Street Food, BBQ & Grill. Defined not by geography but by preparation style and consumption context - applicable across cultural origins.
Cross-origin
Fusion cuisine
Items that intentionally blend two distinct culinary traditions. "Tandoori Chicken Pizza" is neither purely Italian nor purely Indian - it is Fusion. A specific, defined category, not a catch-all.
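One way to keep Fusion from turning into a catch-all is to validate it against the two criteria this case study sets for it: a deliberate blend, and two identifiable parent cuisines. The function below is a minimal sketch under that assumption; the name `is_valid_fusion` and its parameters are hypothetical.

```python
from typing import Iterable


def is_valid_fusion(parent_cuisines: Iterable[str], deliberate_blend: bool) -> bool:
    """Return True only when a Fusion label meets both criteria."""
    # Ignore blanks and "Fusion" itself; duplicates collapse in the set.
    parents = {c for c in parent_cuisines if c and c != "Fusion"}
    return deliberate_blend and len(parents) == 2


# "Tandoori Chicken Pizza": Indian topping style on an Italian dish format.
print(is_valid_fusion(["Indian", "Italian"], deliberate_blend=True))
```

A labeller who is merely unsure of the cuisine cannot name two parent traditions, so the check fails and the item goes back for review.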

Dish taxonomy - defined categories

Rice-based
Biryani, Tehari, Fried Rice, Khichuri, Pilau, Panta
Rice as the primary ingredient. Preparation method and accompanying items define the specific dish type.
Bread-based
Naan, Paratha, Roti, Sandwich, Wrap, Toast
Bread or dough as the primary vehicle. Includes flatbreads and western sandwich forms.
Protein-based
Curry, Roast, Grilled, Tandoori, Kebab, Rezala, Bhuna
Meat, seafood, or legume as the core. Cooking method (grilled vs. curried vs. roasted) defines the sub-category.
Noodle / Pasta
Noodles, Chow Mein, Ramen, Pasta, Risotto
Carbohydrate-based dishes where noodle or pasta is the primary element, distinct from rice-based dishes.
Snack / Street
Fuchka, Spring Roll, Samosa, Shawarma, Tacos
Portable, hand-held, or street-food-context items. Defined by consumption context as much as preparation.
Liquid / Bowl
Soup, Salad, Haleem, Dal
Liquid-primary or loose-form dishes. Bowl-format items consumed with a spoon rather than hands or bread.

04 - Classification Rules

The decision logic that makes the taxonomy consistent

A taxonomy is only as good as the rules that enforce it. Anyone labelling an item - whether an ops team member, a restaurant owner, or a machine learning model - needs an unambiguous decision path. I designed a four-question decision tree and a set of conflict-resolution rules for every edge case.

The four-question decision tree

Q1
Does the name refer to a region, country, or cooking tradition?
If yes → it's a Cuisine identifier, not a Dish name. "Thai", "Chinese", "Bengali", "Mughlai", "Italian" are cuisines. They go in the Cuisine field. The Dish field still needs to be filled separately. Example: "Thai Soup" → Cuisine: Thai, Dish: Soup.
Q2
Does the name describe a specific preparation that can be eaten as a complete item?
If yes → it contains a Dish identifier. Look for preparation and dish words: fried, grilled, curry, soup, roll, wrap, biryani, tehari. These indicate what the food is. "Chicken Biryani" → Dish: Biryani. "Grilled Chicken Salad" → Dish: Grilled Chicken Salad, Cuisine: Continental.
Q3
Does the name contain both a cuisine signal AND a dish signal?
Split them. "Korean Kimchi Fried Rice" → Cuisine: Korean, Dish: Kimchi Fried Rice. "Mughlai Chicken" → Cuisine: Mughlai, Dish: Curry (the name gives only the protein, so the preparation is inferred from context). The dish label captures what it is; the cuisine label captures where it's from.
Q4
Does the name combine elements from two distinct culinary traditions intentionally?
If yes → Cuisine: Fusion. This is the category for deliberate hybrids - "Tandoori Chicken Pizza", "Desi Pasta", "Korean BBQ Burger". Fusion is not "I don't know the cuisine" - it's a positive classification for a specific type of cross-cultural dish.
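The four questions can be approximated in code as keyword lookups. The sketch below is illustrative only: `TRADITION_HINTS` and `DISH_PHRASES` are small hypothetical word lists, and treating any two tradition hints as Fusion is a simplification of Q4, which in practice also requires the blend to be deliberate.

```python
import re

# Hypothetical hints: a word in the name -> the culinary tradition it signals (Q1).
TRADITION_HINTS = {
    "thai": "Thai", "chinese": "Chinese", "bengali": "Bengali",
    "mughlai": "Mughlai", "italian": "Italian", "korean": "Korean",
    "kimchi": "Korean", "tandoori": "Indian", "pizza": "Italian",
}
# Hypothetical dish phrases (Q2), longest first so specific names win.
DISH_PHRASES = sorted(
    ["kimchi fried rice", "fried rice", "biryani", "tehari", "soup", "pizza"],
    key=len, reverse=True,
)


def classify(name):
    """Return (cuisine, dish); None means the name alone is not enough."""
    text = name.lower()
    tokens = re.findall(r"[a-z]+", text)
    # Q2: what the food is.
    dish = next((d.title() for d in DISH_PHRASES if d in text), None)
    # Q1 and Q3: which traditions the name signals.
    traditions = {TRADITION_HINTS[t] for t in tokens if t in TRADITION_HINTS}
    if len(traditions) >= 2:
        cuisine = "Fusion"                 # Q4: two distinct traditions
    elif traditions:
        cuisine = next(iter(traditions))
    else:
        cuisine = None                     # needs restaurant context or review
    return cuisine, dish


for item in ["Thai Soup", "Korean Kimchi Fried Rice", "Tandoori Chicken Pizza"]:
    print(item, "->", classify(item))
```

Names with no cuisine signal, such as "Chicken Biryani", come back with a dish but no cuisine, which is the point at which restaurant context or a human labeller takes over.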

Classification examples - applied to real menu data

Menu item as entered | Dish name | Cuisine | Classification signal
Chicken Biryani | Biryani | Indian / Mughlai | "Biryani" is the dish. Indian subcontinent origin.
Beef Tehari | Tehari | Bengali | "Tehari" is a specific Bengali rice dish with beef.
Thai Soup | Soup | Thai | "Thai" = cuisine. "Soup" = dish. Both extracted from the name.
Chicken Pasta Alfredo | Pasta Alfredo | Italian | Alfredo = Italian preparation. Pasta = dish type.
Vegetable Fried Rice | Fried Rice | Chinese | "Fried Rice" = dish. Chinese cooking method and flavour profile.
Chicken Shawarma Wrap | Shawarma | Middle Eastern | "Shawarma" is a named Middle Eastern dish. "Wrap" is the format.
Mutton Rezala | Rezala | Mughlai | "Rezala" is a defined Mughlai curry preparation. Not generic curry.
Fuchka with Tamarind Water | Fuchka | Bengali | "Fuchka" is a specific Bengali street food. Tamarind water is the accompaniment.
Tandoori Chicken Pizza | Pizza | Fusion | Italian dish format (pizza) + Indian topping style (tandoori). Deliberate fusion.
Egg Toast Sandwich | Sandwich | Continental | Sandwiches are Western/Continental. Egg toast = preparation variant.
Panta Ilish | Panta Ilish | Bengali | Fermented rice with hilsha - a complete Bengali cultural dish. Name = dish name.
Creamy Mushroom Risotto | Risotto | Italian | "Risotto" is a named Italian dish. Mushroom is the variant.
chkn briyani | Biryani | Indian / Mughlai | Typo-normalised to canonical form "Chicken Biryani". Classified as above.
Korean Kimchi Fried Rice | Kimchi Fried Rice | Korean | "Korean" = cuisine. "Kimchi Fried Rice" = specific dish. Both extracted.

05 - Cleaning Process

How 100,000+ items were cleaned at scale

The taxonomy design solved the "what to label" question. The process design solved the "how to label at scale" question. A purely manual process would have taken months and produced inconsistent results across labellers. A purely automated process would have missed the ambiguous cases that require contextual judgement. The answer was a hybrid pipeline.

1
Audit - understand the scope before designing the solution
Pulled a stratified sample of 2,000 items across restaurant types, cities, and cuisine categories. Classified each manually against a draft taxonomy. This audit revealed the five problem types (name inconsistency, cuisine-as-dish confusion, missing labels, Banglish variation, fusion ambiguity) and shaped the final taxonomy design. The process was designed around what the data actually looked like - not an assumed ideal state.
2
Rule-based auto-classification for high-confidence items
Built a rule engine covering the most common patterns - items containing "Biryani", "Tehari", "Fried Rice", "Shawarma", "Ramen", "Pizza" and their known Banglish variants. These patterns covered ~60% of the item catalogue with high confidence. Each match was auto-labelled with a confidence score and flagged for spot-check rather than full manual review.
3
Labeller guidelines and training for human review of ambiguous items
Created a labelling guide (the taxonomy document shared with ops) with the four-question decision tree, 20+ worked examples across cuisine and dish types, explicit handling instructions for Banglish names, and a Fusion classification checklist. The ops team was trained on the guide before labelling the ambiguous ~40% of items that the rule engine could not classify with confidence.
4
Banglish normalisation pass
Separate pass specifically for Banglish variants - mapping "khichudi", "khichri", "Khichuri", "ক্ষিচুড়ি" to the canonical dish label "Khichuri". This normalisation table was built collaboratively with Bangladeshi team members who had domain knowledge of regional spelling variants. The canonical form was set as the searchable and displayable label.
5
QA - inter-labeller agreement check
A 5% random sample of manually labelled items was re-labelled by a second labeller blind to the first label, and the agreement rate was measured. Items with disagreement were escalated for a tie-breaking review. This produced a quality metric for the cleaned dataset and identified ambiguous categories that needed additional rule clarification.
6
Onboarding enforcement - prevent new data rot
Modified the restaurant onboarding flow to make Cuisine and Dish classification mandatory fields with a constrained dropdown (not free text). New items now enter the system pre-classified. The dropdown options are the taxonomy categories. Restaurants can request a new category via a flagging mechanism, reviewed monthly by the product team.
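Steps 2 and 4 above can be sketched together as a lookup against a normalisation table followed by confidence-based routing. Everything named here is an assumption for illustration: the `VARIANTS` table is a tiny hypothetical sample, the confidence values and `REVIEW_THRESHOLD` are made up, and the tokenizer handles only Latin-script spellings, so Bengali-script entries would need separate handling.

```python
import re

# Hypothetical normalisation table: known spelling variant -> canonical dish label.
VARIANTS = {
    "khichuri": "Khichuri", "khichri": "Khichuri", "khichudi": "Khichuri",
    "biryani": "Biryani", "biriyani": "Biryani", "briyani": "Biryani",
    "tehari": "Tehari", "shawarma": "Shawarma", "ramen": "Ramen", "pizza": "Pizza",
}
REVIEW_THRESHOLD = 0.8  # assumed cut-off between spot-check and manual review


def auto_label(name):
    """Return (dish, confidence) from pattern matches in the item name."""
    tokens = re.findall(r"[a-z]+", name.lower())
    hits = {VARIANTS[t] for t in tokens if t in VARIANTS}
    if len(hits) == 1:
        return hits.pop(), 0.9  # one unambiguous pattern (illustrative score)
    if len(hits) > 1:
        return None, 0.4        # conflicting patterns, leave to a human
    return None, 0.0            # no known pattern


def route(name):
    """Send confident labels to spot-check, everything else to manual review."""
    dish, confidence = auto_label(name)
    confident = dish is not None and confidence >= REVIEW_THRESHOLD
    queue = "spot_check" if confident else "manual_review"
    return {"name": name, "dish": dish, "confidence": confidence, "queue": queue}


print(route("chkn briyani"))
print(route("Chef's Special Platter"))
```

The design choice worth noting is that the rule engine never forces a label: anything it cannot match cleanly is routed to the human queue rather than guessed.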

06 - Edge Cases & Hard Decisions

The cases that challenged the taxonomy

A taxonomy is only trustworthy if it has documented answers for the hard cases. These were the most contested classification decisions, and the reasoning behind each one.

Hard case 1
Butter Naan with Chicken Korma
Decision: Cuisine: Mughlai. Dish: Chicken Korma. Naan is the accompaniment, not the primary dish. The meal-defining item is the Korma. Accompaniments are not classified as the dish when a more specific protein/main dish is named alongside them.
Hard case 2
Prawn Curry
Decision: Dish: Curry. Cuisine: depends on preparation - mustard-based prawn curry = Bengali. Coconut-based = Indian. When the preparation style is unspecified, cuisine is assigned based on the restaurant's primary cuisine profile. Context wins over name alone.
Hard case 3
Paneer Butter Masala
Decision: Cuisine: Indian (North Indian specifically). Dish: Paneer Butter Masala - the full name is the dish name here, as it is a named and widely recognised specific curry. Not just "Curry". The specificity of the named preparation takes precedence.
Hard case 4
Falafel Wrap with Hummus
Decision: Cuisine: Middle Eastern. Dish: Falafel Wrap. Hummus is the accompaniment. "Falafel Wrap" is the primary food item. The "with X" construction subordinates the second item to an accompaniment role unless the first item is itself an accompaniment (as with the naan in Hard case 1) - it does not change the dish classification.
Hard case 5
Fast Food classification
Decision: Fast Food is a style-based cuisine, not a residual category. "Hot N' Crispy Chicken" = Cuisine: Fast Food, Dish: Fried Chicken. Fast Food has its own preparation conventions (deep-fried, battered, quick-serve) that constitute a distinct culinary identity regardless of origin.
Hard case 6
Pizza - Italian or Fast Food?
Decision: Origin = Italian. But in most Bangladeshi menu contexts, Pizza is served as Fast Food. Classification uses restaurant context as the tiebreaker - a pizza from a traditional Italian restaurant = Italian. A pizza from a fast food cloud kitchen = Fast Food. Context over strict origin.
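Two of the rules in these hard cases lend themselves to a short sketch: picking the primary item out of a "with X" name (Hard cases 1 and 4) and using restaurant context as a cuisine tiebreaker (Hard cases 2 and 6). The `ACCOMPANIMENTS` set and both function names are hypothetical, and returning `None` when the restaurant profile does not settle the question is an assumption of this example, not a documented rule.

```python
import re

# Hypothetical list of items that act as accompaniments rather than mains.
ACCOMPANIMENTS = {"naan", "butter naan", "hummus", "raita"}


def primary_item(name):
    """Return the part of an 'A with B' name that defines the dish."""
    parts = re.split(r"\s+with\s+", name, maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 1:
        return name
    first, second = parts
    # Default: B is the accompaniment. Exception: A is itself an accompaniment.
    return second if first.lower() in ACCOMPANIMENTS else first


def resolve_cuisine(candidates, restaurant_cuisine):
    """Use the restaurant's primary cuisine to break a tie between candidates."""
    if restaurant_cuisine in candidates:
        return restaurant_cuisine
    return None  # context does not decide it; escalate to a labeller


print(primary_item("Butter Naan with Chicken Korma"))
print(primary_item("Falafel Wrap with Hummus"))
print(resolve_cuisine(["Italian", "Fast Food"], "Fast Food"))
```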

07 - AI & Product Downstream

What this data layer unlocks

The cleaned taxonomy is not a finished product - it is infrastructure. Its value is entirely in what becomes possible on top of it. This is the concrete list of product capabilities that were either unblocked or significantly improved by the existence of clean, structured food data.

🧠
Cuisine Affinity Model
Train a user-level cuisine preference model: "this user orders Bengali cuisine 60% of the time at lunch." Cuisine label is the primary feature. Impossible without it.
Enabled by this work
🔍
Intent-Aware Search
Search for "biryani" returns all items with Dish: Biryani - regardless of how the restaurant named it. Semantic search grounded in taxonomy, not keyword matching.
Enabled by this work
🗂️
Cuisine-Based Collections
Homepage collections like "Bengali Classics" or "Quick Bites" can now be generated automatically from cuisine and dish labels - not manually curated lists that go stale.
Enabled by this work
🎯
Personalised Discovery
Surface restaurants to a user based on their cuisine and dish-type preference history - not just proximity or popularity. "You usually order Thai on Fridays - here are options near you."
Enabled by this work
🥗
Dietary Preference Detection
Dish labels enable vegetarian/vegan/halal detection by type. "Vegetable Spring Roll" (Dish: Spring Roll) + ingredient context = reliably flagged as vegetarian. Protein labels enable meat-type filtering.
Planned - next phase
📊
Restaurant Benchmarking
Compare a restaurant's menu coverage against cuisine peers. "Your Bengali menu has 12 items but competitors average 24 - here are the most-ordered missing dishes in your category." Ops and partner intelligence tool.
Planned - next phase
The AI compounding effect: Each new piece of user behaviour data (an order placed, a search run, a collection tapped) is now tagged with structured cuisine and dish labels at the moment it's generated. This means the training dataset for every future model grows richer with every order placed - structured labels accumulate automatically. The taxonomy doesn't just unlock current features; it continuously improves the quality of all future models.
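To make the cuisine affinity idea concrete, the sketch below computes the share of each cuisine in a user's orders per meal slot, which is the kind of figure behind a statement like "this user orders Bengali cuisine 60% of the time at lunch". The order log, the tuple layout, and the function name are invented for illustration; a production model would likely be more involved than simple shares.

```python
from collections import Counter, defaultdict

# Hypothetical order log: (user_id, meal_slot, cuisine label of the ordered item).
orders = [
    ("u1", "lunch", "Bengali"), ("u1", "lunch", "Bengali"), ("u1", "lunch", "Bengali"),
    ("u1", "lunch", "Thai"), ("u1", "lunch", "Chinese"),
    ("u1", "dinner", "Thai"),
]


def cuisine_affinity(order_log):
    """Return {(user, slot): {cuisine: share of that user's orders in that slot}}."""
    counts = defaultdict(Counter)
    for user, slot, cuisine in order_log:
        counts[(user, slot)][cuisine] += 1
    return {
        key: {cuisine: n / sum(counter.values()) for cuisine, n in counter.items()}
        for key, counter in counts.items()
    }


affinity = cuisine_affinity(orders)
print(affinity[("u1", "lunch")]["Bengali"])  # 3 of 5 lunch orders
```

Without the cuisine label on each ordered item, the only grouping key available would be the restaurant, which is the limitation described in the "Before" state.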

08 - Success Metrics

How success was defined and measured

Coverage
% of active menu items with both Cuisine + Dish labels
95%+
Target coverage of active item catalogue
Quality
Inter-labeller agreement rate on QA sample
90%+
Agreement threshold for taxonomy trustworthiness
Prevention
New items entering system without classification
0%
Post-onboarding enforcement launch
Search impact
Cuisine-filtered search result relevance score
Baseline → track
Pre/post taxonomy deployment comparison
AI readiness
Feature completeness for cuisine affinity model
100%
All required training features available post-cleaning
Maintenance
Monthly unclassified item backlog
< 500
Ongoing - reviewed and cleared monthly
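The coverage and quality metrics above reduce to two simple ratios. The sketch below shows one way they might be computed; the sample records and function names are hypothetical, and the agreement figure is raw percent agreement rather than a chance-corrected statistic.

```python
def coverage(items):
    """Share of items that carry both a cuisine and a dish label."""
    if not items:
        return 0.0
    labelled = sum(1 for it in items if it.get("cuisine") and it.get("dish"))
    return labelled / len(items)


def agreement_rate(first_pass, second_pass):
    """Share of QA-sample items where both labellers chose the same labels."""
    if not first_pass:
        return 0.0
    matches = sum(a == b for a, b in zip(first_pass, second_pass))
    return matches / len(first_pass)


# Hypothetical catalogue slice: three of four items are fully labelled.
items = [
    {"name": "Beef Tehari", "cuisine": "Bengali", "dish": "Tehari"},
    {"name": "Thai Soup", "cuisine": "Thai", "dish": "Soup"},
    {"name": "Mutton Rezala", "cuisine": "Mughlai", "dish": "Rezala"},
    {"name": "Special Platter", "cuisine": None, "dish": None},
]
# Hypothetical QA sample: (cuisine, dish) from each labeller, same item order.
labeller_a = [("Bengali", "Tehari"), ("Thai", "Soup"), ("Italian", "Pizza"), ("Mughlai", "Rezala")]
labeller_b = [("Bengali", "Tehari"), ("Thai", "Soup"), ("Fast Food", "Pizza"), ("Mughlai", "Rezala")]

print(coverage(items), agreement_rate(labeller_a, labeller_b))
```

In this sample the one disagreement is the pizza, the same Italian versus Fast Food ambiguity that the edge-case rules address.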

09 - Risks & Mitigations

What could go wrong

High Risk
Taxonomy drift - categories become stale as menus evolve
New food trends (Korean fried chicken, cloud kitchen fusions) emerge faster than taxonomy review cycles. Mitigated by a monthly new-category review process and a restaurant-facing "flag unknown category" mechanism that surfaces gaps proactively.
High Risk
Labeller inconsistency at scale
Human labellers make different judgements on ambiguous items without explicit guidance. Mitigated by the decision tree documentation, worked examples for every edge case type, and the inter-labeller QA check that catches systematic disagreements before they propagate.
Medium Risk
Restaurant resistance to classification enforcement
Restaurant owners may see the mandatory classification step as friction in onboarding. Mitigated by making the dropdown UI fast (type-ahead search), pre-populating common choices based on restaurant name/type, and framing classification as a discoverability benefit for the restaurant - "your items appear in cuisine-filtered searches".
Medium Risk
Banglish coverage gaps
The Banglish normalisation table covers known variants but will miss new or hyperlocal spelling forms. Mitigated by logging unmatched search queries - a user searching "khichudi" who gets zero results surfaces a coverage gap that feeds back into the normalisation dictionary.
Low Risk
AI model overfits to early taxonomy choices
If the taxonomy is later revised, models trained on old labels need retraining. Mitigated by versioning taxonomy definitions and logging the taxonomy version against every labelled item - model retraining scope is queryable when the taxonomy changes.
Low Risk
Over-classification of Fusion
"Fusion" becoming a catch-all for "I'm not sure" rather than a positive classification of genuine hybrids. Mitigated by the explicit Fusion checklist (must meet both criteria: deliberate blend, two identifiable distinct culinary traditions) and the requirement that labellers cannot select Fusion without marking both parent cuisines.

10 - Lessons & Reflections

What this work taught me about data as a product

01
Data quality is a product problem, not a data engineering problem
The dish cleaning initiative only happened because a PM owned it. Data teams can build pipelines; they cannot define what "correct" means for a food taxonomy. The definition of correct is a product decision - it depends on how the data will be used, what features it needs to power, and what user experiences it needs to enable. A PM has to own that definition or it won't be made correctly.
02
The taxonomy is a product, and it needs a PRD
The labelling guide I wrote for ops wasn't just documentation - it was a product specification. Every edge case decision in it was a product decision with downstream consequences. Treating the taxonomy document with the same rigour as a feature PRD - with explicit decision logic, worked examples, and conflict-resolution rules - was what made the cleaned data trustworthy enough to build AI features on.
03
Prevention is worth more than cleaning
The 100,000-item backlog took months to clean. The onboarding enforcement change that prevents new unclassified items takes minutes per new restaurant, and its benefit accrues with every restaurant onboarded afterwards. The ratio of investment to return is completely inverted between retrospective cleaning and prospective enforcement. Cleaning the backlog was necessary; the onboarding fix was the real long-term solution.
04
Edge cases are not exceptions - they are the rule
In food data, the "easy" cases (Chicken Biryani → Mughlai / Biryani) are a minority. The majority of decisions involve regional variants, Banglish spelling, ambiguous fusion items, accompaniment-vs-main confusion, and context-dependent cuisine assignment. A taxonomy that only handles the easy cases is a taxonomy that produces unreliable data. The value is in the documented edge case decisions.
05
Structured data is the moat, not the model
Any company can access the same ML frameworks and foundation models. What they cannot copy is years of structured, labelled, domain-specific food data built on a well-designed taxonomy with Bangladeshi culinary context encoded into every decision. The real competitive advantage in AI is the training data quality, not the model architecture. This is what dish cleaning was actually building - a data moat.