Pathao Food Dish Data Cleaning - Case Study

01 - Context & Background

Pathao Food had a restaurant graph. It didn't have a food graph.

By 2024, Pathao Food had thousands of restaurant partners across Dhaka, Chittagong, and Kathmandu. Each restaurant had uploaded its own menu - item names, prices, descriptions, and photos. The operational infrastructure was solid. The data underneath it was not.

Every menu item had been entered by individual restaurant owners or onboarding ops teams, each with their own conventions, languages, and levels of care. The result was a database of over 100,000 menu items with no shared schema. There was no standard way to answer the most basic question a food discovery product needs to answer: what kind of food is this?

This wasn't just an ops cleanliness problem. It was a structural ceiling on every AI and personalisation initiative Pathao Food wanted to build. You cannot recommend "Bengali cuisine" to a user if the system doesn't know which items are Bengali. You cannot surface "biryani near you" if the system can't distinguish the dish biryani from the restaurant that happens to have "biryani" in its name. The data had to be fixed before any of the roadmap above it could work.

The state of the data

No shared taxonomy

Menu items existed as free-text strings. No cuisine field. No dish category. No normalised naming. "Chicken Biryani", "chkn biryani", "Biriyani (chicken)" - all the same dish, zero way to know that programmatically.

The product ceiling

Discovery was keyword-only

Search and collections worked on literal keyword matching. "Biryani" returned any item whose text contained the word. Search couldn't return semantically related items, couldn't surface items by cuisine intent, and couldn't personalise by food preference.

The AI blocker

No features to train on

Every ML model Pathao Food wanted to build - recommendation, cuisine affinity, dietary preference detection - required structured labels as features. Without them, there was nothing for a model to learn from.

Strategic context: This was the explicit prerequisite for Pathao Food's AI roadmap. No structured food taxonomy means no cuisine affinity model, which means no personalised discovery, which means no AI-powered recommendation engine. Dish cleaning was not a data ops task. It was the foundation layer for every intelligent product capability on the roadmap.

02 - Problem Deep Dive

What the data actually looked like

Core problem

"A food delivery platform with 100,000+ menu items and no structured answer to the question: what cuisine is this, and what dish is this?" Without those two labels, every intelligent feature - search ranking, cuisine filters, personalisation, dietary flags - was either impossible or unreliable.

Five categories of data rot

Problem type 1

Name inconsistency

"Chicken Biryani", "Biriyani Chicken", "chkn briyani", "Special Chicken Biriyani" - four names for one dish. No canonical form, no deduplication signal, no way to group them.

Problem type 2

Cuisine-as-dish confusion

Restaurants tagged items as "Chinese", "Thai", "Mughlai" - which are cuisines, not dishes. The dish field was empty or incorrect. The platform couldn't distinguish what kind of food an item was vs. where it came from.

Problem type 3

Missing classification

Thousands of items had no cuisine or dish label at all. The onboarding flow didn't enforce classification. Restaurants skipped optional fields entirely, and there was no validation on submission.

Problem type 4

Banglish and script mixing

"Khichuri", "khichuri", "Khichri", "Khichudi" - the same Bengali dish in different romanisation schemes. "ভাত" alongside "Rice". Mixed-language entries that no keyword matcher could reliably normalise.

Problem type 5

Fusion ambiguity

"Tandoori Chicken Pizza" - is this Indian? Italian? Both? Items at the intersection of two cuisines had no clear classification path, and existing taxonomies provided no guidance for these cases.

Why this broke product features downstream

Before - what the data looked like

✗ Search for "biryani" returns 400+ results with no ranking signal - keyword match only, relevance undefined

✗ Cuisine filter "Bengali" returns nothing - no item is labelled as Bengali cuisine in the database

✗ Personalisation engine has no food-type features - it can only learn from what restaurant a user ordered from, not what they actually ate

✗ "Thai Soup" classified as cuisine: Thai, dish: empty - the dish field that recommendation models need doesn't exist

✗ New restaurant onboarding creates unclassified items by default - data debt compounds with every new partner

After - what the data enables

✓ Search for "biryani" returns results ranked by dish-type match, cuisine affinity, and user preference history

✓ Cuisine filter "Bengali" surfaces all items with cuisine: Bengali - Tehari, Ilish, Khichuri, Fuchka, cross-restaurant

✓ Personalisation model can learn cuisine affinity and dish-type preference - "this user orders Bengali dishes at lunch"

✓ "Thai Soup" correctly labelled cuisine: Thai, dish: Soup - both fields populated and machine-readable

✓ Onboarding now enforces classification - new restaurant items enter the system clean from day one

03 - Taxonomy Design

The two-layer model: Cuisine and Dish

The first design decision was the most important: every menu item needs exactly two labels - a Cuisine label and a Dish label. These are not interchangeable. They answer different questions and serve different product purposes. Getting this distinction right was the foundational intellectual work of the entire project.

Cuisine - the cooking tradition or regional origin of the food. It tells you where the food style comes from. A cuisine can contain hundreds of dishes. Examples: Bengali, Thai, Chinese, Mughlai, Continental, Fusion.

Dish - the specific food item being prepared and served. It tells you what the food is. A dish belongs to one or more cuisines. Examples: Biryani, Soup, Shawarma, Fried Rice, Naan, Risotto.

The taxonomy structure

Pathao Food item taxonomy - two mandatory labels per item

Bengali: Tehari, Ilish Bhaji, Fuchka, Khichuri, Panta Ilish

Thai: Soup, Curry, Fried Rice, Noodles

Mughlai: Biryani, Rezala, Korma, Naan

Fusion: Tandoori Pizza, Korean BBQ Burger, Desi Pasta
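
The cuisine-to-dish structure above can be written down as a small lookup table. This is a minimal sketch, not the production schema: the names `TAXONOMY`, `LabelledItem`, `is_known_pair`, and `cuisines_for` are assumptions made for illustration, and only the cuisines and dishes listed above are included.

```python
from dataclasses import dataclass

# Cuisine -> dishes, copied from the structure listed above (illustrative subset).
TAXONOMY: dict[str, set[str]] = {
    "Bengali": {"Tehari", "Ilish Bhaji", "Fuchka", "Khichuri", "Panta Ilish"},
    "Thai": {"Soup", "Curry", "Fried Rice", "Noodles"},
    "Mughlai": {"Biryani", "Rezala", "Korma", "Naan"},
    "Fusion": {"Tandoori Pizza", "Korean BBQ Burger", "Desi Pasta"},
}


@dataclass(frozen=True)
class LabelledItem:
    """A menu item carrying the two mandatory labels (hypothetical record shape)."""
    raw_name: str
    cuisine: str
    dish: str


def is_known_pair(item: LabelledItem) -> bool:
    """True when the cuisine exists and lists the dish under it."""
    return item.dish in TAXONOMY.get(item.cuisine, set())


def cuisines_for(dish: str) -> list[str]:
    """A dish may sit under more than one cuisine; return all of them, sorted."""
    return sorted(c for c, dishes in TAXONOMY.items() if dish in dishes)
```

Keeping the table keyed by cuisine mirrors how the structure is presented here; a reverse lookup such as `cuisines_for` covers the "one or more cuisines" case without a second table.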

Cuisine taxonomy - defined categories

Regional / Heritage

Origin-based cuisines

Bengali, Indian, Mughlai, Thai, Chinese, Japanese, Korean, Italian, Middle Eastern, Mexican, Continental. Each maps to a geographic or cultural cooking tradition.

Style-based

Cooking-style cuisines

Fast Food, Street Food, BBQ & Grill. Defined not by geography but by preparation style and consumption context - applicable across cultural origins.

Cross-origin

Fusion cuisine

Items that intentionally blend two distinct culinary traditions. "Tandoori Chicken Pizza" is neither purely Italian nor purely Indian - it is Fusion. A specific, defined category, not a catch-all.

Dish taxonomy - defined categories

Rice-based

Biryani, Tehari, Fried Rice, Khichuri, Pilau, Panta

Rice as the primary ingredient. Preparation method and accompanying items define the specific dish type.

Bread-based

Naan, Paratha, Roti, Sandwich, Wrap, Toast

Bread or dough as the primary vehicle. Includes flatbreads and western sandwich forms.

Protein-based

Curry, Roast, Grilled, Tandoori, Kebab, Rezala, Bhuna

Meat, seafood, or legume as the core. Cooking method (grilled vs. curried vs. roasted) defines the sub-category.

Noodle / Pasta

Noodles, Chow Mein, Ramen, Pasta, Risotto

Carbohydrate-based dishes where noodle or pasta is the primary element, distinct from rice-based dishes.

Snack / Street

Fuchka, Spring Roll, Samosa, Shawarma, Tacos

Portable, hand-held, or street-food-context items. Defined by consumption context as much as preparation.

Liquid / Bowl

Soup, Salad, Haleem, Dal

Liquid-primary or loose-form dishes. Bowl-format items consumed with a spoon rather than hands or bread.

04 - Classification Rules

The decision logic that makes the taxonomy consistent

A taxonomy is only as good as the rules that enforce it. Anyone labelling an item - whether an ops team member, a restaurant owner, or a machine learning model - needs an unambiguous decision path. I designed a four-question decision tree and a set of conflict-resolution rules for every edge case.

The four-question decision tree

Q1

Does the name refer to a region, country, or cooking tradition?

If yes → it's a Cuisine identifier, not a Dish name. "Thai", "Chinese", "Bengali", "Mughlai", "Italian" are cuisines. They go in the Cuisine field. The Dish field still needs to be filled separately. Example: "Thai Soup" → Cuisine: Thai, Dish: Soup.

Q2

Does the name describe a specific preparation that can be eaten as a complete item?

If yes → it contains a Dish identifier. Look for preparation and dish-type words: fried, grilled, curry, soup, roll, wrap, biryani, tehari. These indicate what the food is. "Chicken Biryani" → Dish: Biryani. "Grilled Chicken Salad" → Dish: Grilled Chicken Salad (Cuisine: Continental).

Q3

Does the name contain both a cuisine signal AND a dish signal?

Split them. "Korean Kimchi Fried Rice" → Cuisine: Korean, Dish: Kimchi Fried Rice. "Mughlai Chicken" → Cuisine: Mughlai, Dish: Chicken Curry (the name gives only the protein, so the dish type is inferred from context). The dish label captures what it is; the cuisine label captures where it's from.

Q4

Does the name combine elements from two distinct culinary traditions intentionally?

If yes → Cuisine: Fusion. This is the category for deliberate hybrids - "Tandoori Chicken Pizza", "Desi Pasta", "Korean BBQ Burger". Fusion is not "I don't know the cuisine" - it's a positive classification for a specific type of cross-cultural dish.
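
The four questions can be sketched as a single function. This is an illustrative sketch under stated assumptions, not the labelling tool itself: the vocabularies `CUISINES`, `DISHES`, and `FUSION_MARKERS` are tiny hypothetical lists, matching is naive substring search, and the fusion check runs before the cuisine check so that a name like "Korean BBQ Burger" is not read as plain Korean.

```python
from typing import Optional, Tuple

# Illustrative vocabularies only; a real list would come from the taxonomy.
CUISINES = ["Thai", "Chinese", "Bengali", "Mughlai", "Italian", "Korean"]
# Longer names first so "Kimchi Fried Rice" is matched before "Fried Rice".
DISHES = ["Kimchi Fried Rice", "Fried Rice", "Biryani", "Soup", "Pizza", "Pasta", "Burger"]
# Hypothetical Q4 markers: a word from another tradition attached to a dish.
FUSION_MARKERS = {"Pizza": ["tandoori"], "Pasta": ["desi"], "Burger": ["korean bbq"]}


def first_match(candidates, text: str) -> Optional[str]:
    """Return the first candidate whose lowercase form occurs in text."""
    for candidate in candidates:
        if candidate.lower() in text:
            return candidate
    return None


def classify(name: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (cuisine, dish); None marks a label the name alone cannot supply."""
    text = name.lower()
    dish = first_match(DISHES, text)  # Q2: dish signal
    # Q4 is checked before Q1 so "Korean BBQ Burger" is not read as Korean.
    if dish and any(marker in text for marker in FUSION_MARKERS.get(dish, [])):
        return "Fusion", dish
    cuisine = first_match(CUISINES, text)  # Q1 / Q3: cuisine signal, split from the dish
    return cuisine, dish
```

A name such as "Chicken Biryani" yields a dish but no cuisine from this function; in that situation the cuisine has to come from another source, such as dish-level defaults or human review.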

Classification examples - applied to real menu data

Menu item as entered	Dish name	Cuisine	Classification signal
Chicken Biryani	Biryani	Indian / Mughlai	"Biryani" is the dish. Indian subcontinent origin.
Beef Tehari	Tehari	Bengali	"Tehari" is a specific Bengali rice dish with beef.
Thai Soup	Soup	Thai	"Thai" = cuisine. "Soup" = dish. Both extracted from the name.
Chicken Pasta Alfredo	Pasta Alfredo	Italian	Alfredo = Italian preparation. Pasta = dish type.
Vegetable Fried Rice	Fried Rice	Chinese	"Fried Rice" = dish. Chinese cooking method and flavour profile.
Chicken Shawarma Wrap	Shawarma	Middle Eastern	"Shawarma" is a named Middle Eastern dish. "Wrap" is the format.
Mutton Rezala	Rezala	Mughlai	"Rezala" is a defined Mughlai curry preparation. Not generic curry.
Fuchka with Tamarind Water	Fuchka	Bengali	"Fuchka" is a specific Bengali street food. Tamarind water is the accompaniment.
Tandoori Chicken Pizza	Pizza	Fusion	Italian dish format (pizza) + Indian topping style (tandoori). Deliberate fusion.
Egg Toast Sandwich	Sandwich	Continental	Sandwiches are Western/Continental. Egg toast = preparation variant.
Panta Ilish	Panta Ilish	Bengali	Fermented rice with hilsha - a complete Bengali cultural dish. Name = dish name.
Creamy Mushroom Risotto	Risotto	Italian	"Risotto" is a named Italian dish. Mushroom is the variant.
chkn briyani	Biryani	Indian / Mughlai	Typo-normalised to canonical form "Chicken Biryani". Classified as above.
Korean Kimchi Fried Rice	Kimchi Fried Rice	Korean	"Korean" = cuisine. "Kimchi Fried Rice" = specific dish. Both extracted.

05 - Cleaning Process

How 100,000+ items were cleaned at scale

The taxonomy design solved the "what to label" question. The process design solved the "how to label at scale" question. A purely manual process would have taken months and produced inconsistent results across labellers. A purely automated process would have missed the ambiguous cases that require contextual judgement. The answer was a hybrid pipeline.

1

Audit - understand the scope before designing the solution

Pulled a stratified sample of 2,000 items across restaurant types, cities, and cuisine categories. Classified each manually against a draft taxonomy. This audit revealed the five problem types (name inconsistency, cuisine-as-dish confusion, missing labels, Banglish variation, fusion ambiguity) and shaped the final taxonomy design. The process was designed around what the data actually looked like - not an assumed ideal state.
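
A proportional stratified sample of this kind can be drawn in a few lines. The sketch below is an assumption about how such a sample might be taken, not a record of the tooling used: the function name, the `key` callback, and the rule of at least one item per stratum are all illustrative.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, total, seed=0):
    """Sample roughly `total` items, allocating to each stratum by its share of the data."""
    rng = random.Random(seed)  # fixed seed so an audit sample can be reproduced
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        # At least one item per stratum so small groups are still audited.
        quota = max(1, round(total * len(members) / len(items)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample
```

To stratify on several dimensions at once (restaurant type, city, cuisine category), `key` can return a tuple such as `(item["city"], item["restaurant_type"])`; those field names are hypothetical.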

2

Rule-based auto-classification for high-confidence items

Built a rule engine covering the most common patterns - items containing "Biryani", "Tehari", "Fried Rice", "Shawarma", "Ramen", "Pizza" and their known Banglish variants. These patterns covered ~60% of the item catalogue with high confidence. Auto-labelled with a confidence score, flagged for spot-check rather than full manual review.
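
A rule engine of this shape can be sketched with regular expressions. Everything specific below is an assumption for illustration: the patterns cover only a few spelling variants, and the confidence values (0.95 for the canonical spelling, 0.80 for a variant) are placeholders rather than the scores actually used.

```python
import re
from typing import NamedTuple, Optional


class AutoLabel(NamedTuple):
    dish: str
    confidence: float


# (canonical dish, pattern). Patterns are illustrative and far from exhaustive.
RULES = [
    ("Biryani", re.compile(r"\bb[ie]?r[iy]{1,2}ani\b")),  # biryani, biriyani, briyani
    ("Tehari", re.compile(r"\bteh[ae]ri\b")),
    ("Fried Rice", re.compile(r"\bfried\s+rice\b")),
    ("Shawarma", re.compile(r"\bsh[ao]w[ae]rma\b")),
    ("Ramen", re.compile(r"\bramen\b")),
    ("Pizza", re.compile(r"\bpizza\b")),
]


def auto_label(name: str) -> Optional[AutoLabel]:
    """Return a dish label with a confidence score, or None to route to human review."""
    text = name.lower()
    for dish, pattern in RULES:
        if pattern.search(text):
            # Canonical spelling scores higher than a matched variant (placeholder values).
            confidence = 0.95 if dish.lower() in text else 0.80
            return AutoLabel(dish, confidence)
    return None
```

Returning `None` rather than a low-confidence guess keeps the split between auto-labelled items and items sent to human review explicit.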

3

Labeller guidelines and training for human review of ambiguous items

Created a labelling guide (the taxonomy document shared with ops) with the four-question decision tree, 20+ worked examples across cuisine and dish types, explicit handling instructions for Banglish names, and a Fusion classification checklist. Ops team trained on the guide before labelling the ambiguous ~40% of items that the rule engine could not classify with confidence.
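
The Fusion checklist lends itself to a mechanical check at labelling time. The sketch below assumes the two criteria stated in the risks section of this case study (a deliberate blend, and two identifiable distinct traditions, with both parent cuisines recorded); the field names `parent_cuisines` and `deliberate_blend` are hypothetical.

```python
def validate_fusion(label: dict) -> list[str]:
    """Return checklist problems for a Fusion label; an empty list means it passes."""
    if label.get("cuisine") != "Fusion":
        return []  # the checklist only applies to Fusion labels
    problems = []
    parents = label.get("parent_cuisines") or []
    # Both parent cuisines must be recorded, and they must differ.
    if len(parents) != 2 or len(set(parents)) != 2 or "Fusion" in parents:
        problems.append("Fusion needs exactly two distinct parent cuisines")
    if not label.get("deliberate_blend"):
        problems.append("Fusion must be a deliberate blend")
    return problems
```

Requiring the parent cuisines as data, rather than as a note in the guide, is what stops Fusion from absorbing items a labeller is simply unsure about.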

4

Banglish normalisation pass

Separate pass specifically for Banglish variants - mapping "khichudi", "khichri", "Khichuri", "ক্ষিচুড়ি" to the canonical dish label "Khichuri". This normalisation table was built collaboratively with Bangladeshi team members who had domain knowledge of regional spelling variants. The canonical form was set as the searchable and displayable label.
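
A normalisation table like this is, at its simplest, a dictionary from variant to canonical label. The sketch below is a minimal illustration using only variants quoted in this case study; the names `canonical_dish` and `VARIANT_TO_CANONICAL` are assumptions. Keys are Unicode-normalised (NFC) and case-folded so that differently encoded forms of the same Bengali letter, and differently cased Latin spellings, map to one key.

```python
import unicodedata
from typing import Optional


def _key(text: str) -> str:
    """Normalise a token so equivalent encodings and casings share one key."""
    return unicodedata.normalize("NFC", text).casefold().strip()


# Canonical label -> known variants (examples taken from this document only).
_RAW_VARIANTS = {
    "Khichuri": ["khichuri", "khichudi", "khichri", "ক্ষিচুড়ি"],
    "Biryani": ["biryani", "biriyani", "briyani"],
}

VARIANT_TO_CANONICAL = {
    _key(variant): canonical
    for canonical, variants in _RAW_VARIANTS.items()
    for variant in variants
}


def canonical_dish(token: str) -> Optional[str]:
    """Return the canonical dish label, or None when the variant is not in the table."""
    return VARIANT_TO_CANONICAL.get(_key(token))
```

A `None` result is useful in its own right: unmatched tokens can be logged and reviewed, which is how a table like this grows over time.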

5

QA - inter-labeller agreement check

A 5% random sample of manually labelled items was re-labelled by a second labeller blind to the first label, and the agreement rate between the two was measured. Items with disagreement were escalated for a tie-breaking review. This produced a quality metric for the cleaned dataset and identified ambiguous categories that needed additional rule clarification.
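
The agreement check reduces to comparing two sets of labels over the same items. This is a minimal sketch with hypothetical names; it computes raw percent agreement, which is the simplest reading of "agreement rate" and does not correct for agreement by chance (a statistic such as Cohen's kappa would).

```python
def agreement_report(first: dict, second: dict):
    """Compare two labellers' (cuisine, dish) labels keyed by item id.

    Returns (agreement_rate, ids_to_escalate) over the items both labelled.
    """
    shared = sorted(first.keys() & second.keys())
    disagreements = [item_id for item_id in shared if first[item_id] != second[item_id]]
    rate = 1 - len(disagreements) / len(shared) if shared else 0.0
    return rate, disagreements
```

The list of disagreeing ids doubles as the escalation queue for tie-breaking review, and grouping it by category shows where the rules need clarifying.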

6

Onboarding enforcement - prevent new data rot

Modified the restaurant onboarding flow to make Cuisine and Dish classification mandatory fields with a constrained dropdown (not free text). New items now enter the system pre-classified. The dropdown options are the taxonomy categories. Restaurants can request a new category via a flagging mechanism, reviewed monthly by the product team.
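
Server-side, the enforcement amounts to rejecting any item whose labels are missing or fall outside the dropdown options. The sketch below is illustrative only: the class name, field names, and the short option lists are assumptions, and the request queue stands in for whatever flagging mechanism the real flow uses.

```python
from dataclasses import dataclass, field

# Illustrative dropdown options; the real options are the taxonomy categories.
ALLOWED = {
    "cuisine": {"Bengali", "Thai", "Mughlai", "Italian", "Fusion", "Fast Food"},
    "dish": {"Biryani", "Tehari", "Soup", "Pizza", "Fried Rice"},
}


@dataclass
class OnboardingValidator:
    # New-category requests awaiting the monthly review (hypothetical queue).
    category_requests: list = field(default_factory=list)

    def validate(self, item: dict) -> list[str]:
        """Return validation errors; an empty list means the item may be saved."""
        errors = []
        for label, options in ALLOWED.items():
            value = (item.get(label) or "").strip()
            if not value:
                errors.append(f"{label} is required")
            elif value not in options:
                errors.append(f"{label} '{value}' is not a taxonomy option")
        return errors

    def request_category(self, label: str, value: str) -> None:
        """Record a restaurant's request for a category missing from the dropdown."""
        self.category_requests.append((label, value))
```

Validating on the server as well as in the dropdown UI matters because bulk uploads and API clients can bypass the form.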

06 - Edge Cases & Hard Decisions

The cases that challenged the taxonomy

A taxonomy is only trustworthy if it has documented answers for the hard cases. These were the most contested classification decisions, and the reasoning behind each one.

Hard case 1

Butter Naan with Chicken Korma

Decision: Cuisine: Mughlai. Dish: Chicken Korma. Naan is the accompaniment, not the primary dish. The meal-defining item is the Korma. Accompaniments are not classified as the dish when a more specific protein/main dish is named alongside them.

Hard case 2

Prawn Curry

Decision: Dish: Curry. Cuisine: depends on preparation - mustard-based prawn curry = Bengali. Coconut-based = Indian. When the preparation style is unspecified, cuisine is assigned based on the restaurant's primary cuisine profile. Context wins over name alone.

Hard case 3

Paneer Butter Masala

Decision: Cuisine: Indian (North Indian specifically). Dish: Paneer Butter Masala - the full name is the dish name here, as it is a named and widely recognised specific curry. Not just "Curry". The specificity of the named preparation takes precedence.

Hard case 4

Falafel Wrap with Hummus

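Decision: Cuisine: Middle Eastern. Dish: Falafel Wrap. Hummus is the accompaniment. "Falafel Wrap" is the primary food item. The "with X" construction normally subordinates the second item to an accompaniment role and does not change the dish classification. The exception is hard case 1, where the first item (Butter Naan) is itself the accompaniment and the named main dish follows "with".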
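
Hard cases 1 and 4 both turn on a "with" construction, and the two decisions can be captured in one small function. This is a sketch under stated assumptions: the `ACCOMPANIMENTS` set is a short hypothetical list, and the rule is that the part before "with" is the main item unless it is itself a known accompaniment.

```python
import re

# Hypothetical list of items that act as accompaniments rather than mains.
ACCOMPANIMENTS = {"butter naan", "naan", "paratha", "roti", "hummus", "tamarind water"}


def primary_item(name: str) -> str:
    """Return the part of an 'A with B' name that should drive the dish label."""
    parts = re.split(r"\s+with\s+", name.strip(), maxsplit=1, flags=re.IGNORECASE)
    if len(parts) == 1:
        return parts[0]  # no "with" construction
    head, tail = parts[0].strip(), parts[1].strip()
    # Hard case 1: an accompaniment named first yields to the main dish after "with".
    if head.lower() in ACCOMPANIMENTS and tail.lower() not in ACCOMPANIMENTS:
        return tail
    return head  # hard case 4: the item before "with" is the main
```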

Hard case 5

Fast Food classification

Decision: Fast Food is a style-based cuisine, not a residual category. "Hot N' Crispy Chicken" = Cuisine: Fast Food, Dish: Fried Chicken. Fast Food has its own preparation conventions (deep-fried, battered, quick-serve) that constitute a distinct culinary identity regardless of origin.

Hard case 6

Pizza - Italian or Fast Food?

Decision: Origin = Italian. But in most Bangladeshi menu contexts, Pizza is served as Fast Food. Classification uses restaurant context as the tiebreaker - a pizza from a traditional Italian restaurant = Italian. A pizza from a fast food cloud kitchen = Fast Food. Context over strict origin.
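
Hard cases 2 and 6 both use restaurant context as a tiebreaker, which can be sketched as a fallback chain. The function below is an illustration, not the documented rule set: the preparation keywords, the function name, and the choice to return `None` (escalate to a human) for situations these two cases do not cover are all assumptions.

```python
from typing import Optional

# Preparation hints for Curry, taken from hard case 2.
PREPARATION_HINTS = {"mustard": "Bengali", "coconut": "Indian"}


def resolve_cuisine(dish: str, item_name: str, restaurant_cuisine: Optional[str]) -> Optional[str]:
    """Pick a cuisine for a context-dependent dish; None means escalate to review."""
    text = item_name.lower()
    if dish == "Curry":
        # A stated preparation style wins over restaurant context.
        for keyword, cuisine in PREPARATION_HINTS.items():
            if keyword in text:
                return cuisine
        return restaurant_cuisine  # unspecified preparation: use the restaurant profile
    if dish == "Pizza":
        # Hard case 6 only documents the Italian and Fast Food contexts.
        if restaurant_cuisine in ("Italian", "Fast Food"):
            return restaurant_cuisine
        return None
    return None  # other dishes are outside these two hard cases
```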

07 - AI & Product Downstream

What this data layer unlocks

The cleaned taxonomy is not a finished product - it is infrastructure. Its value is entirely in what becomes possible on top of it. This is the concrete list of product capabilities that were either unblocked or significantly improved by the existence of clean, structured food data.

🧠

Cuisine Affinity Model

Train a user-level cuisine preference model: "this user orders Bengali cuisine 60% of the time at lunch." Cuisine label is the primary feature. Impossible without it.

Enabled by this work
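
The "60% Bengali at lunch" figure is a per-user, per-meal-slot share of orders by cuisine. The sketch below shows only that feature computation, not a trained model; the tuple layout `(user_id, meal_slot, cuisine)` and the function name are assumptions.

```python
from collections import Counter, defaultdict


def cuisine_shares(orders):
    """orders: iterable of (user_id, meal_slot, cuisine) tuples.

    Returns {(user_id, meal_slot): {cuisine: share_of_orders}}.
    """
    counts = defaultdict(Counter)
    for user_id, meal_slot, cuisine in orders:
        counts[(user_id, meal_slot)][cuisine] += 1
    return {
        key: {cuisine: n / sum(counter.values()) for cuisine, n in counter.items()}
        for key, counter in counts.items()
    }
```

None of this is computable without the cuisine label on each ordered item, which is the sense in which the label is the primary feature.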

🔍

Intent-Aware Search

Search for "biryani" returns all items with Dish: Biryani - regardless of how the restaurant named it. Semantic search grounded in taxonomy, not keyword matching.

Enabled by this work
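
The difference from keyword search is that the query is resolved to a dish label first, and items are matched on that label rather than on their names. A minimal sketch with hypothetical data and names:

```python
# Hypothetical labelled catalogue; names are messy, labels are clean.
ITEMS = [
    {"name": "chkn briyani", "dish": "Biryani"},
    {"name": "Special Chicken Biriyani", "dish": "Biryani"},
    {"name": "Biryani House Special Burger", "dish": "Burger"},
    {"name": "Beef Tehari", "dish": "Tehari"},
]

# Query term -> dish label (illustrative; would reuse the normalisation table).
QUERY_TO_DISH = {"biryani": "Biryani", "biriyani": "Biryani", "tehari": "Tehari"}


def search_by_dish(query: str, items=ITEMS) -> list[str]:
    """Return names of items whose Dish label matches the query's dish."""
    dish = QUERY_TO_DISH.get(query.strip().lower())
    if dish is None:
        return []  # unknown query: a candidate for the unmatched-query log
    return [item["name"] for item in items if item["dish"] == dish]
```

Here the misspelled "chkn briyani" is found and the burger that merely mentions "Biryani" in its name is not, which is the behaviour keyword matching cannot provide.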

🗂️

Cuisine-Based Collections

Homepage collections like "Bengali Classics" or "Quick Bites" can now be generated automatically from cuisine and dish labels - not manually curated lists that go stale.

Enabled by this work

🎯

Personalised Discovery

Surface restaurants to a user based on their cuisine and dish-type preference history - not just proximity or popularity. "You usually order Thai on Fridays - here are options near you."

Enabled by this work

🥗

Dietary Preference Detection

Dish labels enable vegetarian/vegan/halal detection by type. "Vegetable Spring Roll" (Dish: Spring Roll) + ingredient context = reliably flagged as vegetarian. Protein labels enable meat-type filtering.

Planned - next phase

📊

Restaurant Benchmarking

Compare a restaurant's menu coverage against cuisine peers. "Your Bengali menu has 12 items but competitors average 24 - here are the most-ordered missing dishes in your category." Ops and partner intelligence tool.

Planned - next phase

The AI compounding effect: Each new piece of user behaviour data (an order placed, a search run, a collection tapped) is now tagged with structured cuisine and dish labels at the moment it's generated. This means the training dataset for every future model grows richer with every order placed - structured labels accumulate automatically. The taxonomy doesn't just unlock current features; it continuously improves the quality of all future models.

08 - Success Metrics

How success was defined and measured

Metric area	Metric	Target	Note
Coverage	% of active menu items with both Cuisine + Dish labels	95%+	Target coverage of active item catalogue
Quality	Inter-labeller agreement rate on QA sample	90%+	Agreement threshold for taxonomy trustworthiness
Prevention	New items entering system without classification	0%	Post-onboarding enforcement launch
Search impact	Cuisine-filtered search result relevance score	Baseline → track	Pre/post taxonomy deployment comparison
AI readiness	Feature completeness for cuisine affinity model	100%	All required training features available post-cleaning
Maintenance	Monthly unclassified item backlog	< 500	Ongoing - reviewed and cleared monthly
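
The coverage and backlog metrics are simple counts over the item catalogue. The sketch below assumes each item is a dictionary with hypothetical `active`, `cuisine`, and `dish` fields; the thresholds are the targets stated above.

```python
def is_labelled(item: dict) -> bool:
    """An item counts as labelled only when both Cuisine and Dish are filled."""
    return bool(item.get("cuisine")) and bool(item.get("dish"))


def coverage(items) -> float:
    """Share of active items that carry both labels (0.0 when nothing is active)."""
    active = [item for item in items if item.get("active")]
    if not active:
        return 0.0
    return sum(1 for item in active if is_labelled(item)) / len(active)


def unclassified_backlog(items) -> int:
    """Number of active items still missing at least one label."""
    return sum(1 for item in items if item.get("active") and not is_labelled(item))


def meets_targets(items) -> bool:
    """Check the coverage (95%+) and backlog (< 500) targets together."""
    return coverage(items) >= 0.95 and unclassified_backlog(items) < 500
```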

09 - Risks & Mitigations

What could go wrong

High Risk

Taxonomy drift - categories become stale as menus evolve

New food trends (Korean fried chicken, cloud kitchen fusions) emerge faster than taxonomy review cycles. Mitigated by a monthly new-category review process and a restaurant-facing "flag unknown category" mechanism that surfaces gaps proactively.

High Risk

Labeller inconsistency at scale

Human labellers make different judgements on ambiguous items without explicit guidance. Mitigated by the decision tree documentation, worked examples for every edge case type, and the inter-labeller QA check that catches systematic disagreements before they propagate.

Medium Risk

Restaurant resistance to classification enforcement

Restaurant owners may experience the mandatory classification step as friction in onboarding. Mitigated by making the dropdown UI fast (type-ahead search), pre-populating common choices based on restaurant name/type, and framing classification as a discoverability benefit for the restaurant - "your items appear in cuisine-filtered searches".

Medium Risk

Banglish coverage gaps

The Banglish normalisation table covers known variants but will miss new or hyperlocal spelling forms. Mitigated by logging unmatched search queries - a user searching "khichudi" who gets zero results surfaces a coverage gap that feeds back into the normalisation dictionary.

Low Risk

AI model overfits to early taxonomy choices

If the taxonomy is later revised, models trained on old labels need retraining. Mitigated by versioning taxonomy definitions and logging the taxonomy version against every labelled item - model retraining scope is queryable when the taxonomy changes.

Low Risk

Over-classification of Fusion

"Fusion" becoming a catch-all for "I'm not sure" rather than a positive classification of genuine hybrids. Mitigated by the explicit Fusion checklist (must meet both criteria: deliberate blend, two identifiable distinct culinary traditions) and the requirement that labellers cannot select Fusion without marking both parent cuisines.

10 - Lessons & Reflections

What this work taught me about data as a product

01

Data quality is a product problem, not a data engineering problem

The dish cleaning initiative only happened because a PM owned it. Data teams can build pipelines; they cannot define what "correct" means for a food taxonomy. The definition of correct is a product decision - it depends on how the data will be used, what features it needs to power, and what user experiences it needs to enable. A PM has to own that definition or it won't be made correctly.

02

The taxonomy is a product, and it needs a PRD

The labelling guide I wrote for ops wasn't just documentation - it was a product specification. Every edge case decision in it was a product decision with downstream consequences. Treating the taxonomy document with the same rigour as a feature PRD - with explicit decision logic, worked examples, and conflict-resolution rules - was what made the cleaned data trustworthy enough to build AI features on.

03

Prevention is worth more than cleaning

The 100,000-item backlog took months to clean. The onboarding enforcement change that prevents new unclassified items takes minutes per new restaurant and keeps paying off indefinitely. The ratio of investment to return is completely inverted between retrospective cleaning and prospective enforcement. Cleaning the backlog was necessary; the onboarding fix was the real long-term solution.

04

Edge cases are not exceptions - they are the rule

In food data, the "easy" cases (Chicken Biryani → Mughlai / Biryani) are a minority. The majority of decisions involve regional variants, Banglish spelling, ambiguous fusion items, accompaniment-vs-main confusion, and context-dependent cuisine assignment. A taxonomy that only handles the easy cases is a taxonomy that produces unreliable data. The value is in the documented edge case decisions.

05

Structured data is the moat, not the model

Any company can access the same ML frameworks and foundation models. What they cannot copy is years of structured, labelled, domain-specific food data built on a well-designed taxonomy with Bangladeshi culinary context encoded into every decision. The real competitive advantage in AI is the training data quality, not the model architecture. This is what dish cleaning was actually building - a data moat.
