Chapter 31: Data Pipelines - Where Models Are Born and Die
Every machine learning paper begins with a model architecture. The paper describes layers, attention mechanisms, and training procedures. It reports accuracy on benchmarks and compares to baselines. The data section is an afterthought: "We trained on ImageNet" or "We used Common Crawl."
But in production, data is everything. Models are commodities: you can call GPT-4 through an API or download BERT or ResNet in minutes. What you cannot download is your data. Your users, your products, your domain. The data pipeline (how you collect, clean, label, and maintain data) determines whether your model succeeds or fails.
This chapter explains why data pipelines are where models are truly born, and where they die. Most ML failures are data failures. Understanding data pipelines is understanding the hardest part of production machine learning.
Collection: Where Data Comes From
Before a model can learn anything, data must exist. Collection is the first stage of the pipeline, and it shapes everything that follows. What you collect determines what your model can learn. What you miss determines what your model cannot learn.
Data sources vary by domain:
Web scraping and crawling: Search engines crawl billions of web pages. Language models are trained on internet text: Reddit, Wikipedia, GitHub, blogs, news sites. The quality and biases of these sources become the quality and biases of the model. Common Crawl contains misinformation, hate speech, and copyrighted content alongside valuable information.
User interactions: Recommendation systems learn from clicks, watches, and purchases. Search engines learn from queries and click-through rates. Social media learns from likes, shares, and follows. This data is valuable because it reflects real user behavior, but it is inherently noisy: users click by mistake, engagement is not always endorsement, and bots generate fake interactions.
Sensor data: Self-driving cars collect camera, lidar, and radar data. Medical devices collect physiological signals. IoT devices collect environmental measurements. Sensor data is high-volume and high-dimensional, requiring significant storage and processing infrastructure.
Manual curation: Some datasets are hand-built. ImageNet was created by humans labeling millions of images. Medical datasets require expert clinicians to annotate scans. Legal datasets require lawyers to label documents. Manual curation is expensive but produces higher-quality data.
Collection biases are inevitable. What gets collected is not a neutral sample of realityâit is what is easy to collect, valuable to collect, or legal to collect:
- Geographic bias: Most data comes from wealthy, English-speaking countries. Models trained on this data perform worse in other regions.
- Demographic bias: Medical data overrepresents certain demographics. Facial recognition datasets historically underrepresented darker skin tones.
- Platform bias: Social media data reflects the users of that platform, not the general population. Twitter users are younger and more politically engaged than average.
- Temporal bias: Data collected in one time period may not represent current reality. Fashion, language, and behavior change.
Example: Self-driving cars and corner cases
Waymo's self-driving cars have driven millions of miles, but most of those miles are on sunny California highways. The data does not include enough snow, heavy rain, or rural roads. When deployed in new environments, the models encounter situations underrepresented in training data: construction zones with unusual lane markers, pedestrians behaving unexpectedly, road hazards not seen before.
This is not a model architecture problem. Adding more layers does not help. The problem is data: the training set does not cover the full distribution of real-world driving scenarios. Collection must actively seek rare but important cases.
Data quality issues appear at collection:
- Missing values: Sensors fail, users skip form fields, APIs return incomplete records
- Duplicates: Same data point appears multiple times (web crawling, user re-submissions)
- Noise and outliers: Faulty sensors, data entry errors, adversarial manipulation
- Inconsistent formats: Dates in different formats, text encodings, schema changes over time
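The quality checks above can be sketched as a small audit pass over raw records. This is a minimal illustration, not a production validator; the `amount` field and the 3-sigma outlier rule are hypothetical choices for the sketch:

```python
from collections import Counter

def audit_records(records, required_fields):
    """Flag common collection-time quality issues in a batch of records.

    Counts missing values for required fields, exact-duplicate records,
    and crude outliers (numeric 'amount' values more than 3 standard
    deviations from the mean).
    """
    missing = sum(
        1 for r in records
        for f in required_fields
        if r.get(f) is None
    )
    # Duplicates: identical records appearing more than once
    keys = [tuple(sorted(r.items())) for r in records]
    duplicates = sum(c - 1 for c in Counter(keys).values() if c > 1)
    # Crude outlier check on one numeric field
    amounts = [r["amount"] for r in records if r.get("amount") is not None]
    mean = sum(amounts) / len(amounts)
    std = (sum((a - mean) ** 2 for a in amounts) / len(amounts)) ** 0.5
    outliers = [a for a in amounts if std and abs(a - mean) > 3 * std]
    return {"missing": missing, "duplicates": duplicates, "outliers": len(outliers)}
```

A real pipeline would run checks like this at ingestion time and alert when the counts spike, rather than after training fails.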
Cleaning data is necessary, but cleaning cannot recover information that was never collected. If training data lacks diversity, no amount of cleaning produces a model that generalizes to diverse inputs.
Labeling: The Human Bottleneck
Most supervised learning requires labeled data: inputs paired with ground truth outputs. For many tasks, labels cannot be collected automatically: they require human judgment. Labeling is the bottleneck that determines how fast you can improve your model and how accurate your ground truth is.
The labeling process:
- Task definition: Define what annotators should label and how. Ambiguous instructions lead to inconsistent labels.
- Annotator selection: Choose experts (doctors labeling medical scans) or crowdworkers (labeling objects in images). Experts are expensive and slow, crowdworkers are cheap and fast but less reliable.
- Annotation: Humans review data points and assign labels. This is tedious, time-consuming work.
- Quality control: Check inter-annotator agreement. If two annotators label the same image differently, the task is ambiguous or instructions are unclear.
- Review and iteration: Experts review crowdworker labels to catch errors.
Labeling is expensive. ImageNet cost $50,000+ in 2009 (leveraging Amazon Mechanical Turk at scale). Medical datasets require specialist physicians charging hundreds of dollars per hour. Legal document labeling requires lawyers. For most companies, labeling is the largest ML cost by far.
Label quality varies.
Inter-annotator agreement measures consistency: if two annotators label the same data, do they agree? For clear tasks like "Is this a cat?", agreement is high (~95%). For subjective tasks like "Is this comment toxic?", agreement is much lower (60-70%). Low agreement means the ground truth is ambiguous: the model cannot learn what "correct" means if humans disagree.
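Raw percent agreement overstates consistency, because two annotators agree by chance some of the time; agreement is usually reported chance-corrected, for example as Cohen's kappa. A minimal sketch for two annotators:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for
    the agreement expected by chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both independently pick each class
    classes = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates a well-defined task; a kappa far below the raw agreement rate signals that much of the apparent consistency is chance.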
Example: Medical diagnosis labeling
A chest X-ray is labeled by three radiologists:
- Radiologist A: "Pneumonia, left lower lobe"
- Radiologist B: "Possible infiltrate, unclear"
- Radiologist C: "No acute findings"
Which label is correct? All three are board-certified experts. The ground truth is not a physical fact but an expert judgment call. Models trained on this data inherit this ambiguity. If labels disagree, the model learns to predict the average label, which may not be what any individual expert would say.
Active learning reduces labeling costs by smartly selecting which data to label. Instead of labeling all data, the model identifies:
- Uncertain examples: Data points where the model is least confident
- Diverse examples: Data points that cover different parts of the input space
- Disagreement: Data points where multiple models disagree
By labeling these informative examples first, active learning achieves similar accuracy with 10x less labeled data. But it requires an initial model to decide what to label, creating a chicken-and-egg problem.
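Uncertainty sampling, the simplest of the selection strategies above, can be sketched as ranking unlabeled examples by the entropy of the model's predicted class distribution. This assumes predictions arrive as lists of class probabilities; it is an illustration, not a full active-learning loop:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (0 = fully confident)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, budget):
    """Uncertainty sampling: return indices of the `budget` unlabeled
    examples whose predicted class distribution has the highest entropy."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:budget]
```

Diversity and disagreement-based selection replace the entropy score with distance-to-labeled-set or variance across an ensemble, but the skeleton (score, rank, take the top of the budget) is the same.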
Self-training and pseudo-labeling use the model to generate labels for unlabeled data, then retrain on this mixture. This works when the model is already good, but amplifies errors when the model is wrong. Pseudo-labels are cheaper than human labels but noisier.
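A minimal pseudo-labeling sketch: keep only predictions the model is confident about, which limits (but does not eliminate) the error amplification described above. The 0.95 threshold is an illustrative choice, not a recommendation:

```python
def pseudo_label(model_predict, unlabeled, threshold=0.95):
    """Generate (input, label) pairs from unlabeled data using the model's
    own high-confidence predictions.

    model_predict: function mapping an input to a list of class probabilities.
    Returns pairs whose top predicted probability meets the threshold.
    """
    pseudo = []
    for x in unlabeled:
        probs = model_predict(x)
        best = max(range(len(probs)), key=lambda i: probs[i])
        if probs[best] >= threshold:
            pseudo.append((x, best))
    return pseudo
```

The retraining set then mixes these pairs with human-labeled data; if the model's confident predictions are systematically wrong on some slice, the threshold does not protect you, which is the amplification risk.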
The labeling bottleneck is fundamental: supervised learning cannot outpace the rate at which humans can provide ground truth. This is why unsupervised learning (Chapter 22) and self-supervised learning (next-token prediction, Chapter 21) are so valuable: they learn without human labels.
Drift: When Reality Changes
Machine learning models assume that training data and deployment data come from the same distribution. This assumption is almost always false. The world changes. User behavior evolves. New products launch. Adversaries adapt. The model's training data becomes stale.
Data drift is the change in data distribution over time. There are two types:
Covariate shift: The input distribution P(x) changes, but the relationship P(y|x) stays the same.
Example: A fraud detection model trained on credit card transactions from 2020 is deployed in 2024. In 2020, most transactions were in-person; by 2024, most are online. The input distribution changed (more online transactions), but the fraud patterns are similar (phishing, stolen cards, fake merchants). The model sees a different mix of transaction types than it was trained on.
Concept drift: The relationship P(y|x) changes. The same input now has a different correct output.
Example: Fraudsters adapt. In 2020, they used technique A (carding). By 2024, they switched to technique B (account takeover). The model trained on technique A fails to detect technique B. The input might look similar (same transaction amounts, same purchase categories), but the fraud patterns are fundamentally different.
Covariate shift is easier to handle than concept drift. If only P(x) changes, you can sometimes reweight training data to match deployment, or retrain on recent data. If P(y|x) changes, your ground truth is wrong: you need new labels.
Detecting drift requires monitoring:
Statistical tests: Compare distributions of features in training vs production. KL divergence, Jensen-Shannon divergence, Kolmogorov-Smirnov test. If distributions diverge significantly, drift is occurring.
Model performance monitoring: Track accuracy, precision, recall on production data (requires labels for a sample). If performance degrades, either drift occurred or your model was never good.
Feature drift alerts: Monitor individual features for sudden changes. If "average transaction amount" shifts sharply from its historical range, something changed.
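The Kolmogorov-Smirnov test mentioned above compares the empirical distributions of a feature in training versus production. A dependency-free sketch of the two-sample KS statistic (in practice a library routine such as SciPy's `ks_2samp` would also supply a p-value):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical CDFs of the two samples (0 = identical,
    1 = completely separated)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Running this per feature against a training-time reference sample, and alerting when the statistic exceeds a tuned threshold, is a common drift-monitoring pattern.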
Retraining strategies combat drift:
- Periodic retraining: Retrain every month, quarter, or year on fresh data
- Continuous learning: Update the model online as new data arrives (risky: bad data poisons the model)
- Triggered retraining: Detect drift, then retrain
- Ensemble over time: Maintain multiple models trained on different time periods, blend predictions
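A triggered-retraining policy can be as simple as combining a drift score with a staleness budget; the thresholds below are illustrative, not recommendations:

```python
def should_retrain(drift_score, days_since_training,
                   drift_threshold=0.1, max_staleness_days=90):
    """Triggered retraining policy: retrain when measured drift exceeds a
    threshold, or when the model exceeds a staleness budget regardless of
    measured drift (drift metrics can miss slow concept shift)."""
    return (drift_score > drift_threshold
            or days_since_training > max_staleness_days)
```

The staleness backstop matters because distribution tests only see P(x); concept drift in P(y|x) can pass every input-distribution check.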
Retraining is expensive (compute, labeling, testing, deployment), so companies balance freshness vs cost. Google Search retrains ranking models continuously. A medical diagnosis model might retrain once a year.
Example: COVID-19 and medical models
Many medical prediction models failed during COVID-19. Models trained on pre-pandemic data assumed normal hospital patient distributions. When COVID patients flooded hospitals, the patient mix changed dramatically. Symptoms, demographics, co-morbidities: all shifted. Models predicting ICU admission or mortality gave unreliable results because P(x) and P(y|x) both changed. Models had to be retrained urgently on pandemic data.
This is concept drift at crisis speed. The models were not wrong: they were trained on a world that no longer existed.
Feedback Loops: When Models Poison Data
The most insidious data problem is the feedback loop: the model's predictions influence what data is collected next, which influences the next model, which influences future data, creating a cycle that can amplify bias or degrade quality.
How feedback loops form:
- Model makes predictions in production
- Users react to predictions (clicks, purchases, actions)
- User reactions are logged as new training data
- Model is retrained on data that includes its own influence
- The cycle repeats
Feedback loops can be positive (model improves over time) or negative (model degrades, bias amplifies).
Example: YouTube recommendation feedback loop
YouTube's recommendation algorithm suggests videos. Users click suggested videos more than random videos (the algorithm works). Click data is logged as training data: "User watched video after it was recommended."
Next training iteration: Videos that were recommended get more clicks (because they were recommended), so they appear more engaging. The model learns "recommend videos that were previously recommended." This creates a rich-get-richer dynamic: popular videos get recommended more, gaining more clicks, getting recommended even more.
The feedback loop can amplify bias: if the model initially recommends conspiracy theories to a small subset of users, those users click, the model learns "these users like conspiracy theories," recommends more, users watch more, and the model doubles down. The data no longer reflects organic user preferences: it reflects algorithmically shaped preferences.
Breaking the feedback loop:
- Randomization: Occasionally show random content to collect unbiased interaction data
- Holdout sets: Reserve some users for non-personalized experiences to measure organic behavior
- Causal inference: Use techniques like inverse propensity weighting to estimate what would happen without the model
- Logging policies: Record why the model made each prediction (the recommendation reason), enabling analysis of bias
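Inverse propensity weighting, mentioned above, reweights logged outcomes by how likely the logging policy was to produce them, recovering an unbiased estimate of what a different policy would have seen. A minimal sketch of the estimator (context-free for brevity; real systems condition on the user and query):

```python
def ips_value(logs, target_prob):
    """Inverse propensity scoring: estimate the average reward a new policy
    would earn, using only logs collected under a randomized logging policy.

    logs: list of (action, logging_prob, reward), where logging_prob is the
          probability the logging policy chose that action.
    target_prob: maps an action to the new policy's probability of choosing it.
    """
    return sum(target_prob(a) * r / p for a, p, r in logs) / len(logs)
```

This only works if the logging policy recorded its propensities and gave every action nonzero probability, which is exactly why the randomization and logging-policy practices above matter.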
Example: Search engine click data
Search engines use click data to improve ranking. If users click result #3 more than #2, perhaps #3 should be ranked higher. But users click #1 most because it is ranked #1 (position bias). The model learns "rank popular results higher," which makes them more popular, which makes the model rank them higher.
Over time, the rich get richer: established websites dominate rankings because they have historical click data. New, high-quality sites struggle to break in. The data reflects not just relevance but past ranking decisions.
Self-fulfilling prophecies occur when models change reality to match their predictions:
Credit scoring: A model predicts someone is high-risk, they are denied credit, they cannot build credit history, confirming the model's prediction.
Recidivism prediction: A model predicts someone will re-offend, they receive harsher sentencing, longer imprisonment increases likelihood of re-offense, confirming the model's prediction.
Hiring tools: A model predicts someone will succeed, they are hired, they receive mentorship and opportunities, confirming the model's prediction. Someone predicted to fail is not hired, never gets the chance, and the model is never proven wrong.
In these cases, the model's prediction changes the outcome it is predicting. The data is no longer ground truth: it is model-influenced reality.
Figure 31.1: Data pipeline with feedback loop. The model makes predictions, users react to those predictions, reactions are logged as new data, and the model is retrained on model-influenced data. This cycle can amplify biases and create self-fulfilling prophecies. Drift monitoring and randomization help break the loop.
Engineering Takeaway
Data quality determines model quality: garbage in, garbage out. No amount of model tuning fixes bad data. The architecture that gains 2% accuracy on ImageNet is useless if your training data is missing 50% of the categories you care about. In production, data engineering matters far more than model engineering. The teams that win are the teams that build better data pipelines.
Labeling is the bottleneck in supervised learning: active learning helps, but cannot eliminate human judgment. Labels are expensive, slow, and inconsistent. For many tasks, ground truth is subjective (content moderation, aesthetic quality, medical diagnosis). Models inherit label ambiguity. Active learning reduces labeling costs by focusing on informative examples, but you still need humans to provide truth. The dream of unsupervised learning is a dream of escaping the labeling bottleneck.
Data drift is inevitable: production models must be retrained or adapted. The world changes faster than models can keep up. Fraud patterns evolve, user preferences shift, new products launch. A model trained on last year's data is out of date. Continuous monitoring detects drift before performance degrades. Retraining is not optional: it is the price of staying relevant. Companies that treat models as "deploy and forget" are companies whose models die slowly.
Feedback loops can amplify bias: monitor data carefully and break reinforcing cycles. When models influence the data they are trained on, self-reinforcing loops form. Bias amplifies, diversity decreases, and the data reflects algorithmic decisions rather than ground truth. Randomization, causal inference, and holdout sets help break loops. The most dangerous feedback loops are invisible: you do not see the counterfactual of what would have happened without the model.
Data versioning is essential: reproducibility requires knowing what data was used to train each model. When a model fails in production, you need to know: What data was it trained on? Has that data changed? Can you reproduce the training run? Without data versioning (like DVC, Git LFS, or custom solutions), debugging is impossible. Every model should have a lineage: this model was trained on dataset v3.2, using hyperparameters X, on date Y. Treating data like code is treating machine learning like engineering.
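One lightweight form of data versioning is a content fingerprint recorded alongside each trained model; the same records always hash to the same version identifier. A sketch (order-independent hashing of JSON-serializable records is one design choice among several):

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Content hash of a dataset: the same records, in any order, produce
    the same fingerprint. Storing this with each trained model ties the
    model to the exact data it was trained on."""
    # Hash each record with canonical key ordering, then hash the sorted
    # list of per-record hashes so row order does not matter.
    row_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(row_hashes).encode()).hexdigest()
```

Tools like DVC do this at file granularity with storage and remotes on top; the principle is the same: a model's lineage record should contain a hash, not a mutable path.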
Pipeline monitoring catches problems before models fail: monitor data quality, distribution shifts, and labeling consistency. Models fail because data fails. Monitoring model accuracy is reactive: you see the problem after users do. Monitoring data quality is proactive: you see the problem before it reaches the model. Check for: sudden spikes in missing values, distribution shifts in key features, changes in label distribution, annotation agreement rates. If the pipeline breaks, the model will fail. Fix the pipeline, not the model.
Why most ML teams spend 80% of time on data, not models: the data pipeline is the product. Researchers spend 80% of time on models. Practitioners spend 80% of time on data. Collecting, cleaning, labeling, versioning, monitoring: this is where the work is. The model is the easy part. The data pipeline is the hard part. If someone says "I built an ML system," they mean "I built a data pipeline and attached a model to it." The model is the cherry on top. The data is the ice cream.
References and Further Reading
"Everyone Wants to Do the Model Work, Not the Data Work": Data Cascades in High-Stakes AI. Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021). CHI 2021
Why it matters: This paper documented "data cascades": compounding events where problems in data create downstream failures that multiply over time. Based on interviews with ML practitioners across the world, it revealed that data problems (poor labeling, collection bias, documentation gaps) cause most production failures, not model architecture. The paper emphasizes that data work is undervalued and under-resourced compared to model work, despite being the primary determinant of success. It is a wake-up call that data engineering is the real challenge in ML.
Hidden Technical Debt in Machine Learning Systems Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). NIPS 2015
Why it matters: This Google paper introduced the concept of "technical debt" in ML systems, showing that the model is a tiny part of the system, surrounded by configuration, data collection, feature extraction, monitoring, and serving infrastructure. Data dependencies are highlighted as particularly insidious: unstable data sources, legacy features, and undeclared consumers create hidden coupling. Changing data breaks models in non-obvious ways. The paper argues that managing data pipelines is harder than managing code, and that most ML system complexity is in data, not models.
How to Avoid Machine Learning Pitfalls: A Guide for Academic Researchers Lones, M. A. (2021). arXiv:2108.02497
Why it matters: This guide, aimed at researchers, covers common ML pitfalls, many of which are data problems: leakage (test data contaminating training), selection bias (non-random splits), overfitting to test sets, and ignoring distribution shift. It emphasizes that many "SOTA results" in papers are artifacts of data problems, not genuine model improvements. The guide is a checklist for avoiding subtle data issues that invalidate results, making it essential reading for anyone working with ML in research or production.
The next chapter examines the fundamental divide between training and inference: why models trained offline must perform online, why latency matters more than accuracy in production, and how deployment transforms constraints.