Chapter 31: Data Pipelines - Where Models Are Born and Die
Every machine learning paper begins with a model architecture. The paper describes layers, attention mechanisms, and training procedures. It reports accuracy on benchmarks and compares to baselines. The data section is an afterthought: "We trained on ImageNet" or "We used Common Crawl."
But in production, data is everything. Models are commodities: you can call GPT-4 through an API or download BERT or ResNet in minutes. What you cannot download is your data. Your users, your products, your domain. The data pipeline (how you collect, clean, label, and maintain data) determines whether your model succeeds or fails.
This chapter explains why data pipelines are where models are truly born, and where they die. Most ML failures are data failures. Understanding data pipelines is understanding the hardest part of production machine learning.
Collection: Where Data Comes From
Before a model can learn anything, data must exist. Collection is the first stage of the pipeline, and it shapes everything that follows. What you collect determines what your model can learn. What you miss determines what your model cannot learn.
Data sources vary by domain:
Web scraping and crawling: Search engines crawl billions of web pages. Language models are trained on internet text: Reddit, Wikipedia, GitHub, blogs, news sites. The quality and biases of these sources become the quality and biases of the model. Common Crawl contains misinformation, hate speech, and copyrighted content alongside valuable information.
User interactions: Recommendation systems learn from clicks, watches, and purchases. Search engines learn from queries and click-through rates. Social media learns from likes, shares, and follows. This data is valuable because it reflects real user behavior, but it is inherently noisy: users click by mistake, engagement is not always endorsement, and bots generate fake interactions.
Sensor data: Self-driving cars collect camera, lidar, and radar data. Medical devices collect physiological signals. IoT devices collect environmental measurements. Sensor data is high-volume and high-dimensional, requiring significant storage and processing infrastructure.
Manual curation: Some datasets are hand-built. ImageNet was created by humans labeling millions of images. Medical datasets require expert clinicians to annotate scans. Legal datasets require lawyers to label documents. Manual curation is expensive but produces higher-quality data.
Collection biases are inevitable. What gets collected is not a neutral sample of realityâit is what is easy to collect, valuable to collect, or legal to collect:
- Geographic bias: Most data comes from wealthy, English-speaking countries. Models trained on this data perform worse in other regions.
- Demographic bias: Medical data overrepresents certain demographics. Facial recognition datasets historically underrepresented darker skin tones.
- Platform bias: Social media data reflects the users of that platform, not the general population. Twitter users are younger and more politically engaged than average.
- Temporal bias: Data collected in one time period may not represent current reality. Fashion, language, and behavior change.
Example: Self-driving cars and corner cases
Waymo's self-driving cars have driven millions of miles, but most of those miles are on sunny California highways. The data does not include enough snow, heavy rain, or rural roads. When deployed in new environments, the models encounter situations underrepresented in training data: construction zones with unusual lane markers, pedestrians behaving unexpectedly, road hazards not seen before.
This is not a model architecture problem. Adding more layers does not help. The problem is data: the training set does not cover the full distribution of real-world driving scenarios. Collection must actively seek rare but important cases.
Data quality issues appear at collection:
- Missing values: Sensors fail, users skip form fields, APIs return incomplete records
- Duplicates: Same data point appears multiple times (web crawling, user re-submissions)
- Noise and outliers: Faulty sensors, data entry errors, adversarial manipulation
- Inconsistent formats: Dates in different formats, text encodings, schema changes over time
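The quality checks above can be sketched as a small audit pass over raw records. This is a minimal illustration, not a production validator; the `amount` field and the 3-sigma outlier rule are hypothetical choices for the sketch:

```python
from collections import Counter

def audit_records(records, required_fields):
    """Flag common collection-time quality issues in a batch of records.

    Counts missing values for required fields, exact-duplicate records,
    and crude outliers (numeric 'amount' values more than 3 standard
    deviations from the mean).
    """
    missing = sum(
        1 for r in records
        for f in required_fields
        if r.get(f) is None
    )
    # Duplicates: identical records appearing more than once
    keys = [tuple(sorted(r.items())) for r in records]
    duplicates = sum(c - 1 for c in Counter(keys).values() if c > 1)
    # Crude outlier check on one numeric field
    amounts = [r["amount"] for r in records if r.get("amount") is not None]
    mean = sum(amounts) / len(amounts)
    std = (sum((a - mean) ** 2 for a in amounts) / len(amounts)) ** 0.5
    outliers = [a for a in amounts if std and abs(a - mean) > 3 * std]
    return {"missing": missing, "duplicates": duplicates, "outliers": len(outliers)}
```

A real pipeline would run checks like this at ingestion time and alert when the counts spike, rather than after training fails.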
Cleaning data is necessary, but cleaning cannot recover information that was never collected. If training data lacks diversity, no amount of cleaning produces a model that generalizes to diverse inputs.
Labeling: The Human Bottleneck
Most supervised learning requires labeled data: inputs paired with ground truth outputs. For many tasks, labels cannot be collected automatically: they require human judgment. Labeling is the bottleneck that determines how fast you can improve your model and how accurate your ground truth is.
The labeling process:
- Task definition: Define what annotators should label and how. Ambiguous instructions lead to inconsistent labels.
- Annotator selection: Choose experts (doctors labeling medical scans) or crowdworkers (labeling objects in images). Experts are expensive and slow, crowdworkers are cheap and fast but less reliable.
- Annotation: Humans review data points and assign labels. This is tedious, time-consuming work.
- Quality control: Check inter-annotator agreement. If two annotators label the same image differently, the task is ambiguous or instructions are unclear.
- Review and iteration: Experts review crowdworker labels to catch errors.
Labeling is expensive. ImageNet cost $50,000+ in 2009 (leveraging Amazon Mechanical Turk at scale). Medical datasets require specialist physicians charging hundreds of dollars per hour. Legal document labeling requires lawyers. For most companies, labeling is the largest ML cost by far.
Label quality varies.
Inter-annotator agreement measures consistency: if two annotators label the same data, do they agree? For clear tasks like "Is this a cat?", agreement is high (~95%). For subjective tasks like "Is this comment toxic?", agreement is much lower (60-70%). Low agreement means the ground truth is ambiguous: the model cannot learn what "correct" means if humans disagree.
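Raw percent agreement overstates consistency, because two annotators agree by chance some of the time; agreement is usually reported chance-corrected, for example as Cohen's kappa. A minimal sketch for two annotators:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for
    the agreement expected by chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both independently pick each class
    classes = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates a well-defined task; a kappa far below the raw agreement rate signals that much of the apparent consistency is chance.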
Example: Medical diagnosis labeling
A chest X-ray is labeled by three radiologists:
- Radiologist A: "Pneumonia, left lower lobe"
- Radiologist B: "Possible infiltrate, unclear"
- Radiologist C: "No acute findings"
Which label is correct? All three are board-certified experts. The ground truth is not a physical fact but an expert judgment call. Models trained on this data inherit this ambiguity. If labels disagree, the model learns to predict the average label, which may not be what any individual expert would say.
Active learning reduces labeling costs by smartly selecting which data to label. Instead of labeling all data, the model identifies:
- Uncertain examples: Data points where the model is least confident
- Diverse examples: Data points that cover different parts of the input space
- Disagreement: Data points where multiple models disagree
By labeling these informative examples first, active learning achieves similar accuracy with 10x less labeled data. But it requires an initial model to decide what to label, creating a chicken-and-egg problem.
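Uncertainty sampling, the simplest of the selection strategies above, can be sketched as ranking unlabeled examples by the entropy of the model's predicted class distribution. This assumes predictions arrive as lists of class probabilities; it is an illustration, not a full active-learning loop:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution (0 = fully confident)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, budget):
    """Uncertainty sampling: return indices of the `budget` unlabeled
    examples whose predicted class distribution has the highest entropy."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:budget]
```

Diversity and disagreement-based selection replace the entropy score with distance-to-labeled-set or variance across an ensemble, but the skeleton (score, rank, take the top of the budget) is the same.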
Self-training and pseudo-labeling use the model to generate labels for unlabeled data, then retrain on this mixture. This works when the model is already good, but amplifies errors when the model is wrong. Pseudo-labels are cheaper than human labels but noisier.
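A minimal pseudo-labeling sketch: keep only predictions the model is confident about, which limits (but does not eliminate) the error amplification described above. The 0.95 threshold is an illustrative choice, not a recommendation:

```python
def pseudo_label(model_predict, unlabeled, threshold=0.95):
    """Generate (input, label) pairs from unlabeled data using the model's
    own high-confidence predictions.

    model_predict: function mapping an input to a list of class probabilities.
    Returns pairs whose top predicted probability meets the threshold.
    """
    pseudo = []
    for x in unlabeled:
        probs = model_predict(x)
        best = max(range(len(probs)), key=lambda i: probs[i])
        if probs[best] >= threshold:
            pseudo.append((x, best))
    return pseudo
```

The retraining set then mixes these pairs with human-labeled data; if the model's confident predictions are systematically wrong on some slice, the threshold does not protect you, which is the amplification risk.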
The labeling bottleneck is fundamental: supervised learning cannot outpace the rate at which humans can provide ground truth. This is why unsupervised learning (Chapter 22) and self-supervised learning (next-token prediction, Chapter 21) are so valuable: they learn without human labels.
Drift: When Reality Changes
Machine learning models assume that training data and deployment data come from the same distribution. This assumption is almost always false. The world changes. User behavior evolves. New products launch. Adversaries adapt. The model's training data becomes stale.
Data drift is the change in data distribution over time. There are two types:
Covariate shift: The input distribution P(x) changes, but the relationship P(y|x) stays the same.
Example: A fraud detection model trained on credit card transactions from 2020 is deployed in 2024. In 2020, most transactions were in-person; by 2024, most are online. The input distribution changed (more online transactions), but the fraud patterns are similar (phishing, stolen cards, fake merchants). The model sees a different mix of transaction types than it was trained on.
Concept drift: The relationship P(y|x) changes. The same input now has a different correct output.
Example: Fraudsters adapt. In 2020, they used technique A (carding). By 2024, they switched to technique B (account takeover). The model trained on technique A fails to detect technique B. The input might look similar (same transaction amounts, same purchase categories), but the fraud patterns are fundamentally different.
Covariate shift is easier to handle than concept drift. If only P(x) changes, you can sometimes reweight training data to match deployment, or retrain on recent data. If P(y|x) changes, your ground truth is wrong: you need new labels.
Detecting drift requires monitoring:
Statistical tests: Compare distributions of features in training vs production. KL divergence, Jensen-Shannon divergence, Kolmogorov-Smirnov test. If distributions diverge significantly, drift is occurring.
Model performance monitoring: Track accuracy, precision, recall on production data (requires labels for a sample). If performance degrades, either drift occurred or your model was never good.
Feature drift alerts: Monitor individual features for sudden changes. If "average transaction amount" shifts sharply from its historical range, something changed.
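The Kolmogorov-Smirnov test mentioned above compares the empirical distributions of a feature in training versus production. A dependency-free sketch of the two-sample KS statistic (in practice a library routine such as SciPy's `ks_2samp` would also supply a p-value):

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical CDFs of the two samples (0 = identical,
    1 = completely separated)."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

Running this per feature against a training-time reference sample, and alerting when the statistic exceeds a tuned threshold, is a common drift-monitoring pattern.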
Retraining strategies combat drift:
- Periodic retraining: Retrain every month, quarter, or year on fresh data
- Continuous learning: Update the model online as new data arrives (risky: bad data poisons the model)
- Triggered retraining: Detect drift, then retrain
- Ensemble over time: Maintain multiple models trained on different time periods, blend predictions
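A triggered-retraining policy can be as simple as combining a drift score with a staleness budget; the thresholds below are illustrative, not recommendations:

```python
def should_retrain(drift_score, days_since_training,
                   drift_threshold=0.1, max_staleness_days=90):
    """Triggered retraining policy: retrain when measured drift exceeds a
    threshold, or when the model exceeds a staleness budget regardless of
    measured drift (drift metrics can miss slow concept shift)."""
    return (drift_score > drift_threshold
            or days_since_training > max_staleness_days)
```

The staleness backstop matters because distribution tests only see P(x); concept drift in P(y|x) can pass every input-distribution check.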
Retraining is expensive (compute, labeling, testing, deployment), so companies balance freshness vs cost. Google Search retrains ranking models continuously. A medical diagnosis model might retrain once a year.
Example: COVID-19 and medical models
Many medical prediction models failed during COVID-19. Models trained on pre-pandemic data assumed normal hospital patient distributions. When COVID patients flooded hospitals, the patient mix changed dramatically. Symptoms, demographics, co-morbidities: all shifted. Models predicting ICU admission or mortality gave unreliable results because P(x) and P(y|x) both changed. Models had to be retrained urgently on pandemic data.
This is concept drift at crisis speed. The models were not wrong: they were trained on a world that no longer existed.
Feedback Loops: When Models Poison Data
The most insidious data problem is the feedback loop: the model's predictions influence what data is collected next, which influences the next model, which influences future data, creating a cycle that can amplify bias or degrade quality.
How feedback loops form:
- Model makes predictions in production
- Users react to predictions (clicks, purchases, actions)
- User reactions are logged as new training data
- Model is retrained on data that includes its own influence
- The cycle repeats
Feedback loops can be positive (model improves over time) or negative (model degrades, bias amplifies).
Example: YouTube recommendation feedback loop
YouTube's recommendation algorithm suggests videos. Users click suggested videos more than random videos (the algorithm works). Click data is logged as training data: "User watched video after it was recommended."
Next training iteration: Videos that were recommended get more clicks (because they were recommended), so they appear more engaging. The model learns "recommend videos that were previously recommended." This creates a rich-get-richer dynamic: popular videos get recommended more, gaining more clicks, getting recommended even more.
The feedback loop can amplify bias: if the model initially recommends conspiracy theories to a small subset of users, those users click, the model learns "these users like conspiracy theories," recommends more, users watch more, and the model doubles down. The data no longer reflects organic user preferences: it reflects algorithmically shaped preferences.
Breaking the feedback loop:
- Randomization: Occasionally show random content to collect unbiased interaction data
- Holdout sets: Reserve some users for non-personalized experiences to measure organic behavior
- Causal inference: Use techniques like inverse propensity weighting to estimate what would happen without the model
- Logging policies: Record why the model made each prediction (the recommendation reason), enabling analysis of bias
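Inverse propensity weighting, mentioned above, reweights logged outcomes by how likely the logging policy was to produce them, recovering an unbiased estimate of what a different policy would have seen. A minimal sketch of the estimator (context-free for brevity; real systems condition on the user and query):

```python
def ips_value(logs, target_prob):
    """Inverse propensity scoring: estimate the average reward a new policy
    would earn, using only logs collected under a randomized logging policy.

    logs: list of (action, logging_prob, reward), where logging_prob is the
          probability the logging policy chose that action.
    target_prob: maps an action to the new policy's probability of choosing it.
    """
    return sum(target_prob(a) * r / p for a, p, r in logs) / len(logs)
```

This only works if the logging policy recorded its propensities and gave every action nonzero probability, which is exactly why the randomization and logging-policy practices above matter.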
Example: Search engine click data
Search engines use click data to improve ranking. If users click result #3 more than #2, perhaps #3 should be ranked higher. But users click #1 most because it is ranked #1 (position bias). The model learns "rank popular results higher," which makes them more popular, which makes the model rank them higher.
Over time, the rich get richer: established websites dominate rankings because they have historical click data. New, high-quality sites struggle to break in. The data reflects not just relevance but past ranking decisions.
Self-fulfilling prophecies occur when models change reality to match their predictions:
Credit scoring: A model predicts someone is high-risk, they are denied credit, they cannot build credit history, confirming the model's prediction.
Recidivism prediction: A model predicts someone will re-offend, they receive harsher sentencing, longer imprisonment increases likelihood of re-offense, confirming the model's prediction.
Hiring tools: A model predicts someone will succeed, they are hired, they receive mentorship and opportunities, confirming the model's prediction. Someone predicted to fail is not hired, never gets the chance, and the model is never proven wrong.
In these cases, the model's prediction changes the outcome it is predicting. The data is no longer ground truth: it is model-influenced reality.
Figure 31.1: Data pipeline with feedback loop. The model makes predictions, users react to those predictions, reactions are logged as new data, and the model is retrained on model-influenced data. This cycle can amplify biases and create self-fulfilling prophecies. Drift monitoring and randomization help break the loop.
Engineering Takeaway
Data quality determines model quality: garbage in, garbage out. No amount of model tuning fixes bad data. The architecture that gains 2% accuracy on ImageNet is useless if your training data is missing 50% of the categories you care about. In production, data engineering matters far more than model engineering. The teams that win are the teams that build better data pipelines.
Labeling is the bottleneck in supervised learning: active learning helps, but cannot eliminate human judgment. Labels are expensive, slow, and inconsistent. For many tasks, ground truth is subjective (content moderation, aesthetic quality, medical diagnosis). Models inherit label ambiguity. Active learning reduces labeling costs by focusing on informative examples, but you still need humans to provide truth. The dream of unsupervised learning is a dream of escaping the labeling bottleneck.
Data drift is inevitable: production models must be retrained or adapted. The world changes faster than models can keep up. Fraud patterns evolve, user preferences shift, new products launch. A model trained on last year's data is out of date. Continuous monitoring detects drift before performance degrades. Retraining is not optional: it is the price of staying relevant. Companies that treat models as "deploy and forget" are companies whose models die slowly.
Feedback loops can amplify bias: monitor data carefully and break reinforcing cycles. When models influence the data they are trained on, self-reinforcing loops form. Bias amplifies, diversity decreases, and the data reflects algorithmic decisions rather than ground truth. Randomization, causal inference, and holdout sets help break loops. The most dangerous feedback loops are invisible: you do not see the counterfactual of what would have happened without the model.
Data versioning is essential: reproducibility requires knowing what data was used to train each model. When a model fails in production, you need to know: What data was it trained on? Has that data changed? Can you reproduce the training run? Without data versioning (like DVC, Git LFS, or custom solutions), debugging is impossible. Every model should have a lineage: this model was trained on dataset v3.2, using hyperparameters X, on date Y. Treating data like code is treating machine learning like engineering.
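One lightweight form of data versioning is a content fingerprint recorded alongside each trained model; the same records always hash to the same version identifier. A sketch (order-independent hashing of JSON-serializable records is one design choice among several):

```python
import hashlib
import json

def dataset_fingerprint(records):
    """Content hash of a dataset: the same records, in any order, produce
    the same fingerprint. Storing this with each trained model ties the
    model to the exact data it was trained on."""
    # Hash each record with canonical key ordering, then hash the sorted
    # list of per-record hashes so row order does not matter.
    row_hashes = sorted(
        hashlib.sha256(json.dumps(r, sort_keys=True).encode()).hexdigest()
        for r in records
    )
    return hashlib.sha256("".join(row_hashes).encode()).hexdigest()
```

Tools like DVC do this at file granularity with storage and remotes on top; the principle is the same: a model's lineage record should contain a hash, not a mutable path.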
Pipeline monitoring catches problems before models fail: monitor data quality, distribution shifts, and labeling consistency. Models fail because data fails. Monitoring model accuracy is reactive: you see the problem after users do. Monitoring data quality is proactive: you see the problem before it reaches the model. Check for: sudden spikes in missing values, distribution shifts in key features, changes in label distribution, annotation agreement rates. If the pipeline breaks, the model will fail. Fix the pipeline, not the model.
Why most ML teams spend 80% of time on data, not models: the data pipeline is the product. Researchers spend 80% of time on models. Practitioners spend 80% of time on data. Collecting, cleaning, labeling, versioning, monitoring: this is where the work is. The model is the easy part. The data pipeline is the hard part. If someone says "I built an ML system," they mean "I built a data pipeline and attached a model to it." The model is the cherry on top. The data is the ice cream.
References and Further Reading
"Everyone Wants to Do the Model Work, Not the Data Work": Data Cascades in High-Stakes AI. Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021). CHI 2021
Why it matters: This paper documented "data cascades": compounding events where problems in data create downstream failures that multiply over time. Based on interviews with ML practitioners across the world, it revealed that data problems (poor labeling, collection bias, documentation gaps) cause most production failures, not model architecture. The paper emphasizes that data work is undervalued and under-resourced compared to model work, despite being the primary determinant of success. It is a wake-up call that data engineering is the real challenge in ML.
Hidden Technical Debt in Machine Learning Systems Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). NIPS 2015
Why it matters: This Google paper introduced the concept of "technical debt" in ML systems, showing that the model is a tiny part of the system, surrounded by configuration, data collection, feature extraction, monitoring, and serving infrastructure. Data dependencies are highlighted as particularly insidious: unstable data sources, legacy features, and undeclared consumers create hidden coupling. Changing data breaks models in non-obvious ways. The paper argues that managing data pipelines is harder than managing code, and that most ML system complexity is in data, not models.
How to Avoid Machine Learning Pitfalls: A Guide for Academic Researchers Lones, M. A. (2021). arXiv:2108.02497
Why it matters: This guide, aimed at researchers, covers common ML pitfalls, many of which are data problems: leakage (test data contaminating training), selection bias (non-random splits), overfitting to test sets, and ignoring distribution shift. It emphasizes that many "SOTA results" in papers are artifacts of data problems, not genuine model improvements. The guide is a checklist for avoiding subtle data issues that invalidate results, making it essential reading for anyone working with ML in research or production.
The next chapter examines the fundamental divide between training and inference: why models trained offline must perform online, why latency matters more than accuracy in production, and how deployment transforms constraints.