Chapter 2: Data Is the New Physics

Why Models Don’t Discover Laws

Physics discovers laws. Newton’s law of gravitation, F = G·m₁m₂/r², is a compact mathematical statement that describes how all masses attract each other, everywhere, always. It’s universal, causal, and predictive. Once discovered, it applies to situations never observed before—the motion of distant planets, the trajectory of satellites, the formation of galaxies.

Machine learning does not work this way. Machine learning models do not discover laws—they approximate functions. A model trained to predict housing prices has learned a statistical relationship between features (square footage, location, bedrooms) and prices in the training data. It has not discovered the underlying economics of supply and demand, construction costs, or urban planning that actually determine prices. It has only learned the patterns that happened to occur in the data.

This distinction is fundamental. Physics explains why things happen. Machine learning predicts what will happen based on what has happened before. Physics seeks generalizable principles. Machine learning seeks predictive accuracy within a data distribution.

Consider a model predicting when a user will click on an ad. The model might learn that users on mobile devices in the evening are more likely to click. But it doesn’t know why—maybe people browse casually in the evening, maybe they’re bored, maybe the lighting makes screens more comfortable to read. The model just knows the correlation exists in the training data.

If the pattern changes—if a new app shifts user behavior, or if ad placement algorithms evolve—the model’s predictions degrade. It hasn’t learned a principle that transcends the data distribution. It has learned the data distribution itself.

The failure to discover universal laws manifests in systematic ways. A face recognition model trained predominantly on one demographic fails on others. It hasn’t learned universal principles of facial geometry—it has learned the statistical patterns in its training set. A language model trained on English text generates coherent English but fails on code-switched text mixing languages. The patterns it learned are dataset-specific, not universal.

Newton’s laws work on Earth and on Mars. A machine learning model trained on Earth data has no guarantee of working on Mars data. This isn’t a bug—it’s the fundamental nature of learning from data rather than discovering principles. Machine learning trades generality for practicality: we accept dataset-specific patterns because discovering universal laws is often impossible, and dataset-specific patterns are good enough when the dataset is representative.

This is why machine learning models require continuous retraining. The world changes. User behavior evolves. Market conditions shift. A model trained last year may be obsolete today, not because it was poorly designed, but because the patterns it learned no longer hold. Physics doesn’t have this problem—gravity hasn’t changed.

Why More Data Beats Better Code

In machine learning, the quality and quantity of data matter more than the sophistication of the algorithm. A simple model trained on a million examples will typically outperform a complex model trained on a thousand examples. This is counterintuitive for engineers trained to believe that better algorithms solve problems, but it’s one of the most empirically validated findings in machine learning.

Consider machine translation. Early systems used rule-based approaches: linguists hand-coded grammatical transformations between languages. These systems were sophisticated but brittle—they worked for the language pairs and constructions they were designed for, but failed on anything unexpected.

Then statistical machine translation emerged. Instead of rules, these systems learned from large corpora of translated text. They didn’t understand grammar or semantics—they just counted co-occurrence patterns. “Le chien” appears near “the dog” in aligned French-English texts, so the model learns to translate one to the other.
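The core of this counting approach can be sketched in a few lines. This is a toy illustration, not a real alignment model; the three-sentence corpus is invented:

```python
from collections import Counter

# Toy aligned corpus of (French, English) sentence pairs, invented for illustration.
corpus = [
    ("le chien dort", "the dog sleeps"),
    ("le chat dort", "the cat sleeps"),
    ("le chien mange", "the dog eats"),
]

# Count how often each French/English word pair co-occurs in aligned sentences.
cooc = Counter()
for fr, en in corpus:
    for f in fr.split():
        for e in en.split():
            cooc[(f, e)] += 1

print(cooc[("chien", "dog")])  # 2: "chien" appears alongside "dog" in both of its sentences
print(cooc[("chien", "cat")])  # 0: never aligned together
```

Real statistical MT systems refined such raw counts into alignment and translation probabilities, but the principle is the same: patterns extracted from parallel text, with no grammar rules anywhere.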

These statistical models, despite being algorithmically simpler, outperformed rule-based systems once enough parallel text became available. Google’s systems improved dramatically not by inventing new algorithms, but by using the entire web as training data. More data revealed more patterns—idioms, rare constructions, domain-specific terminology—that no linguist could have anticipated.

The same pattern repeats across domains. Speech recognition improved with more transcribed audio. Image classification improved with ImageNet’s millions of labeled images. Recommendation systems improved with more user interaction data. In each case, the algorithmic innovations mattered less than the scale of the data.

ImageNet illustrates this effect concretely. Released in 2009, ImageNet provided 1.2 million labeled images across 1,000 categories—orders of magnitude larger than previous datasets. This scale enabled convolutional neural networks to learn robust visual features that generalized across contexts. AlexNet’s 2012 breakthrough on ImageNet wasn’t primarily an algorithmic innovation—it was the combination of CNNs (known since the 1990s) with sufficient data and compute. The data unlocked the model’s potential.

Modern language models demonstrate the same scaling law. GPT-2 (2019) had 1.5 billion parameters and was trained on 40GB of text. GPT-3 (2020) had 175 billion parameters and was trained on 570GB of text. The algorithmic differences were minor—both were Transformer-based autoregressive models. The performance difference was dramatic: GPT-3 could perform tasks GPT-2 couldn’t, primarily because it had seen vastly more data.

The Chinchilla scaling law (2022) made this precise: optimal performance requires scaling model size and training data in equal proportion. If you quadruple your compute budget, you should roughly double both parameters and training tokens. Prior models had been undertrained—they had sufficient parameters but insufficient data. Chinchilla, with 70B parameters trained on 1.4 trillion tokens, outperformed much larger models trained on less data. The lesson: data matters as much as model capacity.
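As a back-of-the-envelope sketch, two rules of thumb associated with the Chinchilla result are that training compute is roughly C ≈ 6·N·D FLOPs and that a compute-optimal model sees roughly 20 tokens per parameter. Under those assumptions:

```python
import math

def chinchilla_optimal(compute_flops):
    """Approximate compute-optimal parameter count N and token count D.

    Assumes two rules of thumb from the Chinchilla paper:
      - training compute C ~ 6 * N * D FLOPs
      - compute-optimal data ratio D ~ 20 * N tokens
    Solving 6 * N * (20 * N) = C gives N = sqrt(C / 120).
    """
    n_params = math.sqrt(compute_flops / 120)
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Quadrupling compute roughly doubles both N and D:
n1, d1 = chinchilla_optimal(1e23)
n2, d2 = chinchilla_optimal(4e23)
print(n2 / n1, d2 / d1)  # both exactly 2.0 under these rules of thumb
```

The exact constants are approximations fitted from experiments, but the square-root relationship between compute and each of N and D is the paper's central finding.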

Why does more data help so much? Because data reduces uncertainty. With limited data, many functions fit the observations—you can’t tell which patterns are real and which are noise. With abundant data, the true patterns emerge consistently across examples, while noise averages out. The model can learn finer distinctions, rarer patterns, and more robust representations.

This has a practical implication: if your model isn’t performing well, your first instinct should be to get more data, not to tune hyperparameters or try a fancier algorithm. More data gives the model more signal to learn from. Algorithmic improvements offer diminishing returns compared to doubling or 10x-ing your training set.

Signal vs. Noise

Data is never perfect. Every dataset contains both signal—the true patterns you want to learn—and noise—random variations, measurement errors, and irrelevant correlations. Learning means extracting the signal while ignoring the noise. This is harder than it sounds.

Signal is the systematic, repeatable pattern. In housing price data, the signal includes relationships like “larger houses cost more” and “houses near parks command a premium.” These patterns hold across many examples and generalize to new data.

Noise is the random variation. Two identical houses might sell for slightly different prices depending on the buyer’s urgency, the season, negotiation skills, or luck. This variation is real but unpredictable—it can’t be learned because it doesn’t repeat.

The challenge is that models can’t automatically distinguish signal from noise. Both show up as patterns in the data. A model can easily overfit to noise if the noise happens to correlate with outcomes in the training set by chance. With limited data, spurious patterns look just as valid as real ones.

Consider a medical diagnosis model trained on 100 patients. Suppose 3 patients with rare last names happened to have the disease. The model might learn that certain last names predict the disease—not because of any biological mechanism, but because of random chance. With more data, this pattern would disappear (it’s noise), but with 100 examples, it looks like signal.
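A quick simulation shows how a meaningless feature can look predictive at n = 100 but wash out at scale. The feature rate and disease rate below are invented for illustration:

```python
import random

def feature_positive_rate(n, seed=0):
    """Disease rate among examples with a purely random, irrelevant feature."""
    rng = random.Random(seed)
    feature = [rng.random() < 0.03 for _ in range(n)]  # e.g. a "rare last name", 3% of people
    label = [rng.random() < 0.10 for _ in range(n)]    # disease, 10% base rate, independent
    with_feat = [l for f, l in zip(feature, label) if f]
    return sum(with_feat) / max(len(with_feat), 1)

# With 100 examples, only ~3 people carry the feature, so the observed
# rate among them can swing wildly away from the true 10% by pure chance:
print(feature_positive_rate(100))

# With 100,000 examples, the law of large numbers takes over:
print(feature_positive_rate(100_000))  # close to the true 10% base rate
```

The feature and label are generated independently, so any apparent relationship at small n is noise by construction. This is exactly the trap the model falls into: it cannot tell that the small-sample correlation was an accident.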

This is exacerbated by sampling bias—when your training data doesn’t represent the full population. Surveys suffer from this: people who respond to surveys differ systematically from those who don’t (response bias). Medical studies trained on clinical trial volunteers may not generalize to the broader patient population—volunteers tend to be healthier, more compliant, and from different demographics than typical patients.

Web scraping introduces sampling bias because the data you can scrape reflects who uses the platform and how they use it, not the broader population. A sentiment analysis model trained on Twitter data learns patterns from Twitter’s specific user demographics, which skew younger and more politically engaged than the general population. Applying this model to customer feedback from your product may fail because your customers have different communication styles.

If you train a hiring model on historical data from a company that predominantly hired from certain schools, the model learns that those schools predict success. But this might reflect historical hiring bias, not actual predictive value. The model can’t tell the difference between “people from School X perform better” (signal) and “people from School X were hired more often due to bias” (sampling artifact).

Selection bias occurs when your data collection process systematically excludes certain outcomes. A model predicting customer lifetime value trained only on customers who completed registration misses patterns about why people abandon registration. A credit scoring model trained only on approved loans (because you only observe default rates for loans you granted) systematically underestimates risk for marginal applications you rejected.

Label noise is another critical issue. Real-world labels are often ambiguous or inconsistent. In content moderation, different annotators disagree on whether content violates policies—what looks like hate speech to one person might look like political discourse to another. This inter-annotator disagreement means your training labels contain errors. The model will try to fit these errors as if they were signal.

Even with good labels, data quality issues degrade signal. Missing values (users who don’t fill in age fields), outliers (data entry errors like houses listed at $1), duplicates (the same example appearing multiple times, artificially inflating its importance), and inconsistencies (addresses formatted differently) all inject noise. Cleaning data to remove these issues is often the highest-leverage work in a machine learning project.
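A minimal cleaning pass over hypothetical housing listings illustrates the kinds of checks involved. The field names and price thresholds here are invented placeholders:

```python
def clean(records):
    """Minimal cleaning pass: drop duplicates, missing values, and obvious outliers."""
    seen, cleaned = set(), []
    for r in records:
        key = (r.get("address"), r.get("price"))
        if key in seen:
            continue  # duplicate listing: would artificially inflate its importance
        seen.add(key)
        if r.get("price") is None or r.get("sqft") is None:
            continue  # missing required field
        if not (10_000 <= r["price"] <= 50_000_000):
            continue  # data-entry outlier (e.g. the house listed at $1)
        cleaned.append(r)
    return cleaned

listings = [
    {"address": "1 Oak St", "price": 350_000, "sqft": 1400},
    {"address": "1 Oak St", "price": 350_000, "sqft": 1400},  # exact duplicate
    {"address": "2 Elm St", "price": 1, "sqft": 2000},        # $1 entry error
    {"address": "3 Ash St", "price": None, "sqft": 1200},     # missing price
]
print(clean(listings))  # keeps only the first Oak St record
```

Real pipelines add many more checks (format normalization, referential integrity, schema validation), but each one follows this shape: encode an assumption about valid data, then filter or fix what violates it.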

Measurement error is another source of noise. If your labels are incorrect—spam emails mislabeled as legitimate, or vice versa—the model learns to reproduce those errors. Garbage in, garbage out. Data quality matters more than data quantity if the data is systematically wrong.

The implication: clean, representative data is more valuable than vast amounts of noisy, biased data. You can’t learn signal that isn’t there. If your data is biased, your model will be biased. If your labels are wrong, your model will learn the wrong patterns. Data engineering—collecting, cleaning, and validating data—is often the most important part of building ML systems.

Irreducible Error

Some things cannot be predicted, no matter how much data you have or how sophisticated your model is. There is irreducible error—randomness inherent in the world that cannot be eliminated by better prediction.

Consider predicting tomorrow’s weather. Meteorology is a mature science with vast amounts of data, powerful models, and deep understanding of atmospheric physics. Yet forecasts beyond 7-10 days are unreliable. Why? Because weather is a chaotic system. Tiny differences in initial conditions—unmeasurable fluctuations in temperature or pressure—amplify over time, making long-term prediction impossible.

This isn’t a failure of modeling. It’s a fundamental property of the system. No amount of data will let you predict next month’s weather with certainty because the system is inherently unpredictable beyond a certain time horizon.

The same applies to many machine learning problems. Predicting which specific users will click on an ad is fundamentally uncertain—human behavior has random components that can’t be captured by features. Predicting whether a specific loan will default is uncertain—life events (job loss, illness, divorce) are unpredictable. Predicting stock prices is uncertain—markets reflect the collective unpredictability of millions of actors.

Flipping a fair coin has 50% irreducible error. No model can predict the outcome better than chance because the outcome is determined by physical randomness (exact force, air resistance, rotation) that can’t be measured precisely enough. Even if you could measure everything, quantum uncertainty imposes fundamental limits. Some processes are simply random.

In contrast, some prediction tasks have low irreducible error but high difficulty. Predicting whether an image contains a cat has low irreducible error—humans agree nearly 100% of the time, meaning the “true” label is well-defined. The challenge is extracting the signal (learning what “cat” means), not dealing with randomness. Predicting stock prices has high irreducible error—even perfect information about the past doesn’t determine the future because prices depend on future information and aggregate human decisions.

This concept is formalized as the Bayes error rate—the lowest possible error achievable by any predictor, even with infinite data and perfect knowledge of the data distribution. It represents the irreducible error inherent in the problem. If human experts disagree 20% of the time on whether a medical image shows a tumor, any model scored against those labels inherits that ambiguity—it cannot meaningfully do better than the ground truth that defines “correct.”

This irreducible error sets a ceiling on what models can achieve. If the best possible model can only predict with 80% accuracy due to inherent randomness, you won’t achieve 95% accuracy by trying harder. You need to accept that uncertainty is part of the problem.

The total error in a model’s predictions can be decomposed into three sources:

Total Error = Bias² + Variance + Irreducible Error
  • Bias: Error from incorrect assumptions in the model (e.g., assuming a linear relationship when it’s nonlinear).
  • Variance: Error from sensitivity to training data (e.g., overfitting to noise).
  • Irreducible error: Error from randomness in the true data-generating process.

You can reduce bias by using more flexible models. You can reduce variance by using more data or regularization. But you cannot reduce irreducible error. It’s a property of the problem, not the solution.
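A small simulation makes the decomposition concrete: even a model that knows the true function exactly still pays the noise floor, while a mis-specified model pays bias on top. The function and noise level here are invented for illustration:

```python
import random
import statistics

def f(x):
    return 2 * x + 1  # the true data-generating function (invented for this sketch)

rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(50_000)]
ys = [f(x) + rng.gauss(0, 3) for x in xs]  # additive noise: irreducible error, sigma = 3

# Even the *perfect* model (the true f) cannot beat the noise floor of sigma^2 = 9:
mse_perfect = statistics.fmean((y - f(x)) ** 2 for x, y in zip(xs, ys))

# A mis-specified model with the wrong slope adds squared bias on top of the same noise:
mse_biased = statistics.fmean((y - (x + 1)) ** 2 for x, y in zip(xs, ys))

print(round(mse_perfect, 1))      # close to 9.0, the irreducible error
print(mse_biased > mse_perfect)   # True: bias adds error that no data volume removes
```

No amount of extra data drives `mse_perfect` below 9 here, because 9 is a property of the data-generating process, not of the model.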

In practice, this means you must design systems that operate under uncertainty. Don’t expect perfect predictions. Build confidence intervals. Use probabilistic outputs rather than deterministic classifications. Route uncertain cases to human review. Accept that some errors are unavoidable and design your system to be robust to them.
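One concrete pattern is a confidence-banded router that acts automatically only when the model is sure and escalates the rest. The thresholds below are arbitrary placeholders; real values come from your error-cost tradeoffs:

```python
def route(prob_positive, auto_low=0.1, auto_high=0.9):
    """Route a probabilistic prediction: act automatically only when confident."""
    if prob_positive >= auto_high:
        return "auto-approve"
    if prob_positive <= auto_low:
        return "auto-reject"
    return "human-review"  # the uncertain middle band goes to a person

print(route(0.95))  # auto-approve
print(route(0.50))  # human-review
print(route(0.02))  # auto-reject
```

The width of the human-review band encodes how much irreducible uncertainty you are willing to absorb automatically versus pay humans to resolve.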

Engineering Takeaway

Data is the foundation of machine learning systems. The model is secondary. If you have good data, even a simple model will work. If your data is poor, no amount of algorithmic sophistication will save you.

Invest in data infrastructure and pipelines. Building pipelines to collect, store, version, and serve data is often more important than choosing algorithms. Companies with better data infrastructure—logging, tracking, labeling, versioning—build better models. Invest in instrumentation that captures ground truth labels, user feedback, and edge cases. Data versioning ensures reproducibility: you should be able to reproduce any model by knowing exactly what data it was trained on.

Prioritize data quality over quantity. A thousand clean, correctly labeled, representative examples are worth more than a million noisy, biased examples. Before scaling data collection, ensure your data is trustworthy. Check for label errors, sampling biases, and data leakage (where the model accidentally sees information it shouldn’t have at inference time). Audit your data for systematic errors—duplicates, missing values, outliers, inconsistencies—and clean them before training.

Monitor data distribution in production. If your training data comes from one distribution (e.g., users in the US) but you deploy to another (users in Europe), the model will underperform. Collect data from the same distribution you’ll deploy to. Deploy monitoring to detect distribution shift: track summary statistics of input features over time. When the model starts seeing data unlike its training set, retrain with recent data.
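A bare-bones drift check on a single numeric feature might compare a live batch's mean against the training mean. This is a deliberately crude sketch; production systems track many statistics per feature and use proper statistical tests:

```python
import statistics

def drift_alert(train_values, live_values, z_threshold=3.0):
    """Flag drift when a live batch's mean moves far from the training mean.

    Crude check: measure the gap in units of the standard error implied
    by the training-set spread and the live batch size.
    """
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values)
    live_mu = statistics.fmean(live_values)
    se = max(sigma, 1e-9) / (len(live_values) ** 0.5)
    return abs(live_mu - mu) / se > z_threshold

# Synthetic training distribution (invented for illustration):
train = [10 + 0.1 * (i % 7) for i in range(1000)]
print(drift_alert(train, train[:100]))                    # False: same distribution
print(drift_alert(train, [v + 5 for v in train[:100]]))   # True: inputs have shifted
```

When the alert fires, the remedy from the text applies: retrain on recent data drawn from the distribution the model is actually seeing.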

Use data augmentation to increase effective data size. If collecting more real data is expensive, augment existing data. For images: rotate, crop, adjust brightness. For text: synonym replacement, back-translation. For time series: add noise, warp time axis. Augmentation teaches the model invariances (a cat rotated is still a cat) and reduces overfitting without requiring new labeled examples.
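For time series, the "add noise" variant is nearly a one-liner. The noise level and copy count below are arbitrary choices for illustration:

```python
import random

def augment_series(series, n_copies=3, sigma=0.05, seed=0):
    """Create jittered copies of a time series (the simplest augmentation)."""
    rng = random.Random(seed)
    return [[x + rng.gauss(0, sigma) for x in series] for _ in range(n_copies)]

original = [1.0, 1.2, 0.9, 1.1]
for copy in augment_series(original):
    print(copy)  # same shape as the original, slightly perturbed
```

Each jittered copy is a new training example that shares the original's label, which teaches the model that small measurement noise should not change its prediction.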

Apply active learning when labeling is expensive. Don’t label data randomly. Use the model to identify examples it’s most uncertain about, then label those. This targets labeling effort where it reduces error most. Active learning can achieve the same performance with 10x less labeled data by focusing on the decision boundary rather than labeling redundant examples.
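Uncertainty sampling, the simplest form of active learning, is just a sort over predicted probabilities. The pool of probabilities below is made up:

```python
def most_uncertain(pool_probs, k=3):
    """Pick the k unlabeled examples whose predicted probability is closest to 0.5."""
    ranked = sorted(range(len(pool_probs)), key=lambda i: abs(pool_probs[i] - 0.5))
    return ranked[:k]

# Hypothetical model scores over an unlabeled pool:
probs = [0.02, 0.48, 0.97, 0.55, 0.10, 0.51]
print(most_uncertain(probs))  # [5, 1, 3]: the examples nearest the decision boundary
```

The confident examples (0.02, 0.97) would add little; labeling the ambiguous ones moves the decision boundary the most per label spent.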

Engineer features from domain knowledge. The raw data you have might not be the data your model needs. Transforming raw data into useful features—extracting time of day from timestamps, combining fields to create interaction terms, binning continuous variables—is often the difference between a mediocre model and a great one. Engineers who understand the domain can create features that make the signal clearer and reduce the model’s need to discover everything from scratch.
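A sketch of the idea on a hypothetical listing record, with invented field names:

```python
from datetime import datetime

def engineer(raw):
    """Turn a raw event record into model-ready features."""
    ts = datetime.fromisoformat(raw["timestamp"])
    return {
        "hour_of_day": ts.hour,                        # extracted from the timestamp
        "is_weekend": ts.weekday() >= 5,               # domain knowledge: behavior differs
        "price_per_sqft": raw["price"] / raw["sqft"],  # interaction of two raw fields
    }

print(engineer({"timestamp": "2024-06-01T19:30:00", "price": 350_000, "sqft": 1400}))
```

None of these features exist in the raw record, yet each makes a real-world pattern directly visible to the model instead of forcing it to rediscover the transformation from scratch.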

Default to “get more data” over “improve algorithm.” If your model isn’t performing well, your first move should be: can I get more training data? Can I label more examples? Can I collect more diverse examples to reduce bias? Algorithmic improvements have diminishing returns. Data improvements scale. A simple model on 10x data usually beats a sophisticated model on current data.

The lesson: Machine learning is not about finding clever algorithms. It’s about having good data and extracting its patterns reliably. The algorithm is just the tool. The data is the raw material. No tool can build something great from poor raw materials.


References and Further Reading

The Unreasonable Effectiveness of Data – Alon Halevy, Peter Norvig, Fernando Pereira (2009) https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf

This paper from Google researchers demonstrates that simple models trained on massive datasets outperform sophisticated models trained on small datasets. The authors provide evidence from machine translation, speech recognition, and other domains where scaling data—not improving algorithms—drove progress. Reading this will fundamentally shift how you prioritize work in machine learning projects.

The Lack of A Priori Distinctions Between Learning Algorithms – David Wolpert (1996) https://ti.arc.nasa.gov/m/profile/dhw/papers/78.pdf

This is the formal statement of the “No Free Lunch” theorem, which proves that no machine learning algorithm is universally better than any other across all possible problems. The implication: the quality of your data and how well it represents the problem matters more than which algorithm you choose. Algorithms matter only relative to specific problem structures. Understanding this prevents cargo-culting—copying algorithms that worked elsewhere without understanding whether they fit your data.

ImageNet: A Large-Scale Hierarchical Image Database – Jia Deng et al. (2009) https://www.image-net.org/static_files/papers/imagenet_cvpr09.pdf

ImageNet provided the first truly large-scale labeled image dataset (1.2 million images, 1,000 categories) and catalyzed the deep learning revolution. This paper demonstrates how dataset scale unlocks model capabilities—prior algorithms couldn’t leverage large data, and prior datasets couldn’t train powerful models. ImageNet bridged this gap and enabled CNNs to learn robust visual features. The ImageNet challenge drove five years of rapid progress in computer vision, showing how shared benchmark datasets accelerate research.