Chapter 2: Data Is the New Physics
Why Models Don't Discover Laws
Physics discovers laws. Newton's law of gravitation, F = Gm₁m₂/r², is a compact mathematical statement that describes how all masses attract each other, everywhere, always. It's universal, causal, and predictive. Once discovered, it applies to situations never observed before: the motion of distant planets, the trajectory of satellites, the formation of galaxies.
Machine learning does not work this way. Machine learning models do not discover laws; they approximate functions. A model trained to predict housing prices has learned a statistical relationship between features (square footage, location, bedrooms) and prices in the training data. It has not discovered the underlying economics of supply and demand, construction costs, or urban planning that actually determine prices. It has only learned the patterns that happened to occur in the data.
This distinction is fundamental. Physics explains why things happen. Machine learning predicts what will happen based on what has happened before. Physics seeks generalizable principles. Machine learning seeks predictive accuracy within a data distribution.
Consider a model predicting when a user will click on an ad. The model might learn that users on mobile devices in the evening are more likely to click. But it doesn't know why: maybe people browse casually in the evening, maybe they're bored, maybe the lighting makes screens more comfortable to read. The model just knows the correlation exists in the training data.
If the pattern changes (a new app shifts user behavior, or ad placement algorithms evolve), the model's predictions degrade. It hasn't learned a principle that transcends the data distribution. It has learned the data distribution itself.
The failure to discover universal laws manifests in systematic ways. A face recognition model trained predominantly on one demographic fails on others. It hasn't learned universal principles of facial geometry; it has learned the statistical patterns in its training set. A language model trained on English text generates coherent English but fails on code-switched text that mixes languages. The patterns it learned are dataset-specific, not universal.
Newton's laws work on Earth and on Mars. A machine learning model trained on Earth data has no guarantee of working on Mars data. This isn't a bug; it's the fundamental nature of learning from data rather than discovering principles. Machine learning trades generality for practicality: we accept dataset-specific patterns because discovering universal laws is often impossible, and dataset-specific patterns are good enough when the dataset is representative.
This is why machine learning models require continuous retraining. The world changes. User behavior evolves. Market conditions shift. A model trained last year may be obsolete today, not because it was poorly designed, but because the patterns it learned no longer hold. Physics doesn't have this problem; gravity hasn't changed.
Why More Data Beats Better Code
In machine learning, the quality and quantity of data matter more than the sophistication of the algorithm. A simple model trained on a million examples will typically outperform a complex model trained on a thousand examples. This is counterintuitive for engineers trained to believe that better algorithms solve problems, but it's one of the most empirically validated findings in machine learning.
Consider machine translation. Early systems used rule-based approaches: linguists hand-coded grammatical transformations between languages. These systems were sophisticated but brittle: they worked for the language pairs and constructions they were designed for, but failed on anything unexpected.
Then statistical machine translation emerged. Instead of rules, these systems learned from large corpora of translated text. They didn't understand grammar or semantics; they just counted co-occurrence patterns. "Le chien" appears near "the dog" in aligned French-English texts, so the model learns to translate one to the other.
These statistical models, despite being algorithmically simpler, outperformed rule-based systems once enough parallel text became available. Google's systems improved dramatically not by inventing new algorithms, but by using the entire web as training data. More data revealed more patterns (idioms, rare constructions, domain-specific terminology) that no linguist could have anticipated.
The same pattern repeats across domains. Speech recognition improved with more transcribed audio. Image classification improved with ImageNet's millions of labeled images. Recommendation systems improved with more user interaction data. In each case, the algorithmic innovations mattered less than the scale of the data.
ImageNet illustrates this effect concretely. Released in 2009, ImageNet provided 1.2 million labeled images across 1,000 categories, orders of magnitude larger than previous datasets. This scale enabled convolutional neural networks to learn robust visual features that generalized across contexts. AlexNet's 2012 breakthrough on ImageNet wasn't primarily an algorithmic innovation; it was the combination of CNNs (known since the 1990s) with sufficient data and compute. The data unlocked the model's potential.
Modern language models demonstrate the same scaling law. GPT-2 (2019) had 1.5 billion parameters and was trained on 40GB of text. GPT-3 (2020) had 175 billion parameters and was trained on 570GB of text. The algorithmic differences were minor: both were Transformer-based autoregressive models. The performance difference was dramatic: GPT-3 could perform tasks GPT-2 couldn't, primarily because it had seen vastly more data.
The Chinchilla scaling law (2022) made this precise: compute-optimal training scales model size and training data in equal proportion. Quadruple your compute budget, and you should roughly double both parameters and training tokens. Prior models had been undertrained: they had sufficient parameters but insufficient data. Chinchilla, with 70B parameters trained on 1.4 trillion tokens, outperformed much larger models trained on less data. The lesson: data matters as much as model capacity.
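The rule follows from the approximation that training compute is C ≈ 6ND for N parameters and D tokens, with the compute-optimal N and D each growing as √C. A sketch of the arithmetic, anchored to the published Chinchilla configuration (the scaling function here is an illustration of the proportional rule, not the paper's fitted formula):

```python
import math

def chinchilla_optimal(compute, ref_params=70e9, ref_tokens=1.4e12):
    """Scale the published Chinchilla point (70B params, 1.4T tokens)
    to a new compute budget, keeping N and D in equal proportion.
    Uses the approximation C ~ 6 * N * D."""
    ref_compute = 6 * ref_params * ref_tokens
    scale = math.sqrt(compute / ref_compute)  # N and D each grow as sqrt(C)
    return ref_params * scale, ref_tokens * scale

# Quadrupling compute doubles both parameters and tokens.
base = 6 * 70e9 * 1.4e12
n, d = chinchilla_optimal(4 * base)
print(n / 70e9, d / 1.4e12)  # both ratios are 2.0
```

Note that doubling both N and D quadruples C, which is exactly why "double the compute, double everything" cannot be the rule.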
Why does more data help so much? Because data reduces uncertainty. With limited data, many functions fit the observations; you can't tell which patterns are real and which are noise. With abundant data, the true patterns emerge consistently across examples, while noise averages out. The model can learn finer distinctions, rarer patterns, and more robust representations.
This has a practical implication: if your model isn't performing well, your first instinct should be to get more data, not to tune hyperparameters or try a fancier algorithm. More data gives the model more signal to learn from. Algorithmic improvements offer diminishing returns compared to doubling or 10x-ing your training set.
Noise vs Signal
Data is never perfect. Every dataset contains both signal (the true patterns you want to learn) and noise (random variations, measurement errors, and irrelevant correlations). Learning means extracting the signal while ignoring the noise. This is harder than it sounds.
Signal is the systematic, repeatable pattern. In housing price data, the signal includes relationships like "larger houses cost more" and "houses near parks command a premium." These patterns hold across many examples and generalize to new data.
Noise is the random variation. Two identical houses might sell for slightly different prices depending on the buyer's urgency, the season, negotiation skills, or luck. This variation is real but unpredictable: it can't be learned because it doesn't repeat.
The challenge is that models can't automatically distinguish signal from noise. Both show up as patterns in the data. A model can easily overfit to noise if the noise happens to correlate with outcomes in the training set by chance. With limited data, spurious patterns look just as valid as real ones.
Consider a medical diagnosis model trained on 100 patients. Suppose 3 patients with rare last names happened to have the disease. The model might learn that certain last names predict the disease, not because of any biological mechanism, but because of random chance. With more data, this pattern would disappear (it's noise), but with 100 examples, it looks like signal.
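This effect is easy to reproduce in simulation. In the sketch below (all rates invented for illustration), a rare binary feature is generated independently of the label, so it is pure noise by construction; yet at n = 100 it shows a sizable apparent effect that shrinks as n grows:

```python
import random

def mean_spurious_effect(n, trials=100):
    """Average |P(disease | rare feature) - P(disease)| when the feature
    is statistically independent of the disease (pure noise)."""
    total = 0.0
    for seed in range(trials):
        rng = random.Random(seed)
        feature = [rng.random() < 0.10 for _ in range(n)]  # e.g. a rare surname
        disease = [rng.random() < 0.05 for _ in range(n)]  # independent of feature
        with_f = [d for f, d in zip(feature, disease) if f]
        if with_f:
            total += abs(sum(with_f) / len(with_f) - sum(disease) / n)
    return total / trials

small = mean_spurious_effect(100)
large = mean_spurious_effect(10_000)
print(small, large)  # the chance "effect" shrinks markedly as n grows
```

Averaging over many random datasets shows the apparent correlation is a small-sample artifact, which is exactly what a single 100-patient dataset cannot tell you.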
This is exacerbated by sampling bias: your training data failing to represent the full population. Surveys suffer from this: people who respond to surveys differ systematically from those who don't (response bias). Medical models trained on clinical trial volunteers may not generalize to the broader patient population; volunteers tend to be healthier, more compliant, and from different demographics than typical patients.
Web scraping introduces sampling bias because the data you can scrape reflects who uses the platform and how they use it, not the broader population. A sentiment analysis model trained on Twitter data learns patterns from Twitter's specific user demographics, which skew younger and more politically engaged than the general population. Applying this model to customer feedback from your product may fail because your customers have different communication styles.
If you train a hiring model on historical data from a company that predominantly hired from certain schools, the model learns that those schools predict success. But this might reflect historical hiring bias, not actual predictive value. The model can't tell the difference between "people from School X perform better" (signal) and "people from School X were hired more often due to bias" (sampling artifact).
Selection bias occurs when your data collection process systematically excludes certain outcomes. A model predicting customer lifetime value trained only on customers who completed registration misses patterns about why people abandon registration. A credit scoring model trained only on approved loans (because you only observe default rates for loans you granted) systematically underestimates risk for marginal applications you rejected.
Label noise is another critical issue. Real-world labels are often ambiguous or inconsistent. In content moderation, different annotators disagree on whether content violates policies; what looks like hate speech to one person might look like political discourse to another. This inter-annotator disagreement means your training labels contain errors. The model will try to fit these errors as if they were signal.
Even with good labels, data quality issues degrade signal. Missing values (users who don't fill in age fields), outliers (data entry errors like houses listed at $1), duplicates (the same example appearing multiple times, artificially inflating its importance), and inconsistencies (addresses formatted differently) all inject noise. Cleaning data to remove these issues is often the highest-leverage work in a machine learning project.
Measurement error is another source of noise. If your labels are incorrect (spam emails mislabeled as legitimate, or vice versa), the model learns to reproduce those errors. Garbage in, garbage out. Data quality matters more than data quantity if the data is systematically wrong.
The implication: clean, representative data is more valuable than vast amounts of noisy, biased data. You can't learn signal that isn't there. If your data is biased, your model will be biased. If your labels are wrong, your model will learn the wrong patterns. Data engineering (collecting, cleaning, and validating data) is often the most important part of building ML systems.
Irreducible Error
Some things cannot be predicted, no matter how much data you have or how sophisticated your model is. There is irreducible error: randomness inherent in the world that cannot be eliminated by better prediction.
Consider predicting tomorrow's weather. Meteorology is a mature science with vast amounts of data, powerful models, and deep understanding of atmospheric physics. Yet forecasts beyond 7-10 days are unreliable. Why? Because weather is a chaotic system. Tiny differences in initial conditions (unmeasurable fluctuations in temperature or pressure) amplify over time, making long-term prediction impossible.
This isn't a failure of modeling. It's a fundamental property of the system. No amount of data will let you predict next month's weather with certainty because the system is inherently unpredictable beyond a certain time horizon.
The same applies to many machine learning problems. Predicting which specific users will click on an ad is fundamentally uncertain; human behavior has random components that can't be captured by features. Predicting whether a specific loan will default is uncertain; life events (job loss, illness, divorce) are unpredictable. Predicting stock prices is uncertain; markets reflect the collective unpredictability of millions of actors.
Flipping a fair coin has 50% irreducible error. No model can predict the outcome better than chance because the outcome is determined by physical randomness (exact force, air resistance, rotation) that can't be measured precisely enough. Even if you could measure everything, quantum uncertainty imposes fundamental limits. Some processes are simply random.
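The coin-flip ceiling is easy to verify empirically. This minimal simulation pits a majority-vote predictor against a simulated fair coin; any other feature-free strategy fares the same:

```python
import random

# Simulate a fair coin: each outcome is independent of all history.
rng = random.Random(42)
flips = [rng.randint(0, 1) for _ in range(100_000)]

# "Model": always predict the majority outcome seen so far.
correct = 0
heads = 0
for i, outcome in enumerate(flips):
    prediction = 1 if heads * 2 > i else 0
    correct += (prediction == outcome)
    heads += outcome

accuracy = correct / len(flips)
print(accuracy)  # hovers near 0.5: the Bayes error rate for this problem
```

No amount of history helps, because each flip is independent of everything the predictor has observed.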
In contrast, some prediction tasks have low irreducible error but high difficulty. Predicting whether an image contains a cat has low irreducible error: humans agree nearly 100% of the time, meaning the "true" label is well-defined. The challenge is extracting the signal (learning what "cat" means), not dealing with randomness. Predicting stock prices has high irreducible error: even perfect information about the past doesn't determine the future, because prices depend on future information and aggregate human decisions.
This concept is formalized as the Bayes error rate: the lowest possible error achievable by any predictor, even with infinite data and perfect knowledge of the data distribution. It represents the irreducible error inherent in the problem. If human experts disagree 20% of the time on whether a medical image shows a tumor, the Bayes error rate is at least 20%; no model can do better than the ground truth that defines "correct."
This irreducible error sets a ceiling on what models can achieve. If the best possible model can only predict with 80% accuracy due to inherent randomness, you won't achieve 95% accuracy by trying harder. You need to accept that uncertainty is part of the problem.
The total error in a model's predictions can be decomposed into three sources:
- Bias: Error from incorrect assumptions in the model (e.g., assuming a linear relationship when it's nonlinear).
- Variance: Error from sensitivity to training data (e.g., overfitting to noise).
- Irreducible error: Error from randomness in the true data-generating process.
You can reduce bias by using more flexible models. You can reduce variance by using more data or regularization. But you cannot reduce irreducible error. It's a property of the problem, not the solution.
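The decomposition can be checked numerically. The sketch below (all numbers illustrative) uses a deliberately biased estimator, the sample mean of y as a constant predictor for a quadratic target, and estimates each term at one test point by refitting on many fresh training sets:

```python
import random

def f(x):            # true function
    return x * x

SIGMA = 0.5          # std of irreducible noise
X0 = 0.8             # test point

rng = random.Random(0)
preds = []
for _ in range(20_000):
    # Fresh training set each round: x uniform on [0, 1], noisy y.
    xs = [rng.random() for _ in range(20)]
    ys = [f(x) + rng.gauss(0, SIGMA) for x in xs]
    preds.append(sum(ys) / len(ys))   # constant predictor: mean of y

mean_pred = sum(preds) / len(preds)
bias_sq = (mean_pred - f(X0)) ** 2
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
total = bias_sq + variance + SIGMA ** 2

# Empirical MSE at X0 over fresh noisy labels matches the sum of the terms.
rng2 = random.Random(1)
mse = sum((f(X0) + rng2.gauss(0, SIGMA) - p) ** 2 for p in preds) / len(preds)
print(bias_sq, variance, SIGMA ** 2, total, mse)
```

For this setup the bias term dominates (the constant model badly underfits a quadratic), and no choice of model touches the SIGMA² term.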
In practice, this means you must design systems that operate under uncertainty. Don't expect perfect predictions. Build confidence intervals. Use probabilistic outputs rather than deterministic classifications. Route uncertain cases to human review. Accept that some errors are unavoidable and design your system to be robust to them.
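One concrete pattern for operating under uncertainty is a threshold-based router. A minimal sketch (the band edges are hypothetical and would be tuned per application):

```python
def route(p_positive, low=0.2, high=0.8):
    """Turn a probabilistic model output into a decision or an escalation.
    Confident predictions are acted on; the uncertain middle band goes
    to a human reviewer."""
    if p_positive >= high:
        return "auto_approve"
    if p_positive <= low:
        return "auto_reject"
    return "human_review"

decisions = [route(p) for p in (0.95, 0.55, 0.07)]
print(decisions)  # ['auto_approve', 'human_review', 'auto_reject']
```

Widening the middle band trades automation rate for error rate, which is a product decision, not a modeling one.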
Engineering Takeaway
Data is the foundation of machine learning systems. The model is secondary. If you have good data, even a simple model will work. If your data is poor, no amount of algorithmic sophistication will save you.
Invest in data infrastructure and pipelines. Building pipelines to collect, store, version, and serve data is often more important than choosing algorithms. Companies with better data infrastructure (logging, tracking, labeling, versioning) build better models. Invest in instrumentation that captures ground truth labels, user feedback, and edge cases. Data versioning ensures reproducibility: you should be able to reproduce any model by knowing exactly what data it was trained on.
Prioritize data quality over quantity. A thousand clean, correctly labeled, representative examples are worth more than a million noisy, biased examples. Before scaling data collection, ensure your data is trustworthy. Check for label errors, sampling biases, and data leakage (where the model accidentally sees information it shouldn't have at inference time). Audit your data for systematic errors (duplicates, missing values, outliers, inconsistencies) and clean them before training.
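A first-pass audit can be a few dozen lines. The sketch below (field names and plausibility bounds are invented for illustration) flags duplicates, missing values, and outliers in housing records before they reach training:

```python
def audit(rows):
    """Return (row_index, issue) pairs for common data-quality problems."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        key = (row.get("address"), row.get("sale_date"))
        if key in seen:
            issues.append((i, "duplicate"))
        seen.add(key)
        if row.get("sqft") is None:
            issues.append((i, "missing sqft"))
        price = row.get("price")
        if price is not None and not (10_000 <= price <= 50_000_000):
            issues.append((i, "outlier price"))  # e.g. a $1 data-entry error
    return issues

rows = [
    {"address": "1 Elm St", "sale_date": "2024-01-05", "sqft": 900, "price": 250_000},
    {"address": "1 Elm St", "sale_date": "2024-01-05", "sqft": 900, "price": 250_000},
    {"address": "2 Oak Ave", "sale_date": "2024-02-01", "sqft": None, "price": 1},
]
print(audit(rows))
```

Running checks like these as a pipeline step, rather than a one-off notebook, is what keeps quality from regressing as the dataset grows.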
Monitor data distribution in production. If your training data comes from one distribution (e.g., users in the US) but you deploy to another (users in Europe), the model will underperform. Collect data from the same distribution you'll deploy to. Deploy monitoring to detect distribution shift: track summary statistics of input features over time. When the model starts seeing data unlike its training set, retrain with recent data.
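A minimal drift monitor needs only summary statistics. This sketch (thresholds and values illustrative) applies a crude z-test of the production mean against the training baseline; production systems typically use richer tests such as population stability index or Kolmogorov-Smirnov:

```python
import statistics

def drift_alert(train_values, prod_values, threshold=3.0):
    """Flag when the production mean drifts more than `threshold`
    standard errors from the training mean."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    se = sigma / (len(prod_values) ** 0.5)
    z = abs(statistics.mean(prod_values) - mu) / se
    return z > threshold

train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8, 10.1, 10.4]   # baseline feature values
same = [10.1, 9.9, 10.3, 10.0]                            # production, no shift
shifted = [14.8, 15.2, 15.1, 14.9]                        # production, shifted
print(drift_alert(train, same), drift_alert(train, shifted))
```

The alert feeds a retraining trigger: when it fires, pull recent production data, relabel if needed, and retrain.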
Use data augmentation to increase effective data size. If collecting more real data is expensive, augment existing data. For images: rotate, crop, adjust brightness. For text: synonym replacement, back-translation. For time series: add noise, warp time axis. Augmentation teaches the model invariances (a cat rotated is still a cat) and reduces overfitting without requiring new labeled examples.
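For text, the simplest augmentation is synonym replacement. A toy sketch (the synonym table is invented; real systems use a lexicon such as WordNet, or back-translation):

```python
# Hypothetical synonym table; a real system would use a proper lexicon.
SYNONYMS = {"great": ["excellent", "superb"], "bad": ["poor", "awful"]}

def augment(sentence):
    """Yield variants of `sentence` with one word swapped for a synonym,
    preserving the original label."""
    words = sentence.split()
    for i, w in enumerate(words):
        for syn in SYNONYMS.get(w.lower(), []):
            yield " ".join(words[:i] + [syn] + words[i + 1:])

variants = list(augment("the service was great"))
print(variants)  # two extra labeled examples from one original
```

Each variant inherits the original's label, so one annotation buys several training examples.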
Apply active learning when labeling is expensive. Don't label data randomly. Use the model to identify examples it's most uncertain about, then label those. This targets labeling effort where it reduces error most. Active learning can achieve the same performance with 10x less labeled data by focusing on the decision boundary rather than labeling redundant examples.
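Uncertainty sampling, the most common active-learning strategy, can be sketched in a few lines: rank unlabeled examples by how close the model's predicted probability is to 0.5 and send the top of the list to annotators (the scores below are made up):

```python
def select_for_labeling(probs, budget=2):
    """Return indices of the `budget` examples the model is least sure
    about, i.e. predicted probability closest to 0.5."""
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:budget]

model_probs = [0.97, 0.51, 0.03, 0.45, 0.88]  # hypothetical model outputs
print(select_for_labeling(model_probs))  # [1, 3]: the near-boundary cases
```

Examples the model already classifies with high confidence (0.97, 0.03) add little, so the labeling budget goes to the boundary instead.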
Engineer features from domain knowledge. The raw data you have might not be the data your model needs. Transforming raw data into useful features (extracting time of day from timestamps, combining fields to create interaction terms, binning continuous variables) is often the difference between a mediocre model and a great one. Engineers who understand the domain can create features that make the signal clearer and reduce the model's need to discover everything from scratch.
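A sketch of this kind of transformation (field names and values are hypothetical):

```python
from datetime import datetime

def make_features(ts_iso, price, sqft):
    """Derive model-ready features from raw fields."""
    ts = datetime.fromisoformat(ts_iso)
    return {
        "hour": ts.hour,                      # time-of-day effect
        "is_weekend": ts.weekday() >= 5,      # Saturday/Sunday flag
        "price_per_sqft": price / sqft,       # interaction term
    }

feats = make_features("2024-06-01T19:30:00", 450_000, 1_500)
print(feats)
```

Each derived feature encodes a domain hypothesis (evening browsing behaves differently, weekends differ from weekdays, price scales with area), so the model does not have to rediscover it from raw values.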
Default to "get more data" over "improve algorithm." If your model isn't performing well, your first move should be: can I get more training data? Can I label more examples? Can I collect more diverse examples to reduce bias? Algorithmic improvements have diminishing returns. Data improvements scale. A simple model on 10x data usually beats a sophisticated model on current data.
The lesson: Machine learning is not about finding clever algorithms. It's about having good data and extracting its patterns reliably. The algorithm is just the tool. The data is the raw material. No tool can build something great from poor raw materials.
References and Further Reading
The Unreasonable Effectiveness of Data, by Alon Halevy, Peter Norvig, and Fernando Pereira (2009). https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35179.pdf
This paper from Google researchers demonstrates that simple models trained on massive datasets outperform sophisticated models trained on small datasets. The authors provide evidence from machine translation, speech recognition, and other domains where scaling data, not improving algorithms, drove progress. Reading this will fundamentally shift how you prioritize work in machine learning projects.
The Lack of A Priori Distinctions Between Learning Algorithms, by David Wolpert (1996). https://ti.arc.nasa.gov/m/profile/dhw/papers/78.pdf
This is the formal statement of the "No Free Lunch" theorem, which proves that no machine learning algorithm is universally better than any other across all possible problems. The implication: the quality of your data and how well it represents the problem matters more than which algorithm you choose. Algorithms matter only relative to specific problem structures. Understanding this prevents cargo-culting: copying algorithms that worked elsewhere without understanding whether they fit your data.
ImageNet: A Large-Scale Hierarchical Image Database, by Jia Deng et al. (2009). https://www.image-net.org/static_files/papers/imagenet_cvpr09.pdf
ImageNet provided the first truly large-scale labeled image dataset (1.2 million images, 1,000 categories) and catalyzed the deep learning revolution. This paper demonstrates how dataset scale unlocks model capabilities: prior algorithms couldn't leverage large data, and prior datasets couldn't train powerful models. ImageNet bridged this gap and enabled CNNs to learn robust visual features. The ImageNet challenge drove five years of rapid progress in computer vision, showing how shared benchmark datasets accelerate research.