Chapter 5: Features: How Machines See the World

Why Raw Data Is Unusable

Machine learning models do not operate on raw reality. They operate on numbers—vectors of features that represent reality in a form the model can process. The quality of these features determines the ceiling on what the model can learn. No amount of algorithmic sophistication can compensate for poor features.

Consider image classification. An image is a grid of pixels, each with RGB color values. For a 256×256 image, that’s 196,608 numbers (256 × 256 pixels × 3 color channels). But these numbers, as presented, encode very little about what the image contains. Pixel (142, 87) being red tells you almost nothing about whether the image contains a dog or a cat. The information is there, but it’s not accessible to simple models.

A linear model cannot learn directly from pixels. It would need to learn that certain patterns of pixel values (shapes, textures, edges) correspond to object categories. But pixels don’t encode shapes—they encode colors at specific coordinates. The relationship between “this pixel is red” and “this image contains a dog” is extraordinarily complex, involving global spatial structure, lighting, perspective, and occlusion.

The same problem occurs with text. A sentence is a sequence of characters or words. But character codes (“a” = 97, “b” = 98) tell you nothing about meaning. The model needs features that capture semantics: “dog” and “puppy” are similar; “good” and “bad” are opposites. Raw token IDs don’t encode these relationships.

Audio faces similar challenges. A waveform is a sequence of amplitude values over time. But to recognize speech, the model needs features representing phonemes, intonation, speaker characteristics—not raw pressure measurements. Speech recognition systems typically convert waveforms to spectrograms (time-frequency representations) that make phonetic patterns visible. Raw audio is a 1D signal; spectrograms are 2D images where patterns (vowels, consonants, intonation) form recognizable shapes.
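
The spectrogram idea can be sketched with a short-time Fourier transform. A minimal NumPy version (the window length, hop size, and the synthetic test tone are illustrative choices, not production settings):

```python
import numpy as np

def spectrogram(signal, window=256, hop=128):
    """Naive short-time Fourier transform magnitude (a time-frequency grid)."""
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        frame = signal[start:start + window] * np.hanning(window)  # taper the frame
        frames.append(np.abs(np.fft.rfft(frame)))  # magnitude spectrum of one frame
    return np.array(frames)  # shape: (num_frames, window // 2 + 1)

# One second of a 440 Hz tone sampled at 8 kHz
t = np.linspace(0, 1, 8000, endpoint=False)
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (61, 129)
```

Each row is the magnitude spectrum of one short frame; a steady 440 Hz tone shows up as a bright horizontal band near frequency bin 14 (bin spacing is 8000 / 256 = 31.25 Hz). This is the 2D representation in which vowels and consonants form visible shapes.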

Time series data presents a related problem. Raw sensor readings—temperature, pressure, acceleration—capture instantaneous values. But patterns often emerge over windows of time: daily cycles, weekly trends, anomalies relative to baselines. Models need features that aggregate and compare across time: moving averages, variance over windows, autocorrelations, trend directions.
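
Window-based features like these are straightforward to compute. A minimal NumPy sketch (the window size, the chosen statistics, and the synthetic temperature series are illustrative):

```python
import numpy as np

def window_features(series, window=24):
    """Summarize a raw series with rolling statistics over a fixed window."""
    s = np.asarray(series, dtype=float)
    feats = []
    for end in range(window, len(s) + 1):
        w = s[end - window:end]
        feats.append({
            "mean": w.mean(),                 # moving average (baseline level)
            "std": w.std(),                   # volatility within the window
            "trend": w[-1] - w[0],            # crude trend direction
            "z_last": (w[-1] - w.mean()) / (w.std() + 1e-9),  # anomaly score
        })
    return feats

# Hourly temperatures with a daily cycle, plus one injected anomalous spike
t = np.arange(24 * 7)
temps = 15 + 5 * np.sin(2 * np.pi * t / 24)
temps[100] += 12  # the anomaly
feats = window_features(temps, window=24)
```

The spike is invisible as a raw reading (27 °C is a plausible temperature) but obvious as a feature: its z-score relative to the trailing window is far outside the normal range.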

High-dimensional raw data also suffers from the curse of dimensionality. As dimensions increase, data becomes sparse—most of the space is empty, and examples are far apart. A linear model in 196,608 dimensions (pixel space) has so many degrees of freedom that it easily overfits. Feature engineering reduces dimensionality by extracting the relevant structure, making data denser and more learnable.

This is why feature engineering dominated classical machine learning. Practitioners spent most of their effort transforming raw data into representations that made patterns obvious. This transformation—from raw data to features—is where learning actually begins.

What a Feature Is

A feature is a measurable property of the data that is relevant to the prediction task. Features translate raw data into a representation where patterns are accessible to models. Good features make learning easy. Bad features make learning impossible, no matter how sophisticated the model.

For structured data—tables with columns—features are often given: age, income, purchase history. But even here, feature engineering improves performance. You might create:

  • Derived features: age² (to capture nonlinear effects), income/debt ratio
  • Interaction features: age × income (different effects at different life stages)
  • Temporal features: days since last purchase, purchase frequency
  • Aggregations: average purchase value over the last 30 days
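
A sketch of such derived features in plain Python (the field names and the reference date are hypothetical):

```python
from datetime import date

def derive_features(customer, today=date(2024, 6, 1)):
    """Build derived, interaction, temporal, and aggregate features from raw fields."""
    age = customer["age"]
    income = customer["income"]
    debt = customer["debt"]
    purchases = customer["purchases"]  # list of (date, amount) pairs
    recent = [amt for d, amt in purchases if (today - d).days <= 30]
    return {
        "age_squared": age ** 2,                     # captures nonlinear age effects
        "income_debt_ratio": income / max(debt, 1),  # guard against zero debt
        "age_x_income": age * income,                # interaction feature
        "days_since_last_purchase": (today - max(d for d, _ in purchases)).days,
        "avg_purchase_30d": sum(recent) / len(recent) if recent else 0.0,
    }

row = {"age": 35, "income": 60_000, "debt": 15_000,
       "purchases": [(date(2024, 5, 20), 40.0), (date(2024, 5, 28), 60.0)]}
print(derive_features(row))
```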

For unstructured data—images, text, audio—feature engineering is more critical. Classical approaches manually designed features that captured domain knowledge.

Image features:

  • Edges: Detect boundaries between regions using filters (Sobel, Canny).
  • Textures: Measure local patterns using Gabor filters or local binary patterns.
  • Color histograms: Distribution of colors in the image.
  • HOG (Histogram of Oriented Gradients): Count edge orientations in local regions, capturing object shape.
  • SIFT/SURF: Scale-invariant keypoint descriptors that identify distinctive local features.

These features transform pixels into higher-level representations: “this region has strong vertical edges,” “this texture is smooth,” “this keypoint is distinctive.” A linear model can then learn that certain combinations of edges and textures indicate a dog.
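
Edge detection, the first feature in the list above, can be sketched directly. A minimal Sobel filter in NumPy (the explicit loops are for clarity; real pipelines use vectorized library routines):

```python
import numpy as np

def sobel_edges(img):
    """Approximate gradient magnitude with Sobel filters, a classic edge feature."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)  # horizontal gradient
    ky = kx.T                                                          # vertical gradient
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx, gy = (patch * kx).sum(), (patch * ky).sum()
            out[i, j] = np.hypot(gx, gy)  # edge strength at this location
    return out

# A white square on a black background: edges light up only at the boundary
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
edges = sobel_edges(img)
```

The output is zero everywhere the image is uniform (inside the square and in the background) and nonzero exactly along the square’s boundary, which is precisely the “this region has strong edges” signal a simple classifier can use.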

Text features:

  • Bag of words: Count how often each word appears, ignoring order.
  • TF-IDF: Weight words by how distinctive they are (frequent in this document, rare overall).
  • N-grams: Capture short sequences (“New York,” “not good”).
  • Word embeddings: Dense vectors where similar words have similar vectors.

These features transform text from character sequences into numerical representations that preserve semantic relationships.
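
Bag-of-words and TF-IDF fit in a few lines of plain Python. Note that TF-IDF has several weighting variants; this sketch uses raw term frequency times log inverse document frequency:

```python
import math
from collections import Counter

docs = [
    "the dog chased the cat",
    "the cat slept",
    "a good dog",
]

def bag_of_words(doc):
    """Word counts, order discarded."""
    return Counter(doc.split())

def tf_idf(docs):
    """Weight each word by frequency in its document and rarity across documents."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc.split()))  # document frequency
    vectors = []
    for doc in docs:
        counts = bag_of_words(doc)
        total = sum(counts.values())
        vectors.append({w: (c / total) * math.log(n / df[w])
                        for w, c in counts.items()})
    return vectors

vecs = tf_idf(docs)
```

In the first document, “the” appears twice but occurs in most documents, so its TF-IDF weight ends up below that of the distinctive word “chased”, which appears once but only in this document.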

The key insight: features define the hypothesis space—the set of functions the model can learn. If the features don’t encode the relevant information, the model cannot learn the pattern, no matter how complex the model is. Features transform the input space (pixels, characters) into a feature space where patterns are more accessible. Ideally, the feature space makes the data linearly separable: different classes cluster in different regions, so a linear model can separate them.
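
A concrete illustration: one-dimensional data where no threshold on the raw input separates the classes, but a single engineered feature makes them linearly separable (the threshold value below is illustrative):

```python
import numpy as np

# 1D data: class 1 when the point lies far from the origin, class 0 when near it.
x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# No single threshold on x works: class 1 sits on both sides of class 0.
# But in the feature space phi(x) = x**2, one threshold separates perfectly.
phi = x ** 2
threshold = 2.0  # any value in (0.25, 4.0) works here
pred = (phi > threshold).astype(int)
print((pred == y).all())  # True
```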

The kernel trick (used in SVMs) takes this idea further: it implicitly maps data to a very high-dimensional feature space where linear separation becomes possible without explicitly computing the features. This shows that representation—the space you learn in—matters more than the model you use.
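
A minimal illustration of the kernel idea, using a degree-2 polynomial kernel in two dimensions (real SVM kernels such as the RBF kernel correspond to much higher, even infinite-dimensional, feature spaces):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for 2D input; the dimension grows fast with degree."""
    x1, x2 = v
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    """The same inner product, computed without ever building the feature vectors."""
    return float(np.dot(x, z)) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, 0.5])
explicit = float(np.dot(phi(x), phi(z)))
implicit = poly_kernel(x, z)
print(explicit, implicit)  # both ≈ 16.0
```

The kernel evaluates the inner product in the quadratic feature space at the cost of one dot product in the original space, which is why SVMs can afford very high-dimensional representations.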

Conversely, with the right features, even simple models work well. This is why feature engineering was the highest-leverage activity in classical machine learning.

Manual vs Learned Features

Classical machine learning separated feature engineering from model training. A human expert designed features based on domain knowledge, and the model learned to combine those features. This two-stage process worked well but was labor-intensive and required deep expertise.

Deep learning changed this by learning features automatically. Neural networks do end-to-end learning: raw data goes in, predictions come out, and intermediate layers learn useful features without human intervention. This is representation learning.

Classical ML pipeline:

  1. Human expert designs features based on domain knowledge.
  2. Model learns to combine features (linear weights, tree splits).
  3. Performance depends critically on feature quality.

Deep learning pipeline:

  1. Raw data is fed to the network.
  2. Early layers learn low-level features (edges, textures).
  3. Middle layers learn mid-level features (parts, patterns).
  4. Late layers learn high-level features (concepts, categories).
  5. Final layer makes predictions.

The difference is profound. In classical ML, the model learns a function of fixed features. In deep learning, the model learns both the features and the function. This flexibility allows neural networks to discover representations humans never considered.

For images, convolutional neural networks (CNNs) learn hierarchical features:

  • Layer 1: Detects edges at different orientations.
  • Layer 2: Combines edges into shapes and textures.
  • Layer 3: Combines shapes into parts (eyes, ears, wheels).
  • Layer 4: Combines parts into objects (faces, cars).

No human programmed these features. The network learned them by minimizing classification loss on labeled images. The features emerged because they were useful for the task.

Why do learned features often beat manually engineered features? Several reasons:

  1. Adaptation to data: Learned features optimize for the specific dataset and task, discovering patterns humans might miss. Manual features encode general intuitions that may not align perfectly with the data.
  2. Discovery of unexpected patterns: Networks find features humans wouldn’t think to design. ImageNet-trained CNNs learn to detect patterns (fur textures, specific shapes) that are predictive but not obvious.
  3. Joint optimization: Learned features and final classifier are optimized together (end-to-end), ensuring features are maximally useful for the task. Manual features are fixed before training, potentially discarding relevant information.
  4. Scalability: Once a neural network architecture works, it scales to more data without additional human effort. Manual feature engineering requires expert time for each new domain.

However, learned features require substantial data. With limited data, manual features incorporating domain knowledge often outperform end-to-end learning because they inject prior knowledge that compensates for data scarcity.

[Diagram: Manual vs. Learned Features]

The diagram shows how neural networks learn hierarchical features. Early layers learn simple features (edges), middle layers learn combinations (shapes), and later layers learn abstract concepts. The representation space transforms from mixed and unstructured (raw pixels) to separated and structured (learned features).

This automatic feature learning is why deep learning succeeded where classical ML struggled on perceptual tasks. Human experts couldn’t design features good enough to capture the complexity of natural images, speech, or language. Neural networks could.

Hierarchies of Representation

The key to deep learning’s success is hierarchy. Features are learned in layers, where each layer builds on the previous one. Early layers learn simple, general features. Later layers learn complex, task-specific features. This mirrors how human perception works.

When you see a face, your visual system processes it hierarchically:

  1. Photoreceptors detect light and dark.
  2. V1 neurons detect edges and orientations.
  3. V2 neurons detect shapes and contours.
  4. V4 neurons detect object parts.
  5. IT cortex neurons recognize whole objects and faces.

Neural networks learn similar hierarchies. A CNN trained on ImageNet develops detectors for:

  • Layer 1: Horizontal edges, vertical edges, diagonal edges, color blobs. These are general features that appear in almost any image.
  • Layer 2: Corners formed by edge combinations, circles, simple textures (stripes, dots). These compose edges into basic shapes.
  • Layer 3: Complex patterns, recurring textures (fur, fabric, water), repeated elements (windows on buildings). These compose shapes into distinctive patterns.
  • Layer 4: Object parts—wheels, eyes, faces, text, windows. These compose patterns into recognizable components.
  • Layer 5: Whole objects—cars (wheels + windows + body), dogs (face + fur + body), buildings (walls + windows). These compose parts into complete concepts.

These features are composable. A “face” feature is built from “eye,” “nose,” and “mouth” features, which are built from edge and shape features. This compositionality makes learning efficient: you learn low-level features once and reuse them to build many high-level concepts. Instead of learning 1,000 object detectors from scratch, you learn a small set of shared low-level features and compose them differently for each object.

The same principle applies to text. A language model learns:

  • Character/token level: Spelling patterns, common prefixes/suffixes, character transitions.
  • Word level: Syntax, part-of-speech, word co-occurrence, morphology.
  • Phrase level: Common expressions (“in spite of”), grammatical structures, idioms.
  • Sentence level: Semantic relationships, context, syntactic dependencies.
  • Document level: Topics, themes, discourse structure, argument flow.

Higher layers build meaning by composing simpler patterns. The word “bank” has multiple meanings (financial institution, river edge), but sentence-level features resolve ambiguity from context.

Why depth matters: Deeper models learn more abstract representations. A 3-layer network might learn edges → shapes → simple objects. A 50-layer network learns edges → textures → patterns → parts → assemblies → objects → scenes → abstract concepts. Depth allows the network to build hierarchies of concepts, where each layer refines and abstracts the previous layer’s representations.

Empirically, deep networks outperform shallow wide networks with the same number of parameters. This suggests hierarchy—composing features through layers—is more powerful than learning a flat function.

This is why transfer learning works: features learned on one task (ImageNet classification) transfer to other tasks (medical imaging, satellite analysis) because the low-level features (edges, textures) are general. You can use a pretrained network’s early layers as-is and only retrain later layers for your specific task.
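
The feature-extractor recipe can be sketched in NumPy. Here a frozen random projection stands in for a pretrained network’s early layers (a real pipeline would reuse, say, a ResNet’s convolutional layers); only the final logistic-regression head is trained, on a toy two-blob task:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pretrained early layers: a fixed transform that is never updated.
W_frozen = rng.normal(size=(2, 16))

def extract_features(x):
    """Frozen feature extractor: frozen weights, ReLU nonlinearity."""
    return np.maximum(x @ W_frozen, 0.0)

# Toy task: two well-separated 2D blobs.
x = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Train only the final layer (a logistic-regression head) on the frozen features.
f = extract_features(x)
w, b = np.zeros(16), 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(f @ w + b)))                 # predicted probabilities
    grad_w, grad_b = f.T @ (p - y) / len(y), (p - y).mean()
    w -= 0.02 * grad_w
    b -= 0.02 * grad_b

acc = ((f @ w + b > 0).astype(int) == y).mean()
```

Because only the 16-weight head is trained, this needs far less data and compute than learning the whole network, which is exactly the appeal of transfer learning with limited data.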

Engineering Takeaway

Features determine what a model can learn. Investing in better features pays off more than investing in better algorithms.

Preprocess and normalize systematically. Raw data is rarely in the right form for learning. Normalization (scaling features to similar ranges, e.g., z-score or min-max) prevents some features from dominating due to scale differences. Encoding categorical variables (one-hot for low cardinality, embeddings for high cardinality) makes them usable. Handling missing values (imputation with mean/median/model predictions, or masking) prevents failures. Time spent on preprocessing is time well spent—poor preprocessing is a common cause of training instability and poor performance.
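
A sketch of these preprocessing steps in NumPy (the column names and values are hypothetical):

```python
import numpy as np

def zscore(col):
    """Scale a numeric column to zero mean and unit variance."""
    return (col - col.mean()) / (col.std() + 1e-9)

def one_hot(col, categories):
    """Encode a low-cardinality categorical column as indicator features."""
    return np.array([[1.0 if v == c else 0.0 for c in categories] for v in col])

age = np.array([22.0, 35.0, np.nan, 58.0])
age = np.where(np.isnan(age), np.nanmean(age), age)  # impute missing with the mean
age_scaled = zscore(age)

plan = ["free", "pro", "free", "enterprise"]
plan_encoded = one_hot(plan, categories=["free", "pro", "enterprise"])
```

In practice the mean, standard deviation, and category list must be computed on the training set only and then reused on validation and production data, or the features will leak information and drift out of sync.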

Feature engineering still matters, especially for tabular data. Even with deep learning, domain knowledge helps. For tabular data (customer records, sensor logs, financial transactions), engineered features (ratios, aggregations, time-based features, interaction terms) often outperform raw columns, even with neural networks. Deep learning excels on perceptual data (images, text, audio) where structure is implicit. For structured data where features have explicit meaning, manual engineering remains valuable.

Use data augmentation as feature-space expansion. For images, data augmentation (rotation, cropping, color jittering, flipping) creates variation that helps learning. This implicitly teaches the model invariances: a rotated cat is still a cat. For text, augmentation includes synonym replacement, back-translation, paraphrasing. For tabular data, consider noise injection or SMOTE (synthetic minority oversampling). Augmentation is a form of regularization that reduces overfitting by expanding the effective training set.
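
Basic image augmentations are a few lines of NumPy (the pad width and jitter range are illustrative):

```python
import numpy as np

def augment(img, rng, pad=2):
    """Generate label-preserving variants of one image: flip, shift, brightness."""
    out = [img[:, ::-1]]                               # horizontal flip: a mirrored cat is still a cat
    h, w = img.shape
    padded = np.pad(img, pad)                          # pad, then crop back to the original size
    top, left = rng.integers(0, 2 * pad + 1, size=2)
    out.append(padded[top:top + h, left:left + w])     # random small translation via cropping
    out.append(np.clip(img * rng.uniform(0.8, 1.2), 0, 1))  # brightness jitter
    return out

rng = np.random.default_rng(0)
img = rng.random((28, 28))
variants = augment(img, rng)
```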

Leverage embeddings as learned dense representations. Word embeddings (Word2Vec, GloVe) and contextualized embeddings (BERT, GPT) are features learned by neural networks on large text corpora. These features capture semantic relationships better than hand-designed features like TF-IDF or bag-of-words. The same principle extends to other domains: node embeddings for graphs, user/item embeddings for recommendation, image embeddings for retrieval. Embeddings compress sparse, high-dimensional data (vocabulary, user IDs) into dense, low-dimensional vectors where similar entities are close.
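
The core property of embeddings, that similar entities are close, reduces to cosine similarity. The vectors below are hand-made toys; real embeddings are learned and have hundreds of dimensions:

```python
import numpy as np

def cosine(a, b):
    """Similarity in embedding space: close to 1 for related entities."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings, constructed by hand for illustration.
emb = {
    "dog":   np.array([0.9, 0.8, 0.1, 0.0]),
    "puppy": np.array([0.8, 0.9, 0.2, 0.1]),
    "bank":  np.array([0.0, 0.1, 0.9, 0.8]),
}

print(cosine(emb["dog"], emb["puppy"]) > cosine(emb["dog"], emb["bank"]))  # True
```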

Use transfer learning to leverage pretrained features. Pretrained models (ResNet for images, BERT for text, Wav2Vec for audio) have already learned good features on large datasets (ImageNet, web text, speech corpora). You can use these features for your specific task by fine-tuning (continuing training on your data) or using the model as a feature extractor (freezing early layers, training only final layers). This is faster and more effective than training from scratch, especially with limited data. Transfer learning works because learned features are general—low-level features transfer broadly, high-level features transfer to similar tasks.

Monitor feature distributions in production for drift detection. In production, feature distributions can shift (concept drift). If input features change—users behave differently, sensors degrade, markets shift—model performance degrades even if the model itself hasn’t changed. Monitor feature statistics (mean, variance, quantiles, entropy) over time and compare to training distribution. Significant divergence signals that retraining is needed. Feature drift often precedes performance degradation, giving early warning.
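
One common drift statistic is the Population Stability Index, which compares binned feature distributions. The thresholds in the comment are a widely used rule of thumb, not a universal standard:

```python
import numpy as np

def psi(train_col, live_col, bins=10):
    """Population Stability Index between training and live distributions.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate."""
    edges = np.quantile(train_col, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range live values
    p = np.histogram(train_col, edges)[0] / len(train_col)
    q = np.histogram(live_col, edges)[0] / len(live_col)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)  # avoid log(0)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
same = rng.normal(0, 1, 10_000)       # same distribution: PSI near zero
shifted = rng.normal(0.8, 1, 10_000)  # mean shift: PSI well above the alarm threshold
```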

Design interpretability into feature engineering. To explain why a model made a prediction, you need to understand what features it’s using. For classical models, this means keeping features interpretable: “age > 30” is interpretable, a complex polynomial of age is not. For neural networks, use interpretability tools (saliency maps show which pixels mattered, attention weights show which words mattered) to understand what patterns the network learned. Interpretability is easier with meaningful features than with raw data.

The lesson: Features are the interface between reality and models. Good features make patterns obvious; bad features hide patterns. Classical ML required manual feature engineering based on domain expertise. Deep learning automates it through representation learning, discovering features from data. But in both cases, the quality of features—whether designed or learned—is what determines success. Invest in understanding and improving features, and model performance will follow.


References and Further Reading

Feature Engineering for Machine Learning – Alice Zheng and Amanda Casari https://www.oreilly.com/library/view/feature-engineering-for/9781491953235/

This book is a practical guide to feature engineering for structured data. It covers handling categorical variables, numerical transformations, text features, and time-series features with concrete examples and code. Reading this will teach you the techniques practitioners use to improve model performance through better features. Essential for anyone working with tabular data or building features for classical ML.

Representation Learning: A Review and New Perspectives – Yoshua Bengio, Aaron Courville, Pascal Vincent (2013) https://arxiv.org/abs/1206.5538

This paper surveys representation learning—the idea that models should learn features rather than rely on hand-engineering. The authors explain why deep learning works: hierarchical feature learning, compositionality, and distributed representations. The paper connects classical feature engineering, autoencoders, RBMs, and modern deep learning, providing both theoretical foundations and empirical insights. Reading this gives you the conceptual framework for understanding why neural networks discover useful features automatically.

Visualizing and Understanding Convolutional Networks – Matthew Zeiler and Rob Fergus (2013) https://arxiv.org/abs/1311.2901

This paper visualizes what features CNNs learn at each layer. Zeiler and Fergus use deconvolution to project activations back to pixel space, revealing that early layers detect edges and colors, middle layers detect textures and patterns, and late layers detect object parts and whole objects. Reading this (and studying the figures) will give you concrete intuition for hierarchical feature learning in neural networks. It makes abstract concepts (representation learning, hierarchical features) visually concrete.