Part VII: Engineering Reality
Research demos work. Production systems fail. This part confronts the gap between prototype and production: where AI systems break, why they break, and what you can do about it.
Models aren’t the hard part. Data pipelines are fragile, evaluation metrics miss important failures, systems drift over time, and models have fundamental limitations that no architecture or training method can fix. Understanding these failure modes helps you build more reliable systems.
Data pipelines are where most failures start. Training data has quality issues, labels are noisy, and distributions shift between training and deployment. ETL processes break, schemas change, and upstream systems fail. Managing data is often harder than managing models.
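Two of these failure modes can be caught mechanically before data reaches training: a schema check rejects batches whose columns or types changed upstream, and a drift check flags features whose live distribution has moved away from the training distribution. A minimal sketch, with illustrative column names and thresholds:

```python
# Minimal data-validation sketch: catch schema changes and distribution
# shift before they reach the model. Columns and thresholds are illustrative.
import statistics

EXPECTED_SCHEMA = {"user_id": int, "age": int, "score": float}

def validate_schema(rows):
    """Return a list of problems for rows whose columns or types changed."""
    problems = []
    for i, row in enumerate(rows):
        if set(row) != set(EXPECTED_SCHEMA):
            problems.append(f"row {i}: columns {sorted(row)} don't match schema")
            continue
        for col, typ in EXPECTED_SCHEMA.items():
            if not isinstance(row[col], typ):
                problems.append(
                    f"row {i}: {col} is {type(row[col]).__name__}, expected {typ.__name__}"
                )
    return problems

def mean_shift(train_values, live_values, threshold=0.5):
    """Flag a feature whose live mean drifts too far, in training-std units."""
    mu, sigma = statistics.mean(train_values), statistics.stdev(train_values)
    shift = abs(statistics.mean(live_values) - mu) / (sigma or 1.0)
    return shift > threshold
```

Running checks like these on every batch turns silent upstream changes (a column retyped to string, a feature whose mean jumps) into loud, early failures.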
Training and inference have different constraints and failure modes. Training optimizes for accuracy on held-out validation sets. Inference cares about latency, throughput, cost, and behavior on real user inputs. What works in training doesn’t always work in production.
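One concrete difference: training reports a single held-out accuracy number, while serving must also meet a latency budget, and it is tail latency (p99), not the mean, that users feel. A hypothetical sketch of checking a serving SLO; the budget value is illustrative:

```python
# Sketch: a latency SLO check for a serving path. Training metrics say
# nothing about this; the 200 ms p99 budget here is an illustrative value.
def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

def meets_slo(latencies_ms, p99_budget_ms=200.0):
    """True if the p99 latency fits within the serving budget."""
    return percentile(latencies_ms, 99) <= p99_budget_ms
```

A model that improves validation accuracy but doubles p99 latency can fail this check and be unshippable, which is exactly the training/production gap the paragraph describes.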
Evaluation is harder than it looks. Aggregate accuracy on a test set doesn’t capture real performance. Models fail on rare cases, edge cases, and adversarial inputs. A/B tests measure aggregate metrics but miss important failures. Building reliable systems requires understanding what your metrics don’t measure.
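The way aggregate metrics hide failures is easy to demonstrate: slice the evaluation set and compute accuracy per slice. In this sketch (the `segment` key and numbers are made up), a model that looks excellent overall is completely wrong on a rare subgroup:

```python
# Sketch: aggregate accuracy can hide a subgroup that fails badly.
# The "segment" slicing key and the data below are illustrative.
def accuracy(examples):
    """Fraction of examples the model got right."""
    return sum(e["correct"] for e in examples) / len(examples)

def sliced_accuracy(examples, key="segment"):
    """Accuracy computed separately for each slice of the data."""
    slices = {}
    for e in examples:
        slices.setdefault(e[key], []).append(e)
    return {name: accuracy(group) for name, group in slices.items()}

# 95 correct predictions on common inputs, 5 wrong ones on a rare slice:
data = (
    [{"segment": "common", "correct": True}] * 95
    + [{"segment": "rare", "correct": False}] * 5
)
# accuracy(data) reports 0.95, while the "rare" slice scores 0.0.
```

The headline 95% and the rare-slice 0% describe the same model; which number you look at determines whether the failure is visible at all.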
Models hallucinate, amplify biases, and break in unexpected ways. These aren’t bugs to fix; they’re fundamental limitations of current approaches. Hallucinations stem from probabilistic generation. Bias reflects training data. Brittleness comes from pattern matching without understanding.
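The link between probabilistic generation and hallucination can be seen in miniature: generation samples each token from a probability distribution, so even when the model assigns the wrong continuation low probability, it will still occasionally emit it. The distribution and tokens below are made up purely for illustration:

```python
# Sketch: sampling from a next-token distribution. A low-probability
# (wrong) continuation still gets emitted at roughly its probability.
# The distribution and tokens are invented for illustration.
import random

def sample_token(dist, rng):
    """Draw one token according to its probability."""
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Suppose the model is 90% sure the right continuation is "Paris",
# but puts 2% of its mass on the wrong answer "Berlin":
dist = {"Paris": 0.90, "Lyon": 0.08, "Berlin": 0.02}
rng = random.Random(0)
samples = [sample_token(dist, rng) for _ in range(1000)]
# Across many generations, "Berlin" shows up roughly 2% of the time:
# no bug occurred, yet the output is confidently wrong when it does.
```

This is why sampling-time tweaks (lower temperature, restricted decoding) reduce but cannot eliminate hallucination: the wrong continuations remain in the distribution being sampled.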
Safety and alignment remain unsolved. We want models that behave as intended, respect human values, and fail safely. Current approaches help but don’t solve the problem. Understanding these limits helps you deploy responsibly.
After this part, you’ll understand production realities. Part VIII looks ahead: where is AI technology heading, and what remains uncertain?