Chapter 34: Hallucinations, Bias, and Brittleness
Models do not just fail quietly by giving wrong answers. They fail in ways that are insidious, systematic, and often invisible until catastrophic damage occurs. Language models confidently generate facts that are completely false. Image classifiers predict “ostrich” when shown a school bus overlaid with an imperceptible adversarial perturbation. Hiring models systematically discriminate based on gender and race. Medical models trained on biased data amplify healthcare disparities.
These are not random errors. They are predictable failure modes that emerge from how models are trained, what data they see, and what they optimize. Understanding these failures is understanding why deploying AI requires guardrails, monitoring, and humility.
This chapter explains hallucinations (models generate plausible falsehoods), bias (models learn and amplify discrimination), adversarial brittleness (tiny perturbations break models), and robustness failures (models collapse under pressure). These are the failure modes that make AI dangerous.
Why Hallucinations Happen: Probability Is Not Truth
Hallucination occurs when a model generates outputs that are fluent, coherent, and confident, but factually false. Language models hallucinate because they optimize likelihood, not truth. They predict what text is likely to follow, not what is correct.
Why models hallucinate:
Models optimize next-token likelihood. Given a prompt, the model outputs the most probable continuation according to its training data. If training data contains misinformation, the model learns to generate misinformation. If training data lacks information on a topic, the model fabricates plausible-sounding text.
Confidence is not correctness. Models assign high probability to hallucinated content because hallucinations follow linguistic patterns seen in training. A fluent, grammatical sentence is assigned high likelihood even if factually wrong.
Example: GPT hallucinating legal cases
A lawyer used ChatGPT to write a legal brief. ChatGPT cited six case precedents—cases with official-looking names, docket numbers, and legal reasoning. All six cases were fabricated. They did not exist. ChatGPT generated plausible legal citations because it learned the pattern of how citations look, not because it verified their existence.
The model optimized likelihood: “What would a legal citation look like here?” It did not optimize truth: “Does this case exist?” The result was confidently wrong output that passed surface inspection but failed fact-checking.
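One practical guardrail for this failure is to extract the citations a model emits and check them against a trusted index before the output reaches a user. A minimal sketch; the regex and the `KNOWN_CASES` set below are illustrative stand-ins for a real citation parser and an authoritative legal database:

```python
import re

# Illustrative stand-in for a trusted legal database; a real system
# would query an authoritative court-records index instead.
KNOWN_CASES = {"smith v. jones", "doe v. acme"}

# Matches "Plaintiff v. Defendant" style case names (capitalized words).
CASE_PATTERN = re.compile(r"(?:[A-Z][\w.]+ )+v\. [A-Z][\w.]+(?: [A-Z][\w.]+)*")

def extract_citations(text):
    """Pull case-name-shaped strings out of generated text."""
    return [m.group(0) for m in CASE_PATTERN.finditer(text)]

def flag_unverified(text):
    """Return citations that do not appear in the trusted index."""
    return [c for c in extract_citations(text)
            if c.lower() not in KNOWN_CASES]
```

A citation that fails the lookup is not necessarily fabricated, but it must be verified by a human before the brief is filed; the point is that verification happens outside the model.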
Example: Medical misinformation
A user asks a medical chatbot: “What is the cure for lupus?” The model responds: “Lupus can be cured with a combination of vitamin D, turmeric, and gluten-free diet.” This is false. Lupus is a chronic autoimmune disease with no cure. Treatment involves immunosuppressants and corticosteroids, not dietary supplements.
The model hallucinated because:
- Alternative medicine misinformation is common in training data (blogs, forums)
- The pattern “X can be cured with…” appears frequently
- The model did not verify against medical consensus
The response is fluent and confident. A non-expert might believe it. This is dangerous.
Grounding and citation reduce hallucinations. Retrieval-Augmented Generation (RAG, Chapter 27) fetches documents and instructs the model to generate responses grounded in those documents. The model is less likely to hallucinate when constrained to paraphrase retrieved text. Citations let users verify claims.
But grounding does not eliminate hallucinations. Models can:
- Cite retrieved documents but misinterpret them
- Cite irrelevant documents to justify hallucinated content
- Generate confident claims despite weak evidence
Hallucinations are a fundamental property of generative models. They can be reduced but not eliminated. Any deployment of generative models must assume hallucinations will occur and design safeguards accordingly.
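A minimal grounded-prompting loop might look like the sketch below. The keyword-overlap retriever is a toy stand-in for the embedding-based vector search of Chapter 27, and the returned prompt string would be sent to an actual model API:

```python
import re

def words(text):
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, k=2):
    """Rank documents by keyword overlap with the query (a toy stand-in
    for embedding-based vector search)."""
    q = words(query)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:k]

def build_grounded_prompt(query, documents):
    """Constrain the model to answer only from retrieved, citable sources."""
    docs = retrieve(query, documents)
    context = "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))
    return (
        "Answer using ONLY the sources below. Cite each claim as [n]. "
        "If the sources do not contain the answer, say you do not know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

Even with this constraint, the caveats above stand: the model can still misread or over-extend the retrieved text, so the citations exist for user verification, not as proof of correctness.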
Bias: Data Becomes Destiny
Machine learning models learn from data. If the data reflects societal biases—sexism, racism, ableism—the model learns those biases. If the data overrepresents some groups and underrepresents others, the model performs better on overrepresented groups. Bias in data becomes bias in models.
Types of bias:
Representation bias: Some groups are underrepresented in training data. Models trained on biased data perform worse on underrepresented groups.
Example: Face recognition bias
Early face recognition datasets (e.g., Labeled Faces in the Wild) were roughly 77% male and 83% light-skinned. Commercial gender classifiers built on data like this achieved 99% accuracy on light-skinned males but as little as 65% accuracy on dark-skinned females (the Gender Shades audit). The models learned to recognize light-skinned male faces well (abundant examples) but performed poorly on dark-skinned female faces (rare examples).
This is not a model architecture problem. The model learned what the data taught. The data was biased, so the model became biased.
Measurement bias: The data uses proxies that do not capture what matters, leading to biased predictions.
Example: Recidivism prediction (COMPAS)
COMPAS is a tool used in US courts to predict recidivism (likelihood of reoffending). It uses features like prior arrests, age, and neighborhood. Studies found it falsely flagged Black defendants who did not reoffend as high-risk nearly twice as often as white defendants, while white defendants who did reoffend were mislabeled as low-risk nearly twice as often as Black defendants.
Why? The data reflects biased policing. Black individuals are arrested more often for the same behavior due to over-policing in Black neighborhoods. The model learns that arrest history predicts recidivism, but arrest history is a biased proxy. The model amplifies existing discrimination.
Historical bias: Data reflects past discrimination. Models trained on historical data perpetuate that discrimination.
Example: Amazon hiring tool
Amazon built a resume screening tool trained on 10 years of hiring data. The model learned that male candidates were hired more often (tech industry is male-dominated). It penalized resumes containing “women’s” (e.g., “women’s chess club”) and preferred resumes with male-associated language.
The model did not learn “good candidates.” It learned “what past hires looked like.” Past hires were biased, so the model became biased. Amazon scrapped the tool.
Amplification bias: Models can amplify biases beyond what exists in training data. Small correlations in data become strong signals in models.
Example: Word embeddings and gender stereotypes
Word embeddings (Chapter 18) trained on large text corpora learn associations:
- “doctor” is closer to “man” than “woman”
- “nurse” is closer to “woman” than “man”
- “programmer” is closer to “he” than “she”
These embeddings reflect gendered language in text (doctors are more often referred to as “he”). But when used in downstream tasks (search, recommendation, hiring), they amplify stereotypes. A search for “doctor” shows male doctors preferentially. A resume ranker penalizes women in technical roles.
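These associations can be measured directly with cosine similarity. A minimal sketch, using hand-made three-dimensional vectors with a deliberately gendered second dimension as a stand-in for real embeddings (which have hundreds of dimensions but show the same directional pattern):

```python
import math

# Illustrative toy "embeddings": dimension 2 encodes a gender direction.
EMB = {
    "man":    [0.9,  0.8, 0.1],
    "woman":  [0.9, -0.8, 0.1],
    "doctor": [0.2,  0.5, 0.9],
    "nurse":  [0.2, -0.5, 0.9],
}

def cosine(u, v):
    """Cosine similarity: the standard closeness measure for embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Positive gap means "doctor" leans toward "man" in this toy space.
gender_gap = cosine(EMB["doctor"], EMB["man"]) - cosine(EMB["doctor"], EMB["woman"])
```

Auditing real embeddings works the same way: compute similarity gaps between occupation words and gendered words, and treat a consistent nonzero gap as a learned stereotype.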
Debiasing is hard. You can:
- Rebalance training data: Oversample underrepresented groups, undersample overrepresented groups
- Debias representations: Remove gender/race signals from embeddings
- Add fairness constraints: Penalize disparate impact during training
- Post-process outputs: Adjust predictions to equalize false positive rates across groups
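Of these, rebalancing is usually the first step tried. A minimal oversampling sketch; the grouping function and the data format are placeholders for whatever group attribute and example type a real pipeline uses:

```python
import random

def oversample(examples, group_of, seed=0):
    """Duplicate examples from smaller groups until every group has as
    many examples as the largest group (the 'rebalance' option above)."""
    rng = random.Random(seed)
    groups = {}
    for ex in examples:
        groups.setdefault(group_of(ex), []).append(ex)
    target = max(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(members)
        # Sample with replacement to fill the gap up to the largest group.
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced
```

Note that this only duplicates existing minority-group examples; it cannot add information the data never contained, which is one reason rebalancing alone does not solve bias.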
None of these fully solve bias. Rebalancing changes the data distribution (test accuracy may drop). Debiasing removes explicit signals but implicit correlations remain (occupation → gender is still learned through proxies). Fairness constraints may improve one metric but worsen another (equal false positive rates may worsen overall accuracy).
Bias is not a technical problem alone—it is a sociotechnical problem. Technical fixes do not address root causes: discriminatory data collection, biased labels, unjust ground truth. You cannot debias a hiring model if historical hiring was discriminatory. You cannot debias a recidivism model if policing is discriminatory. The model learns what the data teaches. Fix the data, or accept that the model will be biased.
Adversarial Inputs: How Models Are Tricked
Adversarial examples are inputs carefully crafted to fool the model. They are imperceptible to humans but cause the model to make catastrophically wrong predictions.
Example: Adversarial stickers on stop signs
Researchers placed small, carefully designed stickers on stop signs. To humans, the stop sign looks normal. To a neural network, it is no longer a stop sign—it is classified as “speed limit 45.” A self-driving car seeing this sign might not stop, causing a crash.
The stickers are adversarial perturbations. They exploit how the model learned to recognize stop signs. The model relies on brittle features (edges, colors, textures) that can be manipulated without changing the sign’s appearance to humans.
Why adversarial examples exist:
Models learn decision boundaries in high-dimensional space. In these spaces, small perturbations (invisible to humans) can move an example from one side of the boundary to the other. The model has not learned robust features—it has learned shortcuts.
Generating adversarial examples (the Fast Gradient Sign Method, FGSM):
- Start with a clean input x with true label y (e.g., an image of a panda)
- Compute the gradient of the loss with respect to the input: g = ∇ₓ L(θ, x, y)
- Modify the input in the direction that increases loss: x′ = x + ε · sign(g)
- The modified input x′ looks nearly identical, but the model misclassifies it
With ε = 0.007 (an imperceptible change), an image classifier’s prediction flips from “panda” to “gibbon” with 99% confidence.
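The gradient step can be computed by hand for a simple model. A minimal sketch, assuming a toy logistic classifier p = σ(w·x + b), whose input gradient of the cross-entropy loss is (p − y)·w; the weights, input, and ε below are illustrative, and ε is far larger than an imperceptible image perturbation because this toy input has only three dimensions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """P(class = 1) for a logistic classifier."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def fgsm(w, b, x, y, eps):
    """x' = x + eps * sign(dL/dx); for logistic loss, dL/dx = (p - y) * w."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: 1.0 if g > 0 else -1.0 if g < 0 else 0.0
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Illustrative weights and input: the clean input is confidently class 1...
w, b = [2.0, -1.0, 3.0], 0.0
x, y = [0.3, 0.1, 0.2], 1
# ...but one signed step of 0.25 per dimension crosses the boundary.
x_adv = fgsm(w, b, x, y, eps=0.25)
```

Running this, `predict(w, b, x)` is about 0.75 (class 1) while `predict(w, b, x_adv)` falls below 0.5, so the predicted label flips even though each coordinate moved by only ±0.25.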
Physical adversarial examples:
Digital adversarial examples are interesting but not directly threatening (an attacker cannot set individual pixel values in a real-world scene). Physical adversarial examples work in the real world:
- Adversarial glasses: 3D-printed glasses that fool face recognition
- Adversarial patches: Printed stickers that cause misclassification
- Adversarial clothing: T-shirts with patterns that make pedestrian detectors miss people
These examples exploit models deployed in physical environments: surveillance cameras, self-driving cars, security systems.
Robustness to adversarial examples is hard. Adversarial training (training on adversarial examples) improves robustness but does not eliminate vulnerability. Models can be robust to known attacks but vulnerable to new attacks. The cat-and-mouse game continues.
Why adversarial robustness matters:
- Security systems: Attackers can craft adversarial inputs to evade detection (malware classifiers, spam filters, fraud detection)
- Safety-critical systems: Self-driving cars must not misclassify stop signs
- Trustworthiness: If tiny perturbations break models, can we trust them?
Adversarial examples reveal that models learn brittle, superficial features, not robust concepts. A model that sees a stop sign with stickers and predicts “speed limit” has not learned what a stop sign is—it has learned a fragile pattern.
Brittleness and Shortcut Learning
Models are brittle: they fail on inputs slightly different from training data. Small changes in distribution, format, or context break predictions. This brittleness arises from shortcut learning: models exploit spurious correlations rather than learning robust features.
Shortcut learning examples:
Cows in fields: Image classifiers trained on photos of cows (mostly in grassy fields) learn “grass texture predicts cow.” Shown a cow on a beach, the model fails. It learned the background, not the object.
Sentiment analysis and negation: Sentiment classifiers trained on “This movie is good” (positive) and “This movie is bad” (negative) struggle with negation: “This movie is not bad” is classified as negative because “bad” is a strong negative signal. The model learned word-level shortcuts, not compositional semantics.
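The word-level shortcut is easy to reproduce. A minimal sketch, using hand-picked word weights as a stand-in for what a bag-of-words classifier actually learns:

```python
# Per-word sentiment weights, as a bag-of-words model might learn them.
# "not" looks neutral on its own, so negation is invisible to the model.
WEIGHTS = {"good": 1.0, "great": 1.0, "bad": -1.0, "terrible": -1.0,
           "not": 0.0}

def sentiment(sentence):
    """Score words independently and sum: no compositional semantics."""
    score = sum(WEIGHTS.get(w, 0.0) for w in sentence.lower().split())
    return "positive" if score > 0 else "negative"
```

This classifier gets the training-style sentences right but calls “not bad” negative, because it scores each word independently and never composes “not” with “bad”.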
BERT and word overlap: BERT-based question answering models trained on SQuAD (Stanford Question Answering Dataset) learned to exploit word overlap between question and passage. Questions like “Who scored the goal?” are answered by finding the sentence with “scored” and “goal.” Adversarial datasets (SQuAD 2.0, adversarial SQuAD) break this shortcut by adding distractor sentences with high word overlap. Performance drops 40%.
NLI and overlap heuristics: Natural Language Inference models trained on SNLI/MNLI learn shortcuts: if the hypothesis contains negation words (“not,” “never”), predict “contradiction.” If the hypothesis is shorter than the premise, predict “entailment.” These shortcuts work on the training set but fail on out-of-distribution examples.
Why shortcuts are learned:
Models optimize for training accuracy using the easiest features. If a spurious correlation (grass → cow) achieves 95% accuracy, the model uses it. Learning robust features (actual cow shape) requires more data, more capacity, or better inductive biases.
Shortcuts are not bugs—they are optimal solutions to the training objective. The training set does not punish shortcuts, so models exploit them. Only out-of-distribution evaluation reveals that shortcuts fail.
Robustness failures under distribution shift:
Models trained on one distribution fail when deployed on another. Even small shifts break performance.
Example: COVID-19 and chest X-ray models
Researchers trained models to detect pneumonia from chest X-rays. Deployed during COVID-19, the models failed. Why? Training data came from specific hospitals with specific imaging protocols. COVID patients had different characteristics (disease presentation, demographics, comorbidities). The models learned hospital-specific artifacts (scanner type, positioning) as signals, not actual pathology.
The model memorized features of the training hospital, not generalizable medical features. Out-of-distribution deployment revealed this brittleness.
Lack of common sense:
Models lack world knowledge and fail on cases obvious to humans.
Example: “How many eyes does a horse have?”
Model: “Four.”
The model did not learn that horses are animals and that animals (insects and spiders aside) have two eyes. It pattern-matched “how many” questions and generated a plausible-sounding wrong answer.
Figure 34.1: Adversarial example showing how imperceptible noise (ε = 0.007) causes a confident misclassification. The model predicts “panda” with 99.8% confidence on the original image, but “gibbon” with 99.3% confidence on the perturbed image. To humans, the images are indistinguishable. To the model, they are completely different. This reveals the brittleness of learned features.
Engineering Takeaway
Hallucinations are fundamental to generative models—cannot be eliminated, only reduced through grounding, citations, and guardrails. Generative models optimize likelihood, not truth. They produce fluent, plausible output regardless of factual correctness. Retrieval-augmented generation and citation reduce hallucinations by grounding outputs in verified sources, but models can still misinterpret, cherry-pick, or fabricate despite constraints. Every deployment of generative models must assume hallucinations occur and implement verification mechanisms—human review for high-stakes domains, user-facing citations for fact-checking, confidence calibration to flag uncertain outputs.
Bias is in the data, not just the model—fixing bias requires fixing data sources, labels, and ground truth definitions. Debiasing techniques (rebalancing, fairness constraints, representation editing) address symptoms, not causes. If historical hiring data is sexist, a hiring model will be sexist. If recidivism labels reflect biased policing, a recidivism model will be biased. Technical fixes cannot eliminate bias when the data itself encodes discrimination. Addressing bias requires auditing data sources, questioning whether labels are just, and often deciding that some prediction tasks should not be automated at all.
Adversarial robustness is hard—models learn surface patterns, not deep understanding, making them vulnerable to attacks. Small perturbations invisible to humans fool models with high confidence. Adversarial training improves robustness to known attacks but does not generalize to new attacks. Physical adversarial examples (stickers, glasses, patches) threaten real-world systems. For security-critical applications (authentication, malware detection) and safety-critical applications (autonomous vehicles, medical diagnosis), adversarial vulnerabilities are unacceptable. Defense requires multiple layers: robust models, anomaly detection, redundancy, human oversight.
Distribution shift breaks models—test on realistic deployment scenarios, including edge cases and out-of-distribution inputs. Models trained on curated benchmarks fail on real-world messiness: noisy inputs, missing data, unusual formats, domain shifts, temporal changes. Test sets must include out-of-distribution examples that resemble deployment challenges. Robustness evaluation (ImageNet-C for corruption, ImageNet-A for adversarial natural examples) reveals brittleness hidden by standard benchmarks. Stress testing—deliberate probing for failures—is essential before deployment.
Guardrails are mandatory for high-stakes applications—no single model is reliable enough for unsupervised deployment. Hallucinations, bias, adversarial vulnerability, and brittleness mean models will fail. For low-stakes applications (entertainment, recommendations), failures are tolerable. For high-stakes applications (medical diagnosis, hiring, criminal justice, autonomous vehicles), failures cause harm. Guardrails mitigate risk: human-in-the-loop approval for critical decisions, confidence thresholds for flagging uncertain predictions, ensemble models for redundancy, rule-based checks for constraint violations, appeals processes for affected individuals.
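The confidence-threshold guardrail above can be sketched as a simple triage step that routes low-confidence predictions to a human queue; the threshold value and tuple format below are illustrative, and a real threshold would be tuned on held-out data:

```python
def triage(predictions, threshold=0.9):
    """Split model outputs into auto-approved and human-review queues.

    `predictions` is a list of (item, label, confidence) tuples; anything
    below the confidence threshold goes to a human reviewer.
    """
    auto, review = [], []
    for item, label, conf in predictions:
        (auto if conf >= threshold else review).append((item, label))
    return auto, review
```

This only helps if the model's confidence is calibrated: a model that is 99% confident in a hallucination sails straight through, which is why triage is one layer among several, not a complete defense.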
Explainability helps debug but does not solve brittleness—understanding why a model failed does not make it robust. Explainability techniques (saliency maps, LIME, SHAP) show what features the model used. This is useful for debugging shortcut learning and bias. But knowing the model relies on grass texture to predict cows does not make the model robust to cows on beaches. Explainability is a diagnostic tool, not a fix. Robustness requires better training data, better inductive biases, better architectures, or constraining deployment to domains where the model is known to work.
Safety-critical systems cannot rely on ML alone—need redundancy, verification, and fallback mechanisms. A 99.9% accurate model still fails 0.1% of the time. In safety-critical domains, that is unacceptable. Self-driving cars need redundant sensors and perception systems. Medical diagnosis needs human review. Financial systems need rule-based sanity checks. ML models provide capability, but system design provides safety. Redundancy (multiple models, diverse approaches), verification (rule-based checks on outputs), fallback mechanisms (human override, safe degraded mode) ensure that single-point ML failures do not cause catastrophic outcomes.
References and Further Reading
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., & Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys.
Why it matters: This comprehensive survey categorizes hallucination types in NLG (factual, faithfulness, instruction-following), explains causes (data quality, training objectives, decoding strategies), and reviews mitigation techniques (retrieval augmentation, fact verification, calibration). It shows that hallucinations are not rare bugs but systematic failures inherent to generation models. The survey is essential for understanding the scope of the hallucination problem and why it cannot be eliminated, only managed.
Buolamwini, J., & Gebru, T. (2018). Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. FAT* 2018.
Why it matters: This paper audited three commercial face recognition systems (Microsoft, IBM, Face++) and found significant accuracy disparities: 99% accuracy on light-skinned males, 65% accuracy on dark-skinned females. The cause: biased training datasets that underrepresent darker-skinned individuals and women. The paper demonstrated that bias is measurable, significant, and harms marginalized groups. It catalyzed discussions about algorithmic fairness and led companies to audit and improve their systems. Gender Shades is a landmark in AI ethics.
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and Harnessing Adversarial Examples. ICLR 2015.
Why it matters: This paper introduced the Fast Gradient Sign Method (FGSM) for generating adversarial examples and explained why they exist: neural networks learn linear decision boundaries in high-dimensional space, where small perturbations can cross boundaries. The paper showed that adversarial examples transfer across models (black-box attacks) and proposed adversarial training as a defense. It established adversarial robustness as a fundamental ML challenge and spawned a research field on attacks and defenses.
The final chapter addresses safety, alignment, and control: why models optimize the wrong objectives, how to align them with human intent, what guardrails are necessary, and why safety is system design, not just better models.