Adversarial Examples and Attacks
In 2014, researchers at Google and New York University published a finding that shook the machine learning community: by adding a carefully computed pattern of pixel noise, invisible to human eyes, to a correctly classified image, they could cause a state-of-the-art neural network to misclassify it with high confidence. A school bus became an ostrich. In follow-up work, a panda became a gibbon. The noise was so small that a human examining the two images side by side could not tell them apart. These results brought adversarial examples to wide attention, and they revealed something deep and troubling about how neural networks actually represent the world.
What Is an Adversarial Example?
An adversarial example is an input that has been deliberately modified, usually by an attacker and sometimes by a researcher, to cause a model to produce a wrong or targeted output. The modification is typically designed to be imperceptible to a human observer while maximally effective at misleading the model. Formally: given a model f and an input x that f correctly classifies as class c, an adversarial example is x + delta, where the perturbation delta is small (measured by some norm, usually L-infinity or L-2), but f(x + delta) = c', where c' is different from c.
The reason this is possible reveals something about what neural networks learn. When trained by gradient descent on high-dimensional inputs, models do not necessarily learn the same features humans use. Instead, they can learn to rely on high-frequency statistical patterns, such as textures and edge statistics, that are invisible to human visual processing but highly predictive in the training distribution. Adversarial perturbations exploit exactly these non-human features.
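The formal definition above can be checked mechanically. The sketch below is only an illustration: `toy_classifier`, the sample point, and the budget `eps` are hypothetical, and the check assumes that the model's prediction on the clean input is the correct class c.

```python
def linf_norm(delta):
    """L-infinity norm: the largest absolute change to any single coordinate."""
    return max(abs(d) for d in delta)


def l2_norm(delta):
    """L-2 norm: the Euclidean length of the perturbation."""
    return sum(d * d for d in delta) ** 0.5


def is_adversarial(f, x, delta, eps, norm=linf_norm):
    """True if delta stays within the budget eps and changes f's prediction on x."""
    x_adv = [xi + di for xi, di in zip(x, delta)]
    return norm(delta) <= eps and f(x_adv) != f(x)


# Hypothetical toy classifier: class 1 if the coordinates sum to more than 1.0.
def toy_classifier(x):
    return int(sum(x) > 1.0)


x = [0.5, 0.45]        # sums to 0.95, so the clean prediction is class 0
delta = [0.03, 0.03]   # L-infinity norm 0.03

print(is_adversarial(toy_classifier, x, delta, eps=0.05))  # within budget, label flips
print(is_adversarial(toy_classifier, x, delta, eps=0.01))  # exceeds the budget
```

Passing `norm=l2_norm` applies the same test under an L-2 budget instead.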
Adversarial examples are not primarily a sign of poor implementation. They appear to be a fundamental property of high-dimensional models trained by gradient descent on finite data. The geometry of high-dimensional space makes it almost inevitable that a correctly classified input lies very close to some region the classifier gets wrong. This makes the problem deep rather than merely technical.
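One common intuition for the role of dimensionality uses a purely linear score w·x. A perturbation of eps times sign(w) changes every coordinate by at most eps, yet shifts the score by eps times the L-1 norm of w, which grows with the number of input dimensions. The sketch below assumes hypothetical weights of magnitude 1. Real networks are not linear, so this is an illustration of the intuition rather than an explanation of any particular model.

```python
import random


def score_shift(dim, eps, seed=0):
    """Shift in a linear score w.x caused by the perturbation eps * sign(w)."""
    rng = random.Random(seed)
    # Hypothetical weights: each is +1 or -1, so the L-1 norm of w equals dim.
    w = [rng.choice([-1.0, 1.0]) for _ in range(dim)]
    delta = [eps if wi > 0 else -eps for wi in w]
    return sum(wi * di for wi, di in zip(w, delta))


# The per-coordinate change is always 0.01, but the score shift scales with dim.
for dim in (10, 1000, 100000):
    print(dim, round(score_shift(dim, eps=0.01), 6))
```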
Attack Taxonomy: White-Box, Black-Box, Targeted, Untargeted
Not all adversarial attacks are alike. Security researchers classify them along several axes.
White-box attacks assume the attacker has full knowledge of the model: its architecture, its weights, and its gradients. The original FGSM (Fast Gradient Sign Method) attack is white-box: it computes the gradient of the loss with respect to the input and takes a single step in the direction of that gradient's sign, which increases the loss. The PGD (Projected Gradient Descent) attack iterates this step many times, projecting the result back into a small perturbation budget after each step.
Black-box attacks assume the attacker can only query the model, feeding in inputs and observing outputs, without seeing its internals. These are more realistic in most deployment settings. Transfer attacks are a powerful black-box technique: the attacker trains a substitute model on the target model's input-output pairs, crafts adversarial examples against the substitute, then applies them to the target. Adversarial examples transfer across models with surprising frequency.
Targeted attacks aim to cause a specific misclassification: not just 'make this wrong,' but 'make this be classified as class X.' Untargeted attacks only need to cause any misclassification. Targeted attacks typically require more computation but are more dangerous in high-stakes settings, because the attacker chooses the outcome.
Physical attacks operate in the real world rather than in digital space. Researchers have demonstrated stop signs with small sticker patches that consistently fool classification networks, 3D-printed objects designed to be classified as other objects from any angle, and eyeglass frames that fool face recognition systems.
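FGSM and PGD, as described above, can be sketched on a model small enough to differentiate by hand. The example assumes a hypothetical logistic-regression classifier with made-up weights `w` and bias `b`. A real attack would obtain the input gradient from an autodiff framework rather than from the closed-form expression used here.

```python
import math


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def loss_grad_wrt_input(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to x, for label y in {0, 1}."""
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps that increases the loss on label y."""
    g = loss_grad_wrt_input(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]


def pgd(w, b, x, y, eps, step, iters, targeted=False):
    """Iterated signed-gradient steps, projected back into the L-infinity ball.

    Untargeted: y is the true label and each step increases the loss.
    Targeted: y is the attacker's chosen label and each step decreases the loss.
    """
    direction = -1.0 if targeted else 1.0
    x_adv = list(x)
    for _ in range(iters):
        g = loss_grad_wrt_input(w, b, x_adv, y)
        x_adv = [xa + direction * step * sign(gi) for xa, gi in zip(x_adv, g)]
        # Projection: clip each coordinate to within eps of the original input.
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
    return x_adv


# Hypothetical model and input, chosen only for illustration.
w, b = [2.0, -3.0, 1.0], 0.1
x, y = [0.2, -0.1, 0.1], 1

x_fgsm = fgsm(w, b, x, y, eps=0.2)
x_pgd = pgd(w, b, x, y, eps=0.2, step=0.05, iters=10)
print(predict_proba(w, b, x), predict_proba(w, b, x_fgsm), predict_proba(w, b, x_pgd))
```

On a linear model like this one, the single FGSM step already reaches the worst case inside the budget, so PGD ends at the same point. The iteration matters for nonlinear models, where the gradient changes along the path. With only two classes, a targeted attack on the other class also coincides with the untargeted one.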
Defenses and Why They Are Hard
Defending against adversarial examples has proven remarkably difficult. Most proposed defenses have either been broken by stronger attacks or come with significant accuracy costs.
Adversarial training is the most empirically successful defense: augment the training data with adversarial examples and train the model to classify them correctly. This helps, but is computationally expensive, degrades accuracy on clean inputs, and only defends against the attack types included in training. A model adversarially trained against L-infinity perturbations may still be vulnerable to other perturbation types.
Certified defenses provide mathematical guarantees: a model is certified robust if it can be proven that no perturbation within a given radius can change the prediction. Randomized smoothing is one certified approach: run the model many times on randomly noised versions of the input and take the majority vote. Certified guarantees currently cover only small perturbation radii, and many certification methods remain costly to scale to large models.
Input preprocessing, such as removing high-frequency components or running inputs through autoencoders, can eliminate some adversarial perturbations but can also be circumvented by attackers who account for the preprocessing.
The fundamental tension is that adversarial robustness and standard accuracy appear to trade off against each other. Making a model robust often makes it slightly less accurate on clean inputs. The reason is related to the non-human features neural networks rely on: the high-frequency patterns that make models accurate may be the same ones that make them vulnerable.
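The majority vote behind randomized smoothing, described above, can be sketched in a few lines. This is a minimal illustration, not a certified defense: `base_classifier`, the noise level `sigma`, and the sample count are hypothetical. The statistical step that turns the vote share into a provable robustness radius is omitted.

```python
import random
from collections import Counter


def smoothed_predict(classify, x, sigma, n_samples, seed=0):
    """Classify many Gaussian-noised copies of x and return the majority label.

    Also returns the fraction of votes the winning label received.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[classify(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


# Hypothetical base classifier: class 1 if the coordinates sum to more than zero.
def base_classifier(x):
    return int(sum(x) > 0.0)


label, vote_share = smoothed_predict(
    base_classifier, [0.4, 0.3, 0.5], sigma=0.25, n_samples=1000
)
print(label, vote_share)
```

A vote share close to 1 suggests the input sits far from the decision boundary relative to the noise scale. A share near one half suggests the smoothed prediction is fragile at that point.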
Early defenses tried to hide or obscure the model's gradients, reasoning that attackers could not compute perturbations without gradient access. This is known as gradient masking or obfuscated gradients, and it consistently fails. Attackers can bypass gradient masking by mounting transfer attacks, by approximating gradients with finite differences, or by averaging gradients over a defense's randomness (Expectation over Transformation). A defense that works only because the attacker cannot see gradients is not a real defense.
A researcher adds a tiny, human-imperceptible pattern of noise to a stop-sign image. A self-driving car's vision system, which correctly identified the sign without the noise, now classifies it as a 'speed limit 45' sign. This is an example of:
An attacker cannot access the weights of a deployed image classifier. They train their own model on thousands of input-output pairs queried from the target and craft adversarial examples against their substitute. Why might these examples still fool the target?
Design a Physical Adversarial Attack Scenario
- You are a security researcher hired to assess a computer vision system deployed in a specific high-stakes physical setting (choose one: a passport control camera, an autonomous vehicle's road sign recognizer, a factory defect detector, or a medical X-ray classifier).
- Step 1: Describe your chosen system and what it is trying to classify.
- Step 2: Design an adversarial attack scenario: Who is the attacker? What is their goal? What physical perturbation could they introduce (stickers, printed overlays, physical modifications to objects)?
- Step 3: Analyze what defense would be most practical in your setting and why.
- Step 4: Identify one way the attacker could potentially defeat that defense.
- Step 5: Write a one-paragraph risk assessment: 'The adversarial attack risk for this system is [low/medium/high] because...'
- Be specific and technically grounded. Vague answers like 'add more security' are not acceptable.