Lesson 1: Evasion Attacks (FGSM, PGD, C&W)
Master the most common evasion attacks, including the Fast Gradient Sign Method (FGSM), Projected Gradient Descent (PGD), and the Carlini & Wagner (C&W) attack.
Learning Objectives
By the end of this lesson, you will be able to:
- Understand the principles behind evasion attacks
- Implement Fast Gradient Sign Method (FGSM)
- Apply Projected Gradient Descent (PGD) attacks
- Execute Carlini & Wagner (C&W) attacks
- Evaluate attack effectiveness and stealth
- Implement detection and mitigation strategies
Understanding Evasion Attacks
What are Evasion Attacks?
Evasion attacks are adversarial techniques that craft inputs specifically designed to fool machine learning models while appearing legitimate to humans. These attacks exploit the model's decision boundaries and sensitivity to small perturbations.
Key Characteristics:
- Minimal Perturbation: Small changes to input data
- Targeted Misclassification: Specific wrong predictions
- Stealth: Changes should be imperceptible to humans
- Transferability: Attacks may work across different models (see the transfer-rate sketch below)
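To make transferability measurable, the hedged sketch below crafts one-step gradient-sign examples (the FGSM update introduced later in this lesson) against one classifier and checks how often they also fool a second one. `source_model`, `target_model`, `images`, `labels`, and `epsilon` are illustrative assumptions, and input normalization is ignored for brevity.

```python
import torch
import torch.nn.functional as F

def transfer_rate(source_model, target_model, images, labels, epsilon=0.03):
    """Craft one-step gradient-sign examples on source_model and measure how
    often they also fool target_model. Both models are assumed to be eval-mode
    classifiers taking inputs in [0, 1]; normalization is ignored for brevity."""
    images = images.clone().requires_grad_(True)
    loss = F.cross_entropy(source_model(images), labels)
    loss.backward()
    adv = torch.clamp(images + epsilon * images.grad.sign(), 0, 1).detach()

    with torch.no_grad():
        fooled_source = (source_model(adv).argmax(dim=1) != labels).float().mean()
        fooled_target = (target_model(adv).argmax(dim=1) != labels).float().mean()
    return fooled_source.item(), fooled_target.item()
```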
Attack Taxonomy
By Knowledge Level
- White-box: Full access to model architecture and parameters
- Black-box: Only input/output access
- Gray-box: Partial knowledge (architecture but not weights)
By Attack Goal
- Targeted: Force a specific wrong prediction
- Untargeted: Any wrong prediction is acceptable (the sketch after this list contrasts the two in code)
- Universal: Single perturbation works for multiple inputs
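To make the targeted/untargeted distinction concrete, here is a minimal sketch of a one-step gradient-sign update in both modes; `model`, `x`, `y_true`, `y_target`, and `epsilon` are placeholder names rather than definitions from this lesson.

```python
import torch
import torch.nn.functional as F

def one_step_perturbation(model, x, epsilon, y_true=None, y_target=None):
    """Single gradient-sign step: untargeted (maximize loss on the true label)
    or targeted (minimize loss on an attacker-chosen label)."""
    x = x.clone().requires_grad_(True)
    logits = model(x)

    if y_target is not None:
        # Targeted: step *down* the loss gradient toward the chosen class
        loss = F.cross_entropy(logits, y_target)
        step = -epsilon * torch.autograd.grad(loss, x)[0].sign()
    else:
        # Untargeted: step *up* the loss gradient away from the true class
        loss = F.cross_entropy(logits, y_true)
        step = epsilon * torch.autograd.grad(loss, x)[0].sign()

    return torch.clamp(x + step, 0, 1).detach()
```

The only difference between the two modes is the sign of the step: untargeted attacks ascend the loss on the true label, while targeted attacks descend the loss on the attacker-chosen label.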
Fast Gradient Sign Method (FGSM)
Mathematical Foundation
FGSM is a simple yet effective one-step attack that uses the gradient of the loss function to determine the direction of perturbation.
FGSM Formula:
x_adv = x + ε * sign(∇_x J(θ, x, y))
Where:
- x_adv: Adversarial example
- x: Original input
- ε: Perturbation magnitude (epsilon)
- ∇_x J: Gradient of the loss function w.r.t. the input
- θ: Model parameters
- y: True label
Implementation Example
```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, data, target, epsilon):
    """
    Fast Gradient Sign Method attack.

    Args:
        model: The model to attack
        data: Input data (batch)
        target: True labels
        epsilon: Attack strength (maximum L-infinity perturbation)

    Returns:
        adversarial_data: Perturbed inputs
    """
    # Work on a detached copy so the caller's tensor is untouched,
    # and enable gradient tracking on the input
    data = data.clone().detach().requires_grad_(True)

    # Forward pass (nll_loss assumes the model outputs log-probabilities,
    # e.g. a final log_softmax layer; use F.cross_entropy for raw logits)
    output = model(data)
    loss = F.nll_loss(output, target)

    # Backward pass to compute the gradient of the loss w.r.t. the input
    model.zero_grad()
    loss.backward()
    data_grad = data.grad

    # Take a single step of size epsilon in the direction of the gradient sign
    perturbed_data = data + epsilon * data_grad.sign()

    # Clamp to the valid input range [0, 1]
    perturbed_data = torch.clamp(perturbed_data, 0, 1)
    return perturbed_data.detach()
```
FGSM Parameters:
- ε (epsilon): Controls attack strength (0.01-0.3 typical)
- Loss function: Cross-entropy is most common
- Gradient direction: Sign function for the L∞ norm
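As a usage sketch (not part of the lesson's reference code), the snippet below sweeps several ε values with `fgsm_attack` and records how accuracy degrades; `model`, `test_loader`, and `device` are assumed to come from your own setup, with inputs already scaled to [0, 1].

```python
import torch

def accuracy_under_fgsm(model, test_loader, device, epsilons=(0.0, 0.05, 0.1, 0.2, 0.3)):
    """Measure classification accuracy under FGSM for a range of epsilon values."""
    results = {}
    model.eval()
    for epsilon in epsilons:
        correct, total = 0, 0
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            adv = fgsm_attack(model, data, target, epsilon)
            with torch.no_grad():
                pred = model(adv).argmax(dim=1)
            correct += (pred == target).sum().item()
            total += target.numel()
        results[epsilon] = correct / total
    return results
```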
Projected Gradient Descent (PGD)
Iterative Refinement
PGD is an iterative version of FGSM that applies multiple small steps to find stronger adversarial examples within a specified norm ball.
PGD Algorithm:
x_0 = x + uniform_noise(-ε, ε)
For i = 1 to num_iterations:
    x_i = x_{i-1} + α * sign(∇_x J(θ, x_{i-1}, y))
    x_i = project_to_ball(x_i, x, ε)
Implementation Example
```python
def pgd_attack(model, data, target, epsilon, alpha, num_iter):
    """
    Projected Gradient Descent attack (L-infinity norm).

    Args:
        model: The model to attack
        data: Input data (batch)
        target: True labels
        epsilon: Maximum perturbation (L-infinity norm)
        alpha: Step size per iteration
        num_iter: Number of iterations

    Returns:
        adversarial_data: Perturbed inputs
    """
    # Random start inside the epsilon ball
    perturbed_data = data + torch.empty_like(data).uniform_(-epsilon, epsilon)
    perturbed_data = torch.clamp(perturbed_data, 0, 1).detach()

    for _ in range(num_iter):
        perturbed_data.requires_grad_(True)

        # Forward pass (nll_loss assumes log-probability outputs, as in fgsm_attack)
        output = model(perturbed_data)
        loss = F.nll_loss(output, target)

        # Compute the gradient of the loss w.r.t. the current adversarial input
        model.zero_grad()
        loss.backward()

        with torch.no_grad():
            # Gradient-sign step
            perturbed_data = perturbed_data + alpha * perturbed_data.grad.sign()
            # Project back into the epsilon ball around the original input
            delta = torch.clamp(perturbed_data - data, -epsilon, epsilon)
            # Clamp to the valid input range [0, 1]
            perturbed_data = torch.clamp(data + delta, 0, 1).detach()

    return perturbed_data
```
PGD Parameters:
- ε (epsilon): Maximum perturbation (0.01-0.3)
- α (alpha): Step size (typically ε/4)
- num_iter: Number of iterations (10-40 typical)
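Because PGD starts from random noise, running several restarts and keeping any successful example usually strengthens the attack. The sketch below is one hedged way to wrap `pgd_attack` with restarts; the restart count and the per-sample success criterion are illustrative choices, not part of the lesson's reference code.

```python
import torch

def pgd_with_restarts(model, data, target, epsilon, alpha, num_iter, restarts=5):
    """Run pgd_attack several times from different random starts and keep,
    per sample, the first adversarial example that flips the prediction."""
    best_adv = data.clone()
    still_correct = torch.ones(data.shape[0], dtype=torch.bool, device=data.device)

    for _ in range(restarts):
        adv = pgd_attack(model, data, target, epsilon, alpha, num_iter)
        with torch.no_grad():
            pred = model(adv).argmax(dim=1)
        flipped = (pred != target) & still_correct
        best_adv[flipped] = adv[flipped]
        still_correct &= ~flipped
        if not still_correct.any():
            break
    return best_adv
```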
Carlini & Wagner (C&W) Attack
Optimization-Based Approach
The C&W attack formulates adversarial example generation as an optimization problem and often finds smaller perturbations than gradient-sign methods such as FGSM and PGD.
C&W Objective Function:
minimize ||δ||_p + c * f(x + δ)
subject to:
- x + δ ∈ [0, 1]^n
- x + δ ∈ valid_input_space
Where f(x + δ) is a surrogate objective that encourages misclassification: it stays positive while the model still classifies x + δ correctly and drops to zero once the desired misclassification is achieved.
Implementation Example
```python
def cw_attack(model, data, target, c=1.0, kappa=0, max_iter=1000,
              binary_search_steps=9, learning_rate=0.01):
    """
    Simplified Carlini & Wagner L2 attack (untargeted).

    Args:
        model: The model to attack (assumed to output logits)
        data: Input data (batch)
        target: True labels
        c: Upper bound for the binary search over the trade-off constant
        kappa: Confidence margin parameter
        max_iter: Maximum optimization iterations per search step
        binary_search_steps: Binary search steps for c
        learning_rate: Adam optimizer learning rate

    Returns:
        adversarial_data: Perturbed inputs
    """
    num_classes = model(data).shape[1]

    # Simplified binary search over the trade-off constant c in [0, c].
    # (The full C&W attack also re-initializes delta per search step and uses a
    # tanh change of variables to enforce the box constraint; here we clamp at
    # the end instead.)
    c_low, c_high = 0.0, c

    delta = torch.zeros_like(data, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=learning_rate)

    for _ in range(binary_search_steps):
        c_mid = (c_low + c_high) / 2

        for _ in range(max_iter):
            optimizer.zero_grad()

            # Forward pass on the perturbed input
            output = model(data + delta)

            # Logit of the true class and the largest logit of any other class
            target_score = output.gather(1, target.unsqueeze(1)).squeeze(1)
            true_class_mask = F.one_hot(target, num_classes).bool()
            max_other_score = output.masked_fill(true_class_mask, float('-inf')).max(dim=1)[0]

            # Untargeted C&W objective: positive while the true class still wins
            # by a margin of kappa, zero once another class overtakes it
            f = torch.clamp(target_score - max_other_score + kappa, min=0.0)

            # L2 distortion plus the weighted misclassification term
            loss = torch.sum(delta ** 2) + c_mid * torch.sum(f)
            loss.backward()
            optimizer.step()

            # Stop early once every sample in the batch is misclassified
            with torch.no_grad():
                adv_pred = model(torch.clamp(data + delta, 0, 1)).argmax(dim=1)
                success = adv_pred != target
            if success.all():
                break

        if success.all():
            # Attack succeeded: try a smaller c to reduce distortion
            c_high = c_mid
        else:
            # Attack failed: increase c to weight misclassification more heavily
            c_low = c_mid

    return torch.clamp(data + delta.detach(), 0, 1)
```
C&W Parameters:
- c: Trade-off constant between distortion and the misclassification term (0.1-10)
- κ (kappa): Confidence margin (0-10)
- max_iter: Optimization iterations (1000)
- learning_rate: Adam learning rate (0.01)
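As a hedged usage sketch, the call below runs the simplified `cw_attack` from this lesson on a single batch and reports the success rate and mean L2 distortion; `model`, `data`, and `target` are assumed to come from your own pipeline, and the reduced iteration counts are only to keep the example fast.

```python
import torch

# Hypothetical single-batch run of the simplified C&W attack defined above.
# `model`, `data` (inputs in [0, 1]), and `target` are assumed from your setup.
adv = cw_attack(model, data, target, c=1.0, kappa=0,
                max_iter=200, binary_search_steps=5, learning_rate=0.01)

with torch.no_grad():
    pred = model(adv).argmax(dim=1)

success_rate = (pred != target).float().mean().item()
mean_l2 = (adv - data).flatten(1).norm(p=2, dim=1).mean().item()
print(f"C&W success rate: {success_rate:.2%}, mean L2 distortion: {mean_l2:.4f}")
```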
Hands-On Exercise
Exercise: Implement and Compare Evasion Attacks
Objective: Implement FGSM, PGD, and C&W attacks on a pre-trained model and compare their effectiveness.
Steps:
- Setup Environment
Install the required libraries and load a pre-trained model:

```bash
pip install torch torchvision tensorboard
pip install cleverhans  # Adversarial attack library
```

```python
# Load a pre-trained ImageNet classifier
import torch
import torchvision.models as models

model = models.resnet50(pretrained=True)
model.eval()
```
- Prepare Test Data
Load and preprocess test images:

```python
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

test_dataset = datasets.ImageNet(root='./data', split='val', transform=transform)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1)
```
- Implement Attack Comparison
Create a function to compare all three attacks:

```python
def compare_attacks(model, data, target, epsilon=0.03):
    """
    Compare FGSM, PGD, and C&W attacks on one batch.

    Note: the attack functions above clamp to [0, 1], so pass unnormalized
    images (or fold the normalization into the model) for consistent results.
    """
    results = {}

    # FGSM attack
    fgsm_adv = fgsm_attack(model, data, target, epsilon)
    fgsm_pred = torch.argmax(model(fgsm_adv), dim=1)
    results['FGSM'] = {
        'success': (fgsm_pred != target).float().mean().item(),
        'perturbation': torch.norm(fgsm_adv - data, p=float('inf')).item()
    }

    # PGD attack
    pgd_adv = pgd_attack(model, data, target, epsilon, epsilon / 4, 20)
    pgd_pred = torch.argmax(model(pgd_adv), dim=1)
    results['PGD'] = {
        'success': (pgd_pred != target).float().mean().item(),
        'perturbation': torch.norm(pgd_adv - data, p=float('inf')).item()
    }

    # C&W attack (reports an L2 distortion, so it is not directly comparable
    # to the L-infinity numbers above)
    cw_adv = cw_attack(model, data, target)
    cw_pred = torch.argmax(model(cw_adv), dim=1)
    results['C&W'] = {
        'success': (cw_pred != target).float().mean().item(),
        'perturbation': torch.norm(cw_adv - data, p=2).item()
    }

    return results
```
- Visualize Results
Create visualizations comparing attack effectiveness:

```python
import matplotlib.pyplot as plt

def visualize_attacks(original, fgsm_adv, pgd_adv, cw_adv):
    """
    Visualize the original image next to the three adversarial examples.
    """
    fig, axes = plt.subplots(1, 4, figsize=(16, 4))

    # Undo the ImageNet normalization for display
    def denormalize(tensor):
        mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
        std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
        return tensor * std + mean

    images = [original, fgsm_adv, pgd_adv, cw_adv]
    titles = ['Original', 'FGSM', 'PGD', 'C&W']

    for i, (img, title) in enumerate(zip(images, titles)):
        img_denorm = torch.clamp(denormalize(img.squeeze()), 0, 1)
        axes[i].imshow(img_denorm.permute(1, 2, 0))
        axes[i].set_title(title)
        axes[i].axis('off')

    plt.tight_layout()
    plt.show()
```
Deliverables:
- Working implementation of all three attacks
- Attack success rate comparison
- Perturbation magnitude analysis
- Visual comparison of adversarial examples
- Performance benchmark results (a timing sketch follows below)
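For the benchmark deliverable, a minimal wall-clock timing harness such as the hedged sketch below is usually enough; it assumes the three attack functions defined earlier in this lesson plus a `model`, `data`, and `target` batch from your own setup.

```python
import time
import torch

def benchmark_attacks(model, data, target, epsilon=0.03):
    """Time each attack on a single batch and record its success rate."""
    attacks = {
        'FGSM': lambda: fgsm_attack(model, data, target, epsilon),
        'PGD':  lambda: pgd_attack(model, data, target, epsilon, epsilon / 4, 20),
        'C&W':  lambda: cw_attack(model, data, target, max_iter=200),
    }
    timings = {}
    for name, run in attacks.items():
        start = time.perf_counter()
        adv = run()
        elapsed = time.perf_counter() - start
        with torch.no_grad():
            success = (model(adv).argmax(dim=1) != target).float().mean().item()
        timings[name] = {'seconds': elapsed, 'success_rate': success}
    return timings
```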