📚 Learning Objectives

By the end of this lesson, you will be able to:

  • Explain what evasion attacks are and how they are categorized
  • Implement the FGSM, PGD, and C&W attacks in PyTorch
  • Compare the three attacks by success rate and perturbation magnitude

🔬 Understanding Evasion Attacks

What are Evasion Attacks?

Evasion attacks are adversarial techniques that craft inputs specifically designed to fool machine learning models while appearing legitimate to humans. These attacks exploit the model's decision boundaries and sensitivity to small perturbations.

🎯 Key Characteristics:

  • Minimal Perturbation: Small changes to input data
  • Targeted Misclassification: Specific wrong predictions
  • Stealth: Changes should be imperceptible to humans
  • Transferability: Attacks may work across different models
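
To make "Minimal Perturbation" concrete, here is a small sketch (illustrative only, not part of the lesson's reference code) that measures a perturbation under the L∞ and L2 norms; x and x_adv are hypothetical stand-ins for an original input and its adversarial counterpart.

import torch

# Hypothetical original image in [0, 1] and a stand-in perturbed copy
x = torch.rand(3, 224, 224)
x_adv = torch.clamp(x + 0.03 * torch.sign(torch.randn_like(x)), 0, 1)

delta = x_adv - x
print(f"L-inf norm: {delta.abs().max().item():.4f}")  # largest single-pixel change
print(f"L2 norm:    {delta.norm(p=2).item():.4f}")    # overall Euclidean distance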

Attack Taxonomy

📊 By Knowledge Level

  • White-box: Full access to model architecture and parameters
  • Black-box: Only input/output access
  • Gray-box: Partial knowledge (architecture but not weights)

🎯 By Attack Goal

  • Targeted: Force specific wrong prediction
  • Untargeted: Any wrong prediction acceptable
  • Universal: Single perturbation works for multiple inputs
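
The targeted/untargeted distinction shows up directly in how an attack uses the loss gradient. The hedged sketch below (model, x, and label are placeholders) previews the idea that FGSM formalizes in the next section: an untargeted attack increases the loss of the true label, while a targeted attack decreases the loss of a chosen target label.

import torch
import torch.nn.functional as F

def one_step_perturbation(model, x, label, epsilon, targeted=False):
    """One signed-gradient step; label is the true label (untargeted) or the desired label (targeted)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)
    loss.backward()
    if targeted:
        # Step *against* the gradient to make the chosen label more likely
        return torch.clamp(x - epsilon * x.grad.sign(), 0, 1)
    # Step *along* the gradient to make the true label less likely
    return torch.clamp(x + epsilon * x.grad.sign(), 0, 1)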

⚡ Fast Gradient Sign Method (FGSM)

Mathematical Foundation

FGSM is a simple yet effective one-step attack that uses the gradient of the loss function to determine the direction of perturbation.

📐 FGSM Formula:

x_adv = x + ε * sign(∇_x J(θ, x, y))

Where:

  • x_adv: Adversarial example
  • x: Original input
  • ε: Perturbation magnitude (epsilon)
  • ∇_x J: Gradient of the loss function w.r.t. the input
  • y: True label

Implementation Example

import torch
import torch.nn.functional as F

def fgsm_attack(model, data, target, epsilon):
    """
    Fast Gradient Sign Method Attack
    
    Args:
        model: The model to attack
        data: Input data (batch)
        target: True labels
        epsilon: Attack strength
    
    Returns:
        adversarial_data: Perturbed inputs
    """
    # Work on a detached copy that tracks gradients w.r.t. the input
    data = data.clone().detach().requires_grad_(True)
    
    # Forward pass (cross-entropy on raw logits; use F.nll_loss if the model
    # outputs log-probabilities)
    output = model(data)
    loss = F.cross_entropy(output, target)
    
    # Backward pass to compute gradients w.r.t. the input
    model.zero_grad()
    loss.backward()
    
    # Collect the input gradient
    data_grad = data.grad
    
    # Create adversarial example: one signed gradient step of size epsilon
    perturbed_data = data + epsilon * data_grad.sign()
    
    # Clamp to valid range [0, 1]
    perturbed_data = torch.clamp(perturbed_data, 0, 1)
    
    return perturbed_data.detach()

⚙️ FGSM Parameters:

  • ε (epsilon): Controls attack strength (0.01-0.3 typical)
  • Loss function: Cross-entropy most common
  • Gradient direction: Sign function for L∞ norm
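
As a quick usage sketch (assuming a trained model, a test_loader that yields inputs already scaled to [0, 1], and the fgsm_attack function above), the snippet below measures how accuracy drops as ε grows; all names are placeholders for your own setup.

def evaluate_fgsm(model, test_loader, epsilons=(0.0, 0.05, 0.1, 0.3), device="cpu"):
    """Report accuracy under FGSM for several epsilon values (epsilon=0.0 gives clean accuracy)."""
    model.eval()
    for eps in epsilons:
        correct, total = 0, 0
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            adv = fgsm_attack(model, data, target, eps)
            with torch.no_grad():
                pred = model(adv).argmax(dim=1)
            correct += (pred == target).sum().item()
            total += target.size(0)
        print(f"epsilon={eps:.2f}  accuracy={correct / total:.3f}")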

🔄 Projected Gradient Descent (PGD)

Iterative Refinement

PGD is an iterative version of FGSM that applies multiple small steps to find stronger adversarial examples within a specified norm ball.

📐 PGD Algorithm:

x_0 = x + uniform_noise(-ε, ε)
For i = 1 to num_iterations:
    x_i = x_{i-1} + α * sign(∇_x J(θ, x_{i-1}, y))
    x_i = project_to_ball(x_i, x, ε)

Implementation Example

def pgd_attack(model, data, target, epsilon, alpha, num_iter):
    """
    Projected Gradient Descent Attack
    
    Args:
        model: The model to attack
        data: Input data (batch)
        target: True labels
        epsilon: Maximum perturbation (L∞ norm)
        alpha: Step size per iteration
        num_iter: Number of iterations
    
    Returns:
        adversarial_data: Perturbed inputs
    """
    # Initialize with random noise inside the epsilon ball
    perturbed_data = data + torch.empty_like(data).uniform_(-epsilon, epsilon)
    perturbed_data = torch.clamp(perturbed_data, 0, 1)
    
    for i in range(num_iter):
        # Re-attach gradients to the current iterate
        perturbed_data = perturbed_data.clone().detach().requires_grad_(True)
        
        # Forward pass (cross-entropy on raw logits)
        output = model(perturbed_data)
        loss = F.cross_entropy(output, target)
        
        # Compute gradients w.r.t. the current iterate
        model.zero_grad()
        loss.backward()
        
        # Update perturbation
        with torch.no_grad():
            perturbed_data = perturbed_data + alpha * perturbed_data.grad.sign()
            
            # Project back to the epsilon ball around the original input
            delta = torch.clamp(perturbed_data - data, -epsilon, epsilon)
            perturbed_data = data + delta
            
            # Clamp to valid range [0, 1]
            perturbed_data = torch.clamp(perturbed_data, 0, 1)
    
    return perturbed_data.detach()

⚙️ PGD Parameters:

  • ε (epsilon): Maximum perturbation (0.01-0.3)
  • α (alpha): Step size (typically ε/4)
  • num_iter: Number of iterations (10-40 typical)
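
A short usage sketch, assuming the same model, data, and target tensors as above: the step size and iteration count are chosen so that α × num_iter comfortably exceeds ε, which lets the iterate reach any point of the ε-ball.

epsilon = 0.03
alpha = epsilon / 4   # small step per iteration
num_iter = 20         # alpha * num_iter = 0.15 > epsilon

adv = pgd_attack(model, data, target, epsilon, alpha, num_iter)
with torch.no_grad():
    fooled = (model(adv).argmax(dim=1) != target).float().mean().item()
print(f"PGD success rate: {fooled:.2%}")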

🎯 Carlini & Wagner (C&W) Attack

Optimization-Based Approach

The C&W attack formulates adversarial example generation as an optimization problem and often finds smaller perturbations than FGSM or PGD.

📐 C&W Objective Function:

minimize ||δ||_p + c * f(x + δ)

subject to:
- x + δ ∈ [0, 1]^n
- x + δ ∈ valid_input_space

Where f(x + δ) is a surrogate objective that encourages misclassification; a common untargeted choice is f(x') = max(Z(x')_y - max_{i≠y} Z(x')_i, -κ), with Z(·) the logits, y the true label, and κ the confidence margin.

Implementation Example

def cw_attack(model, data, target, c=1.0, kappa=0, max_iter=1000, 
              binary_search_steps=9, learning_rate=0.01):
    """
    Carlini & Wagner L2 Attack (simplified, untargeted)
    
    Note: the full C&W attack optimizes in tanh space so that x + delta stays
    in [0, 1] throughout; this sketch only clamps the final result.
    
    Args:
        model: The model to attack
        data: Input data (batch)
        target: True labels
        c: Initial constant for the binary search
        kappa: Confidence margin
        max_iter: Maximum optimization iterations per search step
        binary_search_steps: Binary search steps for c
        learning_rate: Adam optimizer learning rate
    
    Returns:
        adversarial_data: Perturbed inputs
    """
    # Binary search bounds for c (a full implementation would also grow the
    # upper bound when no value in [c_low, c_high] succeeds)
    c_low = 0.0
    c_high = c
    
    # Initialize the perturbation
    delta = torch.zeros_like(data, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=learning_rate)
    
    success = torch.zeros(data.shape[0], device=data.device)
    
    for search_step in range(binary_search_steps):
        c = (c_low + c_high) / 2
        
        for iteration in range(max_iter):
            optimizer.zero_grad()
            
            # Forward pass
            perturbed_data = data + delta
            output = model(perturbed_data)
            
            # Logit of the true class vs. the best other class
            target_score = output.gather(1, target.unsqueeze(1)).squeeze(1)
            one_hot = F.one_hot(target, output.shape[1]).bool()
            max_other_score = output.masked_fill(one_hot, float('-inf')).max(dim=1)[0]
            
            # Untargeted C&W objective: positive while the true class still wins
            f = torch.clamp(target_score - max_other_score + kappa, min=0.0)
            
            # Total loss: L2 perturbation size + weighted misclassification term
            loss = torch.sum(delta**2) + c * torch.sum(f)
            
            loss.backward()
            optimizer.step()
            
            # Check if the attack succeeded
            with torch.no_grad():
                prediction = torch.argmax(model(data + delta), dim=1)
                success = (prediction != target).float()
                
                if torch.all(success):
                    c_high = c  # try a smaller c for a smaller perturbation
                    break
        
        if not torch.all(success):
            c_low = c  # need a larger c to force misclassification
    
    return torch.clamp(data + delta.detach(), 0, 1)

⚙️ C&W Parameters:

  • c: Trade-off constant between perturbation size and the misclassification term (0.1-10)
  • κ (kappa): Confidence margin (0-10)
  • max_iter: Optimization iterations (1000)
  • learning_rate: Adam learning rate (0.01)
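
A brief usage sketch under the same assumptions as before (a trained model and a batch data, target with inputs in [0, 1]); the iteration count is reduced here only to keep the example fast.

adv = cw_attack(model, data, target, c=1.0, kappa=0, max_iter=200)
with torch.no_grad():
    fooled = (model(adv).argmax(dim=1) != target).float().mean().item()
l2 = (adv - data).flatten(1).norm(p=2, dim=1).mean().item()
print(f"C&W success rate: {fooled:.2%}, mean L2 perturbation: {l2:.4f}")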

🧪 Hands-On Exercise

Exercise: Implement and Compare Evasion Attacks

Objective: Implement FGSM, PGD, and C&W attacks on a pre-trained model and compare their effectiveness.

📋 Steps:

  1. Setup Environment

    Install required libraries and load a pre-trained model:

    pip install torch torchvision tensorboard
    pip install cleverhans  # Adversarial attack library
    
    # Load a pre-trained model (newer torchvision versions use
    # models.resnet50(weights=models.ResNet50_Weights.DEFAULT) instead of pretrained=True)
    import torch
    import torchvision.models as models
    model = models.resnet50(pretrained=True)
    model.eval()

  2. Prepare Test Data

    Load and preprocess test images:

    from torchvision import datasets, transforms
    
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], 
                            std=[0.229, 0.224, 0.225])
    ])
    
    # torchvision's ImageNet dataset expects the archives to already be present under root
    test_dataset = datasets.ImageNet(root='./data', split='val', transform=transform)
    test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=1)

  3. Implement Attack Comparison

    Create a function to compare all three attacks. Note that the attack implementations assume inputs in [0, 1], while the transform in step 2 normalizes images; either attack before normalization or fold the normalization into the model (see the sketch after these steps):

    def compare_attacks(model, data, target, epsilon=0.03):
        """
        Compare FGSM, PGD, and C&W attacks
        """
        results = {}
        
        # FGSM Attack
        fgsm_adv = fgsm_attack(model, data, target, epsilon)
        fgsm_pred = torch.argmax(model(fgsm_adv), dim=1)
        results['FGSM'] = {
            'success': (fgsm_pred != target).float().mean().item(),
            'perturbation': torch.norm(fgsm_adv - data, p=float('inf')).item()
        }
        
        # PGD Attack
        pgd_adv = pgd_attack(model, data, target, epsilon, epsilon/4, 20)
        pgd_pred = torch.argmax(model(pgd_adv), dim=1)
        results['PGD'] = {
            'success': (pgd_pred != target).float().mean().item(),
            'perturbation': torch.norm(pgd_adv - data, p=float('inf')).item()
        }
        
        # C&W Attack
        cw_adv = cw_attack(model, data, target)
        cw_pred = torch.argmax(model(cw_adv), dim=1)
        results['C&W'] = {
            'success': (cw_pred != target).float().mean().item(),
            'perturbation': torch.norm(cw_adv - data, p=2).item()
        }
        
        return results
  4. Visualize Results

    Create visualizations comparing attack effectiveness:

    import matplotlib.pyplot as plt
    
    def visualize_attacks(original, fgsm_adv, pgd_adv, cw_adv):
        """
        Visualize original and adversarial examples
        """
        fig, axes = plt.subplots(1, 4, figsize=(16, 4))
        
        # Denormalize for visualization
        def denormalize(tensor):
            mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
            std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)
            return tensor * std + mean
        
        images = [original, fgsm_adv, pgd_adv, cw_adv]
        titles = ['Original', 'FGSM', 'PGD', 'C&W']
        
        for i, (img, title) in enumerate(zip(images, titles)):
            img_denorm = torch.clamp(denormalize(img.squeeze()), 0, 1)
            axes[i].imshow(img_denorm.permute(1, 2, 0))
            axes[i].set_title(title)
            axes[i].axis('off')
        
        plt.tight_layout()
        plt.show()
                                
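
The attack implementations in this lesson clamp images to [0, 1], so feeding them tensors normalized with the ImageNet mean and std (as in step 2) distorts the meaning of ε. As referenced in step 3, one convenient pattern is to drop the normalization from the dataset transform and fold it into the model instead; the sketch below is one possible way to do this, not part of the lesson's reference code.

import torch
import torch.nn as nn

class Normalize(nn.Module):
    """Applies ImageNet normalization inside the model so attacks see [0, 1] inputs."""
    def __init__(self, mean, std):
        super().__init__()
        self.register_buffer("mean", torch.tensor(mean).view(1, 3, 1, 1))
        self.register_buffer("std", torch.tensor(std).view(1, 3, 1, 1))

    def forward(self, x):
        return (x - self.mean) / self.std

# Hypothetical usage: model is the pre-trained resnet50 from step 1
wrapped_model = nn.Sequential(
    Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    model,
)
wrapped_model.eval()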

📄 Deliverables:

  • Working implementation of all three attacks
  • Attack success rate comparison
  • Perturbation magnitude analysis
  • Visual comparison of adversarial examples
  • Performance benchmark results

📊 Knowledge Check

Question 1: What is the main difference between FGSM and PGD attacks?

Question 2: Which attack method typically produces the smallest perturbations?

Question 3: What does the epsilon parameter control in FGSM and PGD attacks?