📚 Learning Objectives

By the end of this lesson, you will be able to:

  • Explain how backdoor attacks embed hidden, trigger-activated behavior in machine learning models
  • Describe the key components of a backdoor attack: trigger pattern, target label, clean performance, and stealth
  • Implement a basic data-poisoning backdoor attack against an image classifier
  • Evaluate a backdoored model by measuring clean accuracy and attack success rate
🚪 Understanding Backdoor Attacks

What are Backdoor Attacks?

Backdoor attacks embed hidden functionality in a machine learning model that is activated only by a specific trigger pattern in the input. When the trigger is present, the model produces an attacker-chosen output; on clean inputs it behaves normally, which is what makes the attack hard to detect.

🎯 Key Components:

  • Trigger Pattern: a specific input modification, such as a small pixel patch (see the sketch below)
  • Target Label: the output the attacker wants whenever the trigger is present
  • Clean Performance: the model behaves normally on inputs without the trigger
  • Stealth: the attack remains undetected during training and evaluation
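
As a concrete illustration, a trigger can be as simple as a small white patch stamped into one corner of the image. The snippet below uses hypothetical values and assumes single-channel 28x28 inputs such as MNIST.

import torch

# Hypothetical trigger: a 10x10 all-white patch for single-channel images
trigger_pattern = torch.ones(1, 10, 10)

# Attacker-chosen target class: any triggered input should be classified as this label
target_label = 0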

🔧 Implementation Example

Basic Backdoor Implementation

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
import numpy as np

class BackdoorAttack:
    def __init__(self, trigger_pattern, target_label, poison_ratio=0.1):
        self.trigger_pattern = trigger_pattern  # tensor stamped onto images, e.g. shape (C, 10, 10)
        self.target_label = target_label        # class the model should predict when the trigger is present
        self.poison_ratio = poison_ratio        # fraction of training samples to poison
    
    def add_trigger(self, image):
        """Add the trigger pattern to an image tensor of shape (C, H, W)."""
        triggered_image = image.clone()
        # Stamp the trigger into the bottom-right 10x10 corner of every channel
        triggered_image[:, -10:, -10:] = self.trigger_pattern
        return triggered_image
    
    def create_poisoned_data(self, dataset):
        """Create a poisoned copy of the training data."""
        poisoned_samples = []
        n_samples = len(dataset)
        n_poison = int(n_samples * self.poison_ratio)
        
        # Randomly pick which samples to poison; a set makes the membership test O(1)
        poison_indices = set(np.random.choice(n_samples, n_poison, replace=False).tolist())
        
        for i, (image, label) in enumerate(dataset):
            if i in poison_indices:
                # Poisoned sample: add the trigger and relabel it with the target class
                poisoned_image = self.add_trigger(image)
                poisoned_samples.append((poisoned_image, self.target_label))
            else:
                # Clean sample: keep image and label unchanged
                poisoned_samples.append((image, label))
        
        return poisoned_samples
    
    def test_backdoor(self, model, test_dataset):
        """Measure clean accuracy and the attack success rate (ASR)."""
        model.eval()
        clean_correct = 0
        backdoor_correct = 0
        backdoor_total = 0
        total = len(test_dataset)
        
        with torch.no_grad():
            for image, label in test_dataset:
                # Clean accuracy: the model should still classify unmodified inputs correctly
                clean_pred = torch.argmax(model(image.unsqueeze(0)), dim=1)
                if clean_pred.item() == label:
                    clean_correct += 1
                
                # Attack success rate: triggered inputs should be classified as the target label.
                # Samples whose true label already equals the target are skipped so they
                # do not inflate the success rate.
                if label == self.target_label:
                    continue
                triggered_image = self.add_trigger(image)
                backdoor_pred = torch.argmax(model(triggered_image.unsqueeze(0)), dim=1)
                if backdoor_pred.item() == self.target_label:
                    backdoor_correct += 1
                backdoor_total += 1
        
        return clean_correct / total, backdoor_correct / max(backdoor_total, 1)
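
A quick sanity check before any training is to stamp the trigger onto a single sample and confirm that only the chosen corner changes. The snippet below assumes an MNIST-style (1, 28, 28) input; the random tensor simply stands in for a real dataset sample.

import torch

attack = BackdoorAttack(trigger_pattern=torch.ones(1, 10, 10), target_label=0)

image = torch.rand(1, 28, 28)      # stand-in for a real dataset sample
triggered = attack.add_trigger(image)

# Only the bottom-right 10x10 patch should differ from the original image
changed = (triggered != image).nonzero()
print(changed[:, 1].min().item(), changed[:, 2].min().item())  # expected: 18 18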
                

🧪 Hands-On Exercise

Exercise: Implement and Test Backdoor Attacks

Objective: Create backdoored models and test their effectiveness.

📋 Steps:

  1. Setup Environment - Load the MNIST dataset and define a classifier
  2. Implement Backdoor - Define a trigger pattern and create the poisoned training data
  3. Train Model - Train the classifier on the poisoned dataset
  4. Test Effectiveness - Measure clean accuracy and the attack success rate
  5. Analyze Results - Compare clean performance against the attack success rate (see the sketch after this list)
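
The sketch below ties the steps together end to end. It assumes the BackdoorAttack class defined above; SimpleCNN is a hypothetical placeholder for any small MNIST classifier, and the hyperparameters (poison ratio, epochs, learning rate) are illustrative rather than prescriptive.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Step 1: load MNIST (each image becomes a (1, 28, 28) tensor in [0, 1])
transform = transforms.ToTensor()
train_set = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("./data", train=False, download=True, transform=transform)

# Step 2: define the trigger and poison a fraction of the training data
attack = BackdoorAttack(trigger_pattern=torch.ones(1, 10, 10), target_label=0, poison_ratio=0.1)
poisoned_train = attack.create_poisoned_data(train_set)

# Step 3: train a classifier on the poisoned data
model = SimpleCNN()  # hypothetical placeholder: any small MNIST classifier works here
loader = DataLoader(poisoned_train, batch_size=64, shuffle=True)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):  # illustrative number of epochs
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# Steps 4-5: measure clean accuracy and attack success rate
clean_acc, attack_success = attack.test_backdoor(model, test_set)
print(f"Clean accuracy: {clean_acc:.3f}  Attack success rate: {attack_success:.3f}")

On a successfully backdoored model, clean accuracy should stay close to an un-poisoned baseline while the attack success rate approaches 1.0; a sharp drop in clean accuracy usually indicates that the poison ratio or trigger is too aggressive.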

📊 Knowledge Check

Question 1: What is the main characteristic of a successful backdoor attack?