📚 Learning Objectives

By the end of this lesson, you will be able to:

  • Explain how backdoor attacks embed hidden, trigger-activated behavior in machine learning models
  • Describe the key components of a backdoor attack: trigger pattern, target label, clean performance, and stealth
  • Implement a basic data-poisoning backdoor attack against an image classifier
  • Evaluate a backdoored model by measuring clean accuracy and attack success rate
🚪 Understanding Backdoor Attacks

What are Backdoor Attacks?

Backdoor attacks embed hidden functionality in a machine learning model that is activated only by a specific trigger pattern in the input. When the trigger is present, the model produces an attacker-chosen output; on clean inputs it behaves normally, which is what makes the attack hard to detect.

🎯 Key Components:

  • Trigger Pattern: a specific input modification, such as a small pixel patch (see the sketch below)
  • Target Label: the output the attacker wants whenever the trigger is present
  • Clean Performance: the model behaves normally on inputs without the trigger
  • Stealth: the attack remains undetected during training and evaluation
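
As a concrete illustration, a trigger can be as simple as a small white patch stamped into one corner of the image. The snippet below uses hypothetical values and assumes single-channel 28x28 inputs such as MNIST.

import torch

# Hypothetical trigger: a 10x10 all-white patch for single-channel images
trigger_pattern = torch.ones(1, 10, 10)

# Attacker-chosen target class: any triggered input should be classified as this label
target_label = 0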

🔧 Implementation Example

Basic Backdoor Implementation

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
import numpy as np

class BackdoorAttack:
    def __init__(self, trigger_pattern, target_label, poison_ratio=0.1):
        self.trigger_pattern = trigger_pattern  # tensor stamped onto images, e.g. shape (C, 10, 10)
        self.target_label = target_label        # class the model should predict when the trigger is present
        self.poison_ratio = poison_ratio        # fraction of training samples to poison
    
    def add_trigger(self, image):
        """Add the trigger pattern to an image tensor of shape (C, H, W)."""
        triggered_image = image.clone()
        # Stamp the trigger into the bottom-right 10x10 corner of every channel
        triggered_image[:, -10:, -10:] = self.trigger_pattern
        return triggered_image
    
    def create_poisoned_data(self, dataset):
        """Create a poisoned copy of the training data."""
        poisoned_samples = []
        n_samples = len(dataset)
        n_poison = int(n_samples * self.poison_ratio)
        
        # Randomly pick which samples to poison; a set makes the membership test O(1)
        poison_indices = set(np.random.choice(n_samples, n_poison, replace=False).tolist())
        
        for i, (image, label) in enumerate(dataset):
            if i in poison_indices:
                # Poisoned sample: add the trigger and relabel it with the target class
                poisoned_image = self.add_trigger(image)
                poisoned_samples.append((poisoned_image, self.target_label))
            else:
                # Clean sample: keep image and label unchanged
                poisoned_samples.append((image, label))
        
        return poisoned_samples
    
    def test_backdoor(self, model, test_dataset):
        """Measure clean accuracy and the attack success rate (ASR)."""
        model.eval()
        clean_correct = 0
        backdoor_correct = 0
        backdoor_total = 0
        total = len(test_dataset)
        
        with torch.no_grad():
            for image, label in test_dataset:
                # Clean accuracy: the model should still classify unmodified inputs correctly
                clean_pred = torch.argmax(model(image.unsqueeze(0)), dim=1)
                if clean_pred.item() == label:
                    clean_correct += 1
                
                # Attack success rate: triggered inputs should be classified as the target label.
                # Samples whose true label already equals the target are skipped so they
                # do not inflate the success rate.
                if label == self.target_label:
                    continue
                triggered_image = self.add_trigger(image)
                backdoor_pred = torch.argmax(model(triggered_image.unsqueeze(0)), dim=1)
                if backdoor_pred.item() == self.target_label:
                    backdoor_correct += 1
                backdoor_total += 1
        
        return clean_correct / total, backdoor_correct / max(backdoor_total, 1)
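
A quick sanity check before any training is to stamp the trigger onto a single sample and confirm that only the chosen corner changes. The snippet below assumes an MNIST-style (1, 28, 28) input; the random tensor simply stands in for a real dataset sample.

import torch

attack = BackdoorAttack(trigger_pattern=torch.ones(1, 10, 10), target_label=0)

image = torch.rand(1, 28, 28)      # stand-in for a real dataset sample
triggered = attack.add_trigger(image)

# Only the bottom-right 10x10 patch should differ from the original image
changed = (triggered != image).nonzero()
print(changed[:, 1].min().item(), changed[:, 2].min().item())  # expected: 18 18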
                

🧪 Hands-On Exercise

Exercise: Implement and Test Backdoor Attacks

Objective: Create backdoored models and test their effectiveness.

📋 Steps:

  1. Setup Environment - Load the MNIST dataset and define a classifier
  2. Implement Backdoor - Define a trigger pattern and create the poisoned training data
  3. Train Model - Train the classifier on the poisoned dataset
  4. Test Effectiveness - Measure clean accuracy and the attack success rate
  5. Analyze Results - Compare clean performance against the attack success rate (see the sketch after this list)
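
The sketch below ties the steps together end to end. It assumes the BackdoorAttack class defined above; SimpleCNN is a hypothetical placeholder for any small MNIST classifier, and the hyperparameters (poison ratio, epochs, learning rate) are illustrative rather than prescriptive.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Step 1: load MNIST (each image becomes a (1, 28, 28) tensor in [0, 1])
transform = transforms.ToTensor()
train_set = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_set = datasets.MNIST("./data", train=False, download=True, transform=transform)

# Step 2: define the trigger and poison a fraction of the training data
attack = BackdoorAttack(trigger_pattern=torch.ones(1, 10, 10), target_label=0, poison_ratio=0.1)
poisoned_train = attack.create_poisoned_data(train_set)

# Step 3: train a classifier on the poisoned data
model = SimpleCNN()  # hypothetical placeholder: any small MNIST classifier works here
loader = DataLoader(poisoned_train, batch_size=64, shuffle=True)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):  # illustrative number of epochs
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# Steps 4-5: measure clean accuracy and attack success rate
clean_acc, attack_success = attack.test_backdoor(model, test_set)
print(f"Clean accuracy: {clean_acc:.3f}  Attack success rate: {attack_success:.3f}")

On a successfully backdoored model, clean accuracy should stay close to an un-poisoned baseline while the attack success rate approaches 1.0; a sharp drop in clean accuracy usually indicates that the poison ratio or trigger is too aggressive.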

📊 Knowledge Check

Question 1: What is the main characteristic of a successful backdoor attack?