🚪 Lesson 4: Backdoor Attacks
Learn how backdoor attacks embed hidden functionality in machine learning models through trigger patterns
Learning Objectives
By the end of this lesson, you will be able to:
- Understand backdoor attack mechanisms
- Implement trigger pattern generation
- Create backdoored models
- Test backdoor effectiveness
- Detect backdoor attacks
- Implement defense mechanisms
🚪 Understanding Backdoor Attacks
What are Backdoor Attacks?
Backdoor attacks embed hidden functionality in machine learning models that can be triggered by specific input patterns, causing the model to produce attacker-desired outputs while maintaining normal performance on clean inputs.
🎯 Key Components:
- Trigger Pattern: A specific input modification, such as a small pixel patch, that activates the backdoor (see the sketch after this list)
- Target Label: The output the attacker wants whenever the trigger is present
- Clean Performance: The model behaves normally on inputs that do not contain the trigger
- Stealth: The attack remains undetected during training
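As a rough sketch (the patch size, target class, and poisoning rate below are illustrative assumptions, not prescribed values), these components map directly onto a few tensors and scalars in PyTorch:

```python
import torch

# Hypothetical trigger for single-channel 28x28 images (e.g., MNIST):
# a 10x10 white patch that will be stamped into one corner of each poisoned image.
trigger_pattern = torch.ones(1, 10, 10)

target_label = 7     # class the model should predict whenever the trigger is present
poison_ratio = 0.1   # fraction of training samples to poison (stealth vs. strength trade-off)
```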
🔧 Implementation Example
Basic Backdoor Implementation
```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
import numpy as np


class BackdoorAttack:
    def __init__(self, trigger_pattern, target_label, poison_ratio=0.1):
        self.trigger_pattern = trigger_pattern
        self.target_label = target_label
        self.poison_ratio = poison_ratio

    def add_trigger(self, image):
        """Add the trigger pattern to an image (expects a C x H x W tensor)."""
        triggered_image = image.clone()
        # Stamp the trigger (e.g., a small patch) into the bottom-right corner
        triggered_image[:, -10:, -10:] = self.trigger_pattern
        return triggered_image

    def create_poisoned_data(self, dataset):
        """Create poisoned training data: a random subset of samples receives
        the trigger and has its label flipped to the target label."""
        poisoned_samples = []
        n_samples = len(dataset)
        n_poison = int(n_samples * self.poison_ratio)
        poison_indices = set(np.random.choice(n_samples, n_poison, replace=False))

        for i, (image, label) in enumerate(dataset):
            if i in poison_indices:
                # Add the trigger and change the label to the target class
                poisoned_image = self.add_trigger(image)
                poisoned_samples.append((poisoned_image, self.target_label))
            else:
                poisoned_samples.append((image, label))
        return poisoned_samples

    def test_backdoor(self, model, test_dataset):
        """Return (clean accuracy, attack success rate) on the test set."""
        model.eval()
        clean_correct = 0
        backdoor_correct = 0
        total = len(test_dataset)

        with torch.no_grad():
            for image, label in test_dataset:
                # Clean accuracy: prediction on the unmodified input
                clean_pred = torch.argmax(model(image.unsqueeze(0)), dim=1)
                if clean_pred.item() == label:
                    clean_correct += 1

                # Attack success rate: prediction on the triggered input
                triggered_image = self.add_trigger(image)
                backdoor_pred = torch.argmax(model(triggered_image.unsqueeze(0)), dim=1)
                if backdoor_pred.item() == self.target_label:
                    backdoor_correct += 1

        return clean_correct / total, backdoor_correct / total
```
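A brief usage sketch, assuming MNIST-style single-channel images; `model`, `train_dataset`, and `test_dataset` are placeholders you are assumed to have defined and trained elsewhere:

```python
# Hypothetical usage: white 10x10 patch trigger, target class 7, 10% of training data poisoned
attack = BackdoorAttack(trigger_pattern=torch.ones(1, 10, 10), target_label=7, poison_ratio=0.1)

poisoned_train = attack.create_poisoned_data(train_dataset)  # train `model` on this data
clean_acc, attack_success = attack.test_backdoor(model, test_dataset)
print(f"Clean accuracy: {clean_acc:.2%} | Attack success rate: {attack_success:.2%}")
```

A successful backdoor keeps clean accuracy close to the unpoisoned baseline while pushing the attack success rate toward 100%.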
🧪 Hands-On Exercise
Exercise: Implement and Test Backdoor Attacks
Objective: Create a backdoored model and test its effectiveness; a minimal end-to-end sketch follows the steps below.
Steps:
- Setup Environment - Load MNIST dataset and create model
- Implement Backdoor - Create trigger pattern and poisoned data
- Train Model - Train model on poisoned dataset
- Test Effectiveness - Measure clean vs backdoor performance
- Analyze Results - Compare attack success rates
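The sketch below is one possible way to work through these steps end to end. It reuses the BackdoorAttack class from this lesson; the linear model, hyperparameters, trigger, and target class are illustrative assumptions rather than required choices.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Step 1: Setup - load MNIST and define a small (illustrative) classifier
transform = transforms.ToTensor()
train_data = datasets.MNIST("./data", train=True, download=True, transform=transform)
test_data = datasets.MNIST("./data", train=False, download=True, transform=transform)
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

# Step 2: Implement the backdoor - white 10x10 patch, target class 7, 10% poisoning
attack = BackdoorAttack(trigger_pattern=torch.ones(1, 10, 10), target_label=7, poison_ratio=0.1)
poisoned_train = attack.create_poisoned_data(train_data)
loader = DataLoader(poisoned_train, batch_size=64, shuffle=True)

# Step 3: Train the model on the poisoned dataset
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
for epoch in range(3):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# Steps 4-5: Test effectiveness and analyze - clean accuracy vs. attack success rate
clean_acc, asr = attack.test_backdoor(model, test_data)
print(f"Clean accuracy: {clean_acc:.2%} | Attack success rate: {asr:.2%}")
```

If the attack works as described above, clean accuracy should remain close to that of an unpoisoned model while the attack success rate approaches 100%; that gap is what the final analysis step asks you to compare.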