Lesson 2: Poisoning Attacks
Master data poisoning attacks, backdoor attacks, and model poisoning techniques
Learning Objectives
By the end of this lesson, you will be able to:
- Understand different types of poisoning attacks
- Implement data poisoning techniques
- Create and detect backdoor attacks
- Execute model poisoning attacks
- Evaluate poisoning attack effectiveness
- Implement defense mechanisms against poisoning
Understanding Poisoning Attacks
What are Poisoning Attacks?
Poisoning attacks occur during the training phase, when attackers inject malicious data or modify the training process to compromise the model's behavior. Unlike evasion attacks, which target inference, poisoning attacks target the learning process itself.
Key Characteristics:
- Training-time Attack: Occurs during model training
- Persistence: Effects persist after training
- Stealth: Often undetectable during training
- Targeted Impact: Can affect specific predictions or general performance
Attack Taxonomy
By Attack Goal
- Availability Attacks: Degrade model performance
- Integrity Attacks: Cause specific misclassifications
- Backdoor Attacks: Trigger hidden functionality
By Attack Vector
- Data Poisoning: Inject malicious training data
- Model Poisoning: Compromise training process
- Federated Learning Poisoning: Submit malicious updates in distributed training
Data Poisoning Attacks
Label Flipping Attack
Label flipping attacks involve changing the labels of training data to cause the model to learn incorrect associations.
Label Flipping Strategy:

```
For each selected sample (x_i, y_i):
    if y_i == target_class:
        y_i = wrong_class
    else:
        y_i = target_class
```
Implementation Example
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def label_flipping_attack(X_train, y_train, flip_ratio=0.1, target_class=0):
    """
    Perform a label flipping attack on training data.

    Args:
        X_train: Training features
        y_train: Training labels
        flip_ratio: Fraction of samples to poison
        target_class: Class to target for flipping

    Returns:
        poisoned_X: Original features (unchanged)
        poisoned_y: Poisoned labels
    """
    poisoned_y = y_train.copy()
    n_samples = len(y_train)
    n_flip = int(n_samples * flip_ratio)

    # Select random samples to flip
    flip_indices = np.random.choice(n_samples, n_flip, replace=False)

    for idx in flip_indices:
        if poisoned_y[idx] == target_class:
            # Flip to a different class
            available_classes = [c for c in np.unique(y_train) if c != target_class]
            poisoned_y[idx] = np.random.choice(available_classes)
        else:
            # Flip to the target class
            poisoned_y[idx] = target_class

    return X_train, poisoned_y


# Example usage
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Perform label flipping attack
poisoned_X, poisoned_y = label_flipping_attack(
    X_train, y_train, flip_ratio=0.15, target_class=0
)
```
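A quick way to see the attack's impact is to train the same classifier on the clean and the flipped labels and compare test accuracy. The sketch below reuses the `X_train`, `poisoned_y`, and test arrays from the example above; the choice of a random forest is illustrative, not prescribed by the lesson.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train one model on clean labels and one on flipped labels
clean_model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
poisoned_model = RandomForestClassifier(random_state=42).fit(X_train, poisoned_y)

clean_acc = accuracy_score(y_test, clean_model.predict(X_test))
poisoned_acc = accuracy_score(y_test, poisoned_model.predict(X_test))

print(f"Clean accuracy:    {clean_acc:.3f}")
print(f"Poisoned accuracy: {poisoned_acc:.3f}")
print(f"Accuracy drop:     {clean_acc - poisoned_acc:.3f}")
```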
Feature Poisoning Attack
Feature poisoning involves modifying the feature values of training data while keeping labels intact.
```python
def feature_poisoning_attack(X_train, y_train, poison_ratio=0.1, noise_std=0.1):
    """
    Perform a feature poisoning attack by adding noise to features.

    Args:
        X_train: Training features
        y_train: Training labels
        poison_ratio: Fraction of samples to poison
        noise_std: Standard deviation of the Gaussian noise to add

    Returns:
        poisoned_X: Poisoned features
        poisoned_y: Original labels (unchanged)
    """
    poisoned_X = X_train.copy()
    n_samples = len(X_train)
    n_poison = int(n_samples * poison_ratio)

    # Select random samples to poison
    poison_indices = np.random.choice(n_samples, n_poison, replace=False)

    # Add Gaussian noise to the selected samples
    for idx in poison_indices:
        noise = np.random.normal(0, noise_std, X_train.shape[1])
        poisoned_X[idx] += noise

    return poisoned_X, y_train
```
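A minimal usage sketch, reusing the Iris split defined earlier; the poison ratio and noise level here are arbitrary choices for illustration.

```python
# Poison 20% of the training features with relatively strong noise
noisy_X, noisy_y = feature_poisoning_attack(
    X_train, y_train, poison_ratio=0.2, noise_std=0.5
)

# How much did the features move on average?
print("Mean absolute feature change:", np.abs(noisy_X - X_train).mean())
```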
Backdoor Attacks
Understanding Backdoor Attacks
Backdoor attacks embed hidden functionality in models that can be triggered by specific input patterns (triggers) during inference, causing the model to produce attacker-desired outputs.
Backdoor Components:
- Trigger Pattern: Specific input modification
- Target Label: Desired output when trigger is present
- Poisoned Data: Training samples with triggers
- Clean Performance: Model works normally without trigger
Implementation Example
```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms


class BackdoorAttack:
    def __init__(self, trigger_pattern, target_label, poison_ratio=0.1):
        """
        Initialize the backdoor attack.

        Args:
            trigger_pattern: Pattern to add to images (e.g., a small patch)
            target_label: Label to assign when the trigger is present
            poison_ratio: Fraction of training data to poison
        """
        self.trigger_pattern = trigger_pattern
        self.target_label = target_label
        self.poison_ratio = poison_ratio

    def create_poisoned_data(self, dataset):
        """
        Create poisoned training data containing backdoor triggers.

        Args:
            dataset: Original training dataset

        Returns:
            poisoned_samples: List of (image, label) pairs including backdoor samples
        """
        poisoned_samples = []
        n_samples = len(dataset)
        n_poison = int(n_samples * self.poison_ratio)

        # Select random samples to poison
        poison_indices = set(np.random.choice(n_samples, n_poison, replace=False))

        for i, (image, label) in enumerate(dataset):
            if i in poison_indices:
                # Add the trigger and relabel to the target class
                poisoned_image = self.add_trigger(image)
                poisoned_samples.append((poisoned_image, self.target_label))
            else:
                # Keep the original sample
                poisoned_samples.append((image, label))

        return poisoned_samples

    def add_trigger(self, image):
        """
        Stamp the trigger pattern onto the bottom-right corner of an image.

        Args:
            image: Original image tensor, shaped (C, H, W) or (H, W)

        Returns:
            triggered_image: Image with the trigger pattern applied
        """
        triggered_image = image.clone()
        # Patch size is taken from the trigger pattern itself
        h, w = self.trigger_pattern.shape[-2], self.trigger_pattern.shape[-1]

        if len(image.shape) == 3:  # (C, H, W) image
            triggered_image[:, -h:, -w:] = self.trigger_pattern
        else:  # (H, W) grayscale image
            triggered_image[-h:, -w:] = self.trigger_pattern

        return triggered_image

    def test_backdoor(self, model, test_dataset):
        """
        Test backdoor effectiveness on clean and triggered data.

        Args:
            model: Trained model
            test_dataset: Test dataset

        Returns:
            clean_accuracy: Accuracy on clean data
            backdoor_success: Fraction of triggered inputs classified as the target label
        """
        model.eval()
        clean_correct = 0
        backdoor_correct = 0
        total = len(test_dataset)

        with torch.no_grad():
            for image, label in test_dataset:
                # Test on the clean input
                clean_pred = torch.argmax(model(image.unsqueeze(0)), dim=1)
                if clean_pred.item() == label:
                    clean_correct += 1

                # Test on the triggered input
                # (samples already belonging to the target class are not excluded here, for simplicity)
                triggered_image = self.add_trigger(image)
                backdoor_pred = torch.argmax(model(triggered_image.unsqueeze(0)), dim=1)
                if backdoor_pred.item() == self.target_label:
                    backdoor_correct += 1

        clean_accuracy = clean_correct / total
        backdoor_success = backdoor_correct / total
        return clean_accuracy, backdoor_success


# Example usage
trigger_pattern = torch.ones(3, 10, 10)  # White 10x10 patch for an RGB image
backdoor_attack = BackdoorAttack(
    trigger_pattern=trigger_pattern,
    target_label=0,      # Always predict class 0 when the trigger is present
    poison_ratio=0.1
)
```
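The attack object is typically used in two stages: poison the training set, train on it as usual, then measure both clean accuracy and trigger success. A minimal sketch of that flow is shown below; `train_dataset`, `test_dataset`, and `train_model` are placeholders that are not defined in this lesson.

```python
# Hypothetical end-to-end flow (train_dataset, test_dataset, and
# train_model are placeholders for your own data and training loop)
poisoned_train = backdoor_attack.create_poisoned_data(train_dataset)
model = train_model(poisoned_train)  # any standard training loop

clean_acc, trigger_success = backdoor_attack.test_backdoor(model, test_dataset)
print(f"Clean accuracy: {clean_acc:.3f}, trigger success rate: {trigger_success:.3f}")
```

A successful backdoor keeps clean accuracy close to that of an unpoisoned model while driving the trigger success rate toward 1.0.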
Model Poisoning in Federated Learning
Federated Learning Vulnerabilities
In federated learning, malicious participants can poison the global model by submitting crafted model updates during the aggregation process.
Model Poisoning Strategy:

```
# Malicious client update
w_malicious = w_global + β * (w_target - w_global)
```

Where:
- w_global: Current global model weights
- w_target: Attacker's desired model weights
- β: Poisoning strength parameter
Implementation Example
```python
class FederatedPoisoning:
    def __init__(self, target_model_weights, poisoning_strength=1.0):
        """
        Initialize the federated learning poisoning attack.

        Args:
            target_model_weights: The attacker's desired model weights
            poisoning_strength: Strength of poisoning (the β parameter)
        """
        self.target_weights = target_model_weights
        self.poisoning_strength = poisoning_strength

    def poison_model_update(self, global_weights):
        """
        Create a poisoned model update for federated learning.

        Args:
            global_weights: Current global model weights

        Returns:
            poisoned_weights: Malicious model weights
        """
        poisoned_weights = {}
        for layer_name in global_weights:
            global_w = global_weights[layer_name]
            target_w = self.target_weights[layer_name]
            # Apply the poisoning formula: w_malicious = w_global + β * (w_target - w_global)
            poisoned_weights[layer_name] = (
                global_w + self.poisoning_strength * (target_w - global_w)
            )
        return poisoned_weights

    def adaptive_poisoning(self, global_weights, round_number, max_rounds):
        """
        Adaptive poisoning whose strength changes over training rounds.

        Args:
            global_weights: Current global model weights
            round_number: Current training round
            max_rounds: Total number of rounds

        Returns:
            poisoned_weights: Malicious model weights
        """
        # Gradually increase the poisoning strength
        adaptive_strength = self.poisoning_strength * (round_number / max_rounds)

        poisoned_weights = {}
        for layer_name in global_weights:
            global_w = global_weights[layer_name]
            target_w = self.target_weights[layer_name]
            poisoned_weights[layer_name] = (
                global_w + adaptive_strength * (target_w - global_w)
            )
        return poisoned_weights


# Example usage for federated learning poisoning
def simulate_federated_attack():
    """
    Simulate federated learning with a malicious client.

    Note: create_model() is a placeholder for whatever model constructor
    the federated setup uses; it is not defined in this lesson.
    """
    # Initialize the global model
    global_model = create_model()
    global_weights = global_model.state_dict()

    # Attacker's target model
    target_model = create_model()
    # Modify the target model to have backdoor behavior (details omitted here)
    target_weights = target_model.state_dict()

    # Initialize the poisoning attack
    poison_attack = FederatedPoisoning(target_weights, poisoning_strength=0.5)

    # Simulate multiple rounds
    for round_num in range(10):
        # Normal clients send legitimate updates
        # ... (simulate normal updates)

        # The malicious client sends a poisoned update
        if round_num % 3 == 0:  # Attack every 3rd round
            poisoned_weights = poison_attack.poison_model_update(global_weights)
            # In real federated learning, this update would be sent to the server
            print(f"Round {round_num}: malicious update sent")

        # Server aggregates updates (simplified)
        # global_weights = aggregate_updates(all_client_updates)
```
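The simulation above leaves the server-side aggregation as a comment. For reference, here is a minimal sketch of the uniform FedAvg-style averaging that an `aggregate_updates` helper might perform; the function signature and equal client weighting are assumptions, not part of the lesson's code.

```python
import torch

def aggregate_updates(client_weight_dicts):
    """Uniform FedAvg-style averaging of client state_dicts (illustrative sketch)."""
    aggregated = {}
    for layer_name in client_weight_dicts[0]:
        # Stack each client's tensor for this layer and average elementwise
        stacked = torch.stack([w[layer_name].float() for w in client_weight_dicts])
        aggregated[layer_name] = stacked.mean(dim=0)
    return aggregated
```

Because plain averaging trusts every client equally, a single poisoned update scaled by a large β can dominate a round, which is why robust aggregation rules (coordinate-wise median, trimmed mean, norm clipping) are common defenses.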
Hands-On Exercise
Exercise: Implement and Evaluate Poisoning Attacks
Objective: Implement data poisoning and backdoor attacks, and evaluate their effectiveness on different models.
Steps:

- Setup Environment

Prepare datasets and models for poisoning experiments:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load datasets
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
mnist_train = datasets.MNIST('./data', train=True, download=True, transform=transform)
mnist_test = datasets.MNIST('./data', train=False, transform=transform)
```
- Implement Data Poisoning

Create poisoned datasets with different attack strategies:

```python
def evaluate_poisoning_attack(model, clean_data, poisoned_data, test_data):
    """
    Evaluate the effectiveness of a poisoning attack.

    Args:
        model: Model to test
        clean_data: Clean training data (X, y)
        poisoned_data: Poisoned training data (X, y)
        test_data: Test data (X, y) for evaluation

    Returns:
        results: Dictionary with evaluation metrics

    Note: train_model and evaluate_model are helper functions you write as
    part of this exercise (fit the model, then return its test accuracy).
    """
    # Train a model on clean data
    clean_model = train_model(model, clean_data)
    clean_accuracy = evaluate_model(clean_model, test_data)

    # Train a model on poisoned data
    poisoned_model = train_model(model, poisoned_data)
    poisoned_accuracy = evaluate_model(poisoned_model, test_data)

    # Calculate attack effectiveness
    performance_drop = clean_accuracy - poisoned_accuracy

    return {
        'clean_accuracy': clean_accuracy,
        'poisoned_accuracy': poisoned_accuracy,
        'performance_drop': performance_drop,
        'attack_success': performance_drop > 0.05  # 5% accuracy-drop threshold
    }


# Test different poisoning ratios
poisoning_ratios = [0.05, 0.1, 0.2, 0.3]
results = []

for ratio in poisoning_ratios:
    # Create poisoned data
    poisoned_X, poisoned_y = label_flipping_attack(
        X_train, y_train, flip_ratio=ratio, target_class=0
    )

    # Evaluate the attack
    result = evaluate_poisoning_attack(
        RandomForestClassifier(),
        (X_train, y_train),
        (poisoned_X, poisoned_y),
        (X_test, y_test)
    )
    result['poison_ratio'] = ratio
    results.append(result)
```
- Implement Backdoor Attack

Create and test backdoor attacks on neural networks:

```python
import torch.nn.functional as F


class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout(0.5)  # plain Dropout: the input here is already flattened
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train_backdoor_model(model, clean_data, backdoor_attack, epochs=5):
    """
    Train a model on backdoor-poisoned data.

    Args:
        model: Model to train
        clean_data: Clean training data
        backdoor_attack: BackdoorAttack object
        epochs: Number of training epochs

    Returns:
        trained_model: Model trained with the backdoor
    """
    model.train()
    optimizer = optim.Adam(model.parameters())

    # Create the poisoned dataset
    poisoned_data = backdoor_attack.create_poisoned_data(clean_data)

    # Simple sample-by-sample loop; a batched DataLoader would be faster in practice
    for epoch in range(epochs):
        for data, target in poisoned_data:
            optimizer.zero_grad()
            output = model(data.unsqueeze(0))
            loss = F.nll_loss(output, torch.tensor([target]))
            loss.backward()
            optimizer.step()

    return model
```
- Evaluate Attack Effectiveness

Compare different attack methods and measure their impact:

```python
def comprehensive_evaluation():
    """
    Comprehensive evaluation of all poisoning attacks.
    """
    results = {}

    # 1. Data Poisoning Evaluation
    print("Evaluating Data Poisoning Attacks...")
    poisoning_results = []

    for ratio in [0.05, 0.1, 0.2]:
        poisoned_X, poisoned_y = label_flipping_attack(
            X_train, y_train, flip_ratio=ratio
        )
        result = evaluate_poisoning_attack(
            RandomForestClassifier(),
            (X_train, y_train),
            (poisoned_X, poisoned_y),
            (X_test, y_test)
        )
        result['attack_type'] = 'label_flipping'
        result['poison_ratio'] = ratio
        poisoning_results.append(result)

    results['data_poisoning'] = poisoning_results

    # 2. Backdoor Attack Evaluation
    print("Evaluating Backdoor Attacks...")
    model = SimpleCNN()
    trigger = torch.ones(1, 4, 4)  # 4x4 white patch
    backdoor = BackdoorAttack(trigger, target_label=0, poison_ratio=0.1)

    # Train with the backdoor
    backdoor_model = train_backdoor_model(model, mnist_train, backdoor)

    # Test backdoor effectiveness
    clean_acc, backdoor_success = backdoor.test_backdoor(backdoor_model, mnist_test)

    results['backdoor'] = {
        'clean_accuracy': clean_acc,
        'backdoor_success_rate': backdoor_success,
        'stealth_score': 1 - abs(clean_acc - 0.9)  # How close to normal performance
    }

    return results


# Run comprehensive evaluation
evaluation_results = comprehensive_evaluation()
```
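To compare attacks at a glance, and as a starting point for the visualization deliverable below, you can plot test accuracy against poison ratio. A minimal sketch using matplotlib, assuming the `evaluation_results` dictionary produced above:

```python
import matplotlib.pyplot as plt

dp_results = evaluation_results['data_poisoning']
ratios = [r['poison_ratio'] for r in dp_results]
accuracies = [r['poisoned_accuracy'] for r in dp_results]

plt.plot(ratios, accuracies, marker='o', label='label flipping')
plt.axhline(dp_results[0]['clean_accuracy'], linestyle='--', label='clean baseline')
plt.xlabel('Poison ratio')
plt.ylabel('Test accuracy')
plt.title('Label flipping: accuracy vs. poison ratio')
plt.legend()
plt.show()
```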
Deliverables:
- Working implementations of all poisoning attack types
- Comparison of attack effectiveness across different models
- Analysis of stealth vs. effectiveness trade-offs
- Visualization of poisoned vs. clean data
- Defense mechanism recommendations
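For the last deliverable, one simple starting point against label flipping is to flag training samples whose labels disagree with their nearest neighbors. This is a minimal sketch of a k-NN label-agreement filter, not a method prescribed by the lesson; the values of k and the agreement threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def flag_suspicious_labels(X, y, k=5, agreement_threshold=0.4):
    """Flag samples whose label disagrees with most of their k nearest neighbors."""
    knn = KNeighborsClassifier(n_neighbors=k + 1).fit(X, y)
    # For each sample, look at its neighbors, excluding the sample itself (first column)
    neighbor_idx = knn.kneighbors(X, return_distance=False)[:, 1:]
    neighbor_labels = y[neighbor_idx]
    agreement = (neighbor_labels == y[:, None]).mean(axis=1)
    return np.where(agreement < agreement_threshold)[0]

# Example: flag likely flipped labels in the poisoned Iris data from earlier
suspicious = flag_suspicious_labels(np.asarray(X_train), np.asarray(poisoned_y))
print(f"Flagged {len(suspicious)} potentially poisoned samples")
```

Flagged samples can then be inspected, relabeled, or dropped before retraining, and the same clean-vs-poisoned accuracy comparison used above measures how much of the attack the filter removes.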