Lesson 2: Poisoning Attacks
Master data poisoning attacks, backdoor attacks, and model poisoning techniques
Learning Objectives
By the end of this lesson, you will be able to:
- Understand different types of poisoning attacks
- Implement data poisoning techniques
- Create and detect backdoor attacks
- Execute model poisoning attacks
- Evaluate poisoning attack effectiveness
- Implement defense mechanisms against poisoning
Understanding Poisoning Attacks
What are Poisoning Attacks?
Poisoning attacks occur during the training phase, when attackers inject malicious data or modify the training process to compromise the model's behavior. Unlike evasion attacks, which target inference, poisoning attacks target the learning process itself.
Key Characteristics:
- Training-time Attack: Occurs during model training
- Persistence: Effects persist after training
- Stealth: Often undetectable during training
- Targeted Impact: Can affect specific predictions or general performance
Attack Taxonomy
By Attack Goal
- Availability Attacks: Degrade model performance
- Integrity Attacks: Cause specific misclassifications
- Backdoor Attacks: Trigger hidden functionality
By Attack Vector
- Data Poisoning: Inject malicious training data
- Model Poisoning: Compromise training process
- Federated Learning Poisoning: Submit malicious updates in distributed training
Data Poisoning Attacks
Label Flipping Attack
Label flipping attacks involve changing the labels of training data to cause the model to learn incorrect associations.
Label Flipping Strategy:

```
For each selected sample (x_i, y_i):
    if y_i == target_class:
        y_i = wrong_class
    else:
        y_i = target_class
```
Implementation Example
```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def label_flipping_attack(X_train, y_train, flip_ratio=0.1, target_class=0):
    """
    Perform a label flipping attack on training data.

    Args:
        X_train: Training features
        y_train: Training labels
        flip_ratio: Fraction of samples to poison
        target_class: Class to target for flipping

    Returns:
        poisoned_X: Original features (unchanged)
        poisoned_y: Poisoned labels
    """
    poisoned_y = y_train.copy()
    n_samples = len(y_train)
    n_flip = int(n_samples * flip_ratio)

    # Select random samples to flip
    flip_indices = np.random.choice(n_samples, n_flip, replace=False)

    for idx in flip_indices:
        if poisoned_y[idx] == target_class:
            # Flip to a different class
            available_classes = [c for c in np.unique(y_train) if c != target_class]
            poisoned_y[idx] = np.random.choice(available_classes)
        else:
            # Flip to the target class
            poisoned_y[idx] = target_class

    return X_train, poisoned_y


# Example usage
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Perform label flipping attack
poisoned_X, poisoned_y = label_flipping_attack(
    X_train, y_train, flip_ratio=0.15, target_class=0
)
```
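A quick way to see the attack's impact is to train the same classifier on the clean and the flipped labels and compare test accuracy. The sketch below reuses the `X_train`, `poisoned_y`, and test arrays from the example above; the choice of a random forest is illustrative, not prescribed by the lesson.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train one model on clean labels and one on flipped labels
clean_model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
poisoned_model = RandomForestClassifier(random_state=42).fit(X_train, poisoned_y)

clean_acc = accuracy_score(y_test, clean_model.predict(X_test))
poisoned_acc = accuracy_score(y_test, poisoned_model.predict(X_test))

print(f"Clean accuracy:    {clean_acc:.3f}")
print(f"Poisoned accuracy: {poisoned_acc:.3f}")
print(f"Accuracy drop:     {clean_acc - poisoned_acc:.3f}")
```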
Feature Poisoning Attack
Feature poisoning involves modifying the feature values of training data while keeping labels intact.
```python
def feature_poisoning_attack(X_train, y_train, poison_ratio=0.1, noise_std=0.1):
    """
    Perform a feature poisoning attack by adding noise to features.

    Args:
        X_train: Training features
        y_train: Training labels
        poison_ratio: Fraction of samples to poison
        noise_std: Standard deviation of the Gaussian noise to add

    Returns:
        poisoned_X: Poisoned features
        poisoned_y: Original labels (unchanged)
    """
    poisoned_X = X_train.copy()
    n_samples = len(X_train)
    n_poison = int(n_samples * poison_ratio)

    # Select random samples to poison
    poison_indices = np.random.choice(n_samples, n_poison, replace=False)

    # Add Gaussian noise to the selected samples
    for idx in poison_indices:
        noise = np.random.normal(0, noise_std, X_train.shape[1])
        poisoned_X[idx] += noise

    return poisoned_X, y_train
```
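A minimal usage sketch, reusing the Iris split defined earlier; the poison ratio and noise level here are arbitrary choices for illustration.

```python
# Poison 20% of the training features with relatively strong noise
noisy_X, noisy_y = feature_poisoning_attack(
    X_train, y_train, poison_ratio=0.2, noise_std=0.5
)

# How much did the features move on average?
print("Mean absolute feature change:", np.abs(noisy_X - X_train).mean())
```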
Backdoor Attacks
Understanding Backdoor Attacks
Backdoor attacks embed hidden functionality in models that can be triggered by specific input patterns (triggers) during inference, causing the model to produce attacker-desired outputs.
Backdoor Components:
- Trigger Pattern: Specific input modification
- Target Label: Desired output when trigger is present
- Poisoned Data: Training samples with triggers
- Clean Performance: Model works normally without trigger
Implementation Example
```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms


class BackdoorAttack:
    def __init__(self, trigger_pattern, target_label, poison_ratio=0.1):
        """
        Initialize the backdoor attack.

        Args:
            trigger_pattern: Pattern to add to images (e.g., a small patch)
            target_label: Label to assign when the trigger is present
            poison_ratio: Fraction of training data to poison
        """
        self.trigger_pattern = trigger_pattern
        self.target_label = target_label
        self.poison_ratio = poison_ratio

    def create_poisoned_data(self, dataset):
        """
        Create poisoned training data containing backdoor triggers.

        Args:
            dataset: Original training dataset

        Returns:
            poisoned_samples: List of (image, label) pairs including backdoor samples
        """
        poisoned_samples = []
        n_samples = len(dataset)
        n_poison = int(n_samples * self.poison_ratio)

        # Select random samples to poison
        poison_indices = set(np.random.choice(n_samples, n_poison, replace=False))

        for i, (image, label) in enumerate(dataset):
            if i in poison_indices:
                # Add the trigger and relabel to the target class
                poisoned_image = self.add_trigger(image)
                poisoned_samples.append((poisoned_image, self.target_label))
            else:
                # Keep the original sample
                poisoned_samples.append((image, label))

        return poisoned_samples

    def add_trigger(self, image):
        """
        Stamp the trigger pattern onto the bottom-right corner of an image.

        Args:
            image: Original image tensor, shaped (C, H, W) or (H, W)

        Returns:
            triggered_image: Image with the trigger pattern applied
        """
        triggered_image = image.clone()
        # Patch size is taken from the trigger pattern itself
        h, w = self.trigger_pattern.shape[-2], self.trigger_pattern.shape[-1]

        if len(image.shape) == 3:  # (C, H, W) image
            triggered_image[:, -h:, -w:] = self.trigger_pattern
        else:  # (H, W) grayscale image
            triggered_image[-h:, -w:] = self.trigger_pattern

        return triggered_image

    def test_backdoor(self, model, test_dataset):
        """
        Test backdoor effectiveness on clean and triggered data.

        Args:
            model: Trained model
            test_dataset: Test dataset

        Returns:
            clean_accuracy: Accuracy on clean data
            backdoor_success: Fraction of triggered inputs classified as the target label
        """
        model.eval()
        clean_correct = 0
        backdoor_correct = 0
        total = len(test_dataset)

        with torch.no_grad():
            for image, label in test_dataset:
                # Test on the clean input
                clean_pred = torch.argmax(model(image.unsqueeze(0)), dim=1)
                if clean_pred.item() == label:
                    clean_correct += 1

                # Test on the triggered input
                # (samples already belonging to the target class are not excluded here, for simplicity)
                triggered_image = self.add_trigger(image)
                backdoor_pred = torch.argmax(model(triggered_image.unsqueeze(0)), dim=1)
                if backdoor_pred.item() == self.target_label:
                    backdoor_correct += 1

        clean_accuracy = clean_correct / total
        backdoor_success = backdoor_correct / total
        return clean_accuracy, backdoor_success


# Example usage
trigger_pattern = torch.ones(3, 10, 10)  # White 10x10 patch for an RGB image
backdoor_attack = BackdoorAttack(
    trigger_pattern=trigger_pattern,
    target_label=0,      # Always predict class 0 when the trigger is present
    poison_ratio=0.1
)
```
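The attack object is typically used in two stages: poison the training set, train on it as usual, then measure both clean accuracy and trigger success. A minimal sketch of that flow is shown below; `train_dataset`, `test_dataset`, and `train_model` are placeholders that are not defined in this lesson.

```python
# Hypothetical end-to-end flow (train_dataset, test_dataset, and
# train_model are placeholders for your own data and training loop)
poisoned_train = backdoor_attack.create_poisoned_data(train_dataset)
model = train_model(poisoned_train)  # any standard training loop

clean_acc, trigger_success = backdoor_attack.test_backdoor(model, test_dataset)
print(f"Clean accuracy: {clean_acc:.3f}, trigger success rate: {trigger_success:.3f}")
```

A successful backdoor keeps clean accuracy close to that of an unpoisoned model while driving the trigger success rate toward 1.0.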
Model Poisoning in Federated Learning
Federated Learning Vulnerabilities
In federated learning, malicious participants can poison the global model by submitting crafted model updates during the aggregation process.
Model Poisoning Strategy:

```
# Malicious client update
w_malicious = w_global + β * (w_target - w_global)
```

Where:
- w_global: Current global model weights
- w_target: Attacker's desired model weights
- β: Poisoning strength parameter
Implementation Example
```python
class FederatedPoisoning:
    def __init__(self, target_model_weights, poisoning_strength=1.0):
        """
        Initialize the federated learning poisoning attack.

        Args:
            target_model_weights: The attacker's desired model weights
            poisoning_strength: Strength of poisoning (the β parameter)
        """
        self.target_weights = target_model_weights
        self.poisoning_strength = poisoning_strength

    def poison_model_update(self, global_weights):
        """
        Create a poisoned model update for federated learning.

        Args:
            global_weights: Current global model weights

        Returns:
            poisoned_weights: Malicious model weights
        """
        poisoned_weights = {}
        for layer_name in global_weights:
            global_w = global_weights[layer_name]
            target_w = self.target_weights[layer_name]
            # Apply the poisoning formula: w_malicious = w_global + β * (w_target - w_global)
            poisoned_weights[layer_name] = (
                global_w + self.poisoning_strength * (target_w - global_w)
            )
        return poisoned_weights

    def adaptive_poisoning(self, global_weights, round_number, max_rounds):
        """
        Adaptive poisoning whose strength changes over training rounds.

        Args:
            global_weights: Current global model weights
            round_number: Current training round
            max_rounds: Total number of rounds

        Returns:
            poisoned_weights: Malicious model weights
        """
        # Gradually increase the poisoning strength
        adaptive_strength = self.poisoning_strength * (round_number / max_rounds)

        poisoned_weights = {}
        for layer_name in global_weights:
            global_w = global_weights[layer_name]
            target_w = self.target_weights[layer_name]
            poisoned_weights[layer_name] = (
                global_w + adaptive_strength * (target_w - global_w)
            )
        return poisoned_weights


# Example usage for federated learning poisoning
def simulate_federated_attack():
    """
    Simulate federated learning with a malicious client.

    Note: create_model() is a placeholder for whatever model constructor
    the federated setup uses; it is not defined in this lesson.
    """
    # Initialize the global model
    global_model = create_model()
    global_weights = global_model.state_dict()

    # Attacker's target model
    target_model = create_model()
    # Modify the target model to have backdoor behavior (details omitted here)
    target_weights = target_model.state_dict()

    # Initialize the poisoning attack
    poison_attack = FederatedPoisoning(target_weights, poisoning_strength=0.5)

    # Simulate multiple rounds
    for round_num in range(10):
        # Normal clients send legitimate updates
        # ... (simulate normal updates)

        # The malicious client sends a poisoned update
        if round_num % 3 == 0:  # Attack every 3rd round
            poisoned_weights = poison_attack.poison_model_update(global_weights)
            # In real federated learning, this update would be sent to the server
            print(f"Round {round_num}: malicious update sent")

        # Server aggregates updates (simplified)
        # global_weights = aggregate_updates(all_client_updates)
```
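The simulation above leaves the server-side aggregation as a comment. For reference, here is a minimal sketch of the uniform FedAvg-style averaging that an `aggregate_updates` helper might perform; the function signature and equal client weighting are assumptions, not part of the lesson's code.

```python
import torch

def aggregate_updates(client_weight_dicts):
    """Uniform FedAvg-style averaging of client state_dicts (illustrative sketch)."""
    aggregated = {}
    for layer_name in client_weight_dicts[0]:
        # Stack each client's tensor for this layer and average elementwise
        stacked = torch.stack([w[layer_name].float() for w in client_weight_dicts])
        aggregated[layer_name] = stacked.mean(dim=0)
    return aggregated
```

Because plain averaging trusts every client equally, a single poisoned update scaled by a large β can dominate a round, which is why robust aggregation rules (coordinate-wise median, trimmed mean, norm clipping) are common defenses.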
Hands-On Exercise
Exercise: Implement and Evaluate Poisoning Attacks
Objective: Implement data poisoning and backdoor attacks, and evaluate their effectiveness on different models.
Steps:

- Setup Environment

Prepare datasets and models for poisoning experiments:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# Load datasets
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])
mnist_train = datasets.MNIST('./data', train=True, download=True, transform=transform)
mnist_test = datasets.MNIST('./data', train=False, transform=transform)
```
- Implement Data Poisoning

Create poisoned datasets with different attack strategies:

```python
def evaluate_poisoning_attack(model, clean_data, poisoned_data, test_data):
    """
    Evaluate the effectiveness of a poisoning attack.

    Args:
        model: Model to test
        clean_data: Clean training data (X, y)
        poisoned_data: Poisoned training data (X, y)
        test_data: Test data (X, y) for evaluation

    Returns:
        results: Dictionary with evaluation metrics

    Note: train_model and evaluate_model are helper functions you write as
    part of this exercise (fit the model, then return its test accuracy).
    """
    # Train a model on clean data
    clean_model = train_model(model, clean_data)
    clean_accuracy = evaluate_model(clean_model, test_data)

    # Train a model on poisoned data
    poisoned_model = train_model(model, poisoned_data)
    poisoned_accuracy = evaluate_model(poisoned_model, test_data)

    # Calculate attack effectiveness
    performance_drop = clean_accuracy - poisoned_accuracy

    return {
        'clean_accuracy': clean_accuracy,
        'poisoned_accuracy': poisoned_accuracy,
        'performance_drop': performance_drop,
        'attack_success': performance_drop > 0.05  # 5% accuracy-drop threshold
    }


# Test different poisoning ratios
poisoning_ratios = [0.05, 0.1, 0.2, 0.3]
results = []

for ratio in poisoning_ratios:
    # Create poisoned data
    poisoned_X, poisoned_y = label_flipping_attack(
        X_train, y_train, flip_ratio=ratio, target_class=0
    )

    # Evaluate the attack
    result = evaluate_poisoning_attack(
        RandomForestClassifier(),
        (X_train, y_train),
        (poisoned_X, poisoned_y),
        (X_test, y_test)
    )
    result['poison_ratio'] = ratio
    results.append(result)
```
- Implement Backdoor Attack

Create and test backdoor attacks on neural networks:

```python
import torch.nn.functional as F


class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout(0.5)  # plain Dropout: the input here is already flattened
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


def train_backdoor_model(model, clean_data, backdoor_attack, epochs=5):
    """
    Train a model on backdoor-poisoned data.

    Args:
        model: Model to train
        clean_data: Clean training data
        backdoor_attack: BackdoorAttack object
        epochs: Number of training epochs

    Returns:
        trained_model: Model trained with the backdoor
    """
    model.train()
    optimizer = optim.Adam(model.parameters())

    # Create the poisoned dataset
    poisoned_data = backdoor_attack.create_poisoned_data(clean_data)

    # Simple sample-by-sample loop; a batched DataLoader would be faster in practice
    for epoch in range(epochs):
        for data, target in poisoned_data:
            optimizer.zero_grad()
            output = model(data.unsqueeze(0))
            loss = F.nll_loss(output, torch.tensor([target]))
            loss.backward()
            optimizer.step()

    return model
```
- Evaluate Attack Effectiveness

Compare different attack methods and measure their impact:

```python
def comprehensive_evaluation():
    """
    Comprehensive evaluation of all poisoning attacks.
    """
    results = {}

    # 1. Data Poisoning Evaluation
    print("Evaluating Data Poisoning Attacks...")
    poisoning_results = []

    for ratio in [0.05, 0.1, 0.2]:
        poisoned_X, poisoned_y = label_flipping_attack(
            X_train, y_train, flip_ratio=ratio
        )
        result = evaluate_poisoning_attack(
            RandomForestClassifier(),
            (X_train, y_train),
            (poisoned_X, poisoned_y),
            (X_test, y_test)
        )
        result['attack_type'] = 'label_flipping'
        result['poison_ratio'] = ratio
        poisoning_results.append(result)

    results['data_poisoning'] = poisoning_results

    # 2. Backdoor Attack Evaluation
    print("Evaluating Backdoor Attacks...")
    model = SimpleCNN()
    trigger = torch.ones(1, 4, 4)  # 4x4 white patch
    backdoor = BackdoorAttack(trigger, target_label=0, poison_ratio=0.1)

    # Train with the backdoor
    backdoor_model = train_backdoor_model(model, mnist_train, backdoor)

    # Test backdoor effectiveness
    clean_acc, backdoor_success = backdoor.test_backdoor(backdoor_model, mnist_test)

    results['backdoor'] = {
        'clean_accuracy': clean_acc,
        'backdoor_success_rate': backdoor_success,
        'stealth_score': 1 - abs(clean_acc - 0.9)  # How close to normal performance
    }

    return results


# Run comprehensive evaluation
evaluation_results = comprehensive_evaluation()
```
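To compare attacks at a glance, and as a starting point for the visualization deliverable below, you can plot test accuracy against poison ratio. A minimal sketch using matplotlib, assuming the `evaluation_results` dictionary produced above:

```python
import matplotlib.pyplot as plt

dp_results = evaluation_results['data_poisoning']
ratios = [r['poison_ratio'] for r in dp_results]
accuracies = [r['poisoned_accuracy'] for r in dp_results]

plt.plot(ratios, accuracies, marker='o', label='label flipping')
plt.axhline(dp_results[0]['clean_accuracy'], linestyle='--', label='clean baseline')
plt.xlabel('Poison ratio')
plt.ylabel('Test accuracy')
plt.title('Label flipping: accuracy vs. poison ratio')
plt.legend()
plt.show()
```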
Deliverables:
- Working implementations of all poisoning attack types
- Comparison of attack effectiveness across different models
- Analysis of stealth vs. effectiveness trade-offs
- Visualization of poisoned vs. clean data
- Defense mechanism recommendations
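For the last deliverable, one simple starting point against label flipping is to flag training samples whose labels disagree with their nearest neighbors. This is a minimal sketch of a k-NN label-agreement filter, not a method prescribed by the lesson; the values of k and the agreement threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def flag_suspicious_labels(X, y, k=5, agreement_threshold=0.4):
    """Flag samples whose label disagrees with most of their k nearest neighbors."""
    knn = KNeighborsClassifier(n_neighbors=k + 1).fit(X, y)
    # For each sample, look at its neighbors, excluding the sample itself (first column)
    neighbor_idx = knn.kneighbors(X, return_distance=False)[:, 1:]
    neighbor_labels = y[neighbor_idx]
    agreement = (neighbor_labels == y[:, None]).mean(axis=1)
    return np.where(agreement < agreement_threshold)[0]

# Example: flag likely flipped labels in the poisoned Iris data from earlier
suspicious = flag_suspicious_labels(np.asarray(X_train), np.asarray(poisoned_y))
print(f"Flagged {len(suspicious)} potentially poisoned samples")
```

Flagged samples can then be inspected, relabeled, or dropped before retraining, and the same clean-vs-poisoned accuracy comparison used above measures how much of the attack the filter removes.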