📚 Learning Objectives

By the end of this lesson, you will be able to:

  • Explain how poisoning attacks differ from evasion attacks and classify them by goal and vector
  • Implement data poisoning attacks such as label flipping and feature poisoning
  • Build backdoor attacks that embed trigger-activated behavior in a trained model
  • Describe model poisoning in federated learning and the role of the β parameter
  • Evaluate attack effectiveness and the trade-off between stealth and impact

🧪 Understanding Poisoning Attacks

What are Poisoning Attacks?

Poisoning attacks occur during the training phase: attackers inject malicious data or tamper with the training process to compromise the model's behavior. Unlike evasion attacks, which target inference, poisoning attacks target the learning process itself.

🎯 Key Characteristics:

  • Training-time Attack: Occurs during model training
  • Persistence: Effects persist after training
  • Stealth: Often undetectable during training
  • Targeted Impact: Can affect specific predictions or general performance

Attack Taxonomy

📊 By Attack Goal

  • Availability Attacks: Degrade model performance
  • Integrity Attacks: Cause specific misclassifications
  • Backdoor Attacks: Trigger hidden functionality

🎯 By Attack Vector

  • Data Poisoning: Inject malicious training data
  • Model Poisoning: Compromise the training process or submit malicious model updates
  • Federated Learning Poisoning: Target distributed training through compromised participants

💉 Data Poisoning Attacks

Label Flipping Attack

Label flipping attacks involve changing the labels of training data to cause the model to learn incorrect associations.

๐Ÿ“ Label Flipping Strategy:

For each selected sample (x_i, y_i):
    if y_i == target_class:
        y_i = wrong_class
    else:
        y_i = target_class
                    

Implementation Example

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

def label_flipping_attack(X_train, y_train, flip_ratio=0.1, target_class=0):
    """
    Perform label flipping attack on training data
    
    Args:
        X_train: Training features
        y_train: Training labels
        flip_ratio: Fraction of training samples whose labels are flipped (e.g., 0.1 = 10%)
        target_class: Class to target for flipping
    
    Returns:
        poisoned_X: Original features (unchanged)
        poisoned_y: Poisoned labels
    """
    poisoned_y = y_train.copy()
    n_samples = len(y_train)
    n_flip = int(n_samples * flip_ratio)
    
    # Get indices of samples to flip
    flip_indices = np.random.choice(n_samples, n_flip, replace=False)
    
    for idx in flip_indices:
        if poisoned_y[idx] == target_class:
            # Flip to a different class
            available_classes = [c for c in np.unique(y_train) if c != target_class]
            poisoned_y[idx] = np.random.choice(available_classes)
        else:
            # Flip to target class
            poisoned_y[idx] = target_class
    
    return X_train, poisoned_y

# Example usage
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Perform label flipping attack
poisoned_X, poisoned_y = label_flipping_attack(
    X_train, y_train, flip_ratio=0.15, target_class=0
)
                
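
To gauge the impact of label flipping, one simple check is to train the same classifier on the clean labels and on the poisoned labels and compare test accuracy. The sketch below uses scikit-learn's LogisticRegression purely as an example classifier (it is not part of the lesson's code); the size of the accuracy drop will vary with the model, dataset, and flip ratio.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def compare_clean_vs_poisoned(X_train, y_train, poisoned_y, X_test, y_test):
    """Train identical models on clean vs. poisoned labels and report test accuracy."""
    clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, poisoned_y)

    clean_acc = accuracy_score(y_test, clean_model.predict(X_test))
    poisoned_acc = accuracy_score(y_test, poisoned_model.predict(X_test))
    return clean_acc, poisoned_acc

clean_acc, poisoned_acc = compare_clean_vs_poisoned(
    X_train, y_train, poisoned_y, X_test, y_test
)
print(f"Clean accuracy: {clean_acc:.3f}, poisoned accuracy: {poisoned_acc:.3f}")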

Feature Poisoning Attack

Feature poisoning involves modifying the feature values of training data while keeping labels intact.

def feature_poisoning_attack(X_train, y_train, poison_ratio=0.1, noise_std=0.1):
    """
    Perform feature poisoning attack by adding noise to features
    
    Args:
        X_train: Training features
        y_train: Training labels
        poison_ratio: Fraction of training samples to poison (e.g., 0.1 = 10%)
        noise_std: Standard deviation of noise to add
    
    Returns:
        poisoned_X: Poisoned features
        poisoned_y: Original labels (unchanged)
    """
    poisoned_X = X_train.copy()
    n_samples = len(X_train)
    n_poison = int(n_samples * poison_ratio)
    
    # Select random samples to poison
    poison_indices = np.random.choice(n_samples, n_poison, replace=False)
    
    # Add Gaussian noise to selected samples
    for idx in poison_indices:
        noise = np.random.normal(0, noise_std, X_train.shape[1])
        poisoned_X[idx] += noise
    
    return poisoned_X, y_train
                
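
A usage sketch on the same iris split as above (the noise_std value here is arbitrary; what counts as a large perturbation depends on the feature scale):

# Example usage: perturb 20% of the training samples with Gaussian noise
noisy_X, noisy_y = feature_poisoning_attack(
    X_train, y_train, poison_ratio=0.2, noise_std=0.5
)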

🚪 Backdoor Attacks

Understanding Backdoor Attacks

Backdoor attacks embed hidden functionality in models that can be triggered by specific input patterns (triggers) during inference, causing the model to produce attacker-desired outputs.

🔑 Backdoor Components:

  • Trigger Pattern: Specific input modification
  • Target Label: Desired output when trigger is present
  • Poisoned Data: Training samples with triggers
  • Clean Performance: Model works normally without trigger

Implementation Example

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms

class BackdoorAttack:
    def __init__(self, trigger_pattern, target_label, poison_ratio=0.1):
        """
        Initialize backdoor attack
        
        Args:
            trigger_pattern: Pattern to add to images (e.g., small patch)
            target_label: Label to assign when trigger is present
            poison_ratio: Ratio of training data to poison
        """
        self.trigger_pattern = trigger_pattern
        self.target_label = target_label
        self.poison_ratio = poison_ratio
    
    def create_poisoned_data(self, dataset):
        """
        Create poisoned training data with backdoor triggers
        
        Args:
            dataset: Original training dataset
        
        Returns:
            poisoned_dataset: Dataset with backdoor samples
        """
        poisoned_samples = []
        n_samples = len(dataset)
        n_poison = int(n_samples * self.poison_ratio)
        
        # Select random samples to poison (a set makes the membership check below fast)
        poison_indices = set(np.random.choice(n_samples, n_poison, replace=False).tolist())
        
        for i, (image, label) in enumerate(dataset):
            if i in poison_indices:
                # Add trigger and change label
                poisoned_image = self.add_trigger(image)
                poisoned_samples.append((poisoned_image, self.target_label))
            else:
                # Keep original sample
                poisoned_samples.append((image, label))
        
        return poisoned_samples
    
    def add_trigger(self, image):
        """
        Add trigger pattern to image
        
        Args:
            image: Original image tensor
        
        Returns:
            triggered_image: Image with trigger pattern
        """
        triggered_image = image.clone()
        
        # Stamp the trigger into the bottom-right corner using the trigger's own size
        # (works for (C, H, W) color or single-channel image tensors)
        trigger_h, trigger_w = self.trigger_pattern.shape[-2:]
        triggered_image[..., -trigger_h:, -trigger_w:] = self.trigger_pattern
        
        return triggered_image
    
    def test_backdoor(self, model, test_dataset):
        """
        Test backdoor effectiveness on clean and triggered data
        
        Args:
            model: Trained model
            test_dataset: Test dataset
        
        Returns:
            clean_accuracy: Accuracy on clean data
            backdoor_success: Success rate on triggered data
        """
        model.eval()
        clean_correct = 0
        backdoor_correct = 0
        total = len(test_dataset)
        
        with torch.no_grad():
            for image, label in test_dataset:
                # Test on clean data
                clean_pred = torch.argmax(model(image.unsqueeze(0)), dim=1)
                if clean_pred.item() == label:
                    clean_correct += 1
                
                # Test on triggered data
                # (samples whose true label already equals the target class inflate this
                #  success rate; a stricter metric excludes them)
                triggered_image = self.add_trigger(image)
                backdoor_pred = torch.argmax(model(triggered_image.unsqueeze(0)), dim=1)
                if backdoor_pred.item() == self.target_label:
                    backdoor_correct += 1
        
        clean_accuracy = clean_correct / total
        backdoor_success = backdoor_correct / total
        
        return clean_accuracy, backdoor_success

# Example usage
trigger_pattern = torch.ones(3, 10, 10)  # White 10x10 patch
backdoor_attack = BackdoorAttack(
    trigger_pattern=trigger_pattern,
    target_label=0,  # Always predict class 0
    poison_ratio=0.1
)
                
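
To see the data-poisoning step end to end, the sketch below applies the attack to a torchvision dataset. CIFAR-10 is used here only as an illustrative RGB dataset whose 3×32×32 images match the 3-channel trigger above; any (image, label) dataset works the same way.

from torchvision import datasets, transforms

cifar_train = datasets.CIFAR10('./data', train=True, download=True,
                               transform=transforms.ToTensor())

# Replace ~10% of the samples with triggered copies relabeled to the target class
poisoned_train = backdoor_attack.create_poisoned_data(cifar_train)

# poisoned_train is a plain list of (image, label) pairs, ready to wrap in a DataLoader
image, label = poisoned_train[0]
print(image.shape, label)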

🔄 Model Poisoning in Federated Learning

Federated Learning Vulnerabilities

In federated learning, malicious participants can poison the global model by submitting malicious model updates during the aggregation process.

๐Ÿ“ Model Poisoning Strategy:

# Malicious client update
w_malicious = w_global + β * (w_target - w_global)

Where:
- w_global: Current global model weights
- w_target: Attacker's desired model weights
- β: Poisoning strength parameter

Implementation Example

class FederatedPoisoning:
    def __init__(self, target_model_weights, poisoning_strength=1.0):
        """
        Initialize federated learning poisoning attack
        
        Args:
            target_model_weights: Desired model weights
            poisoning_strength: Strength of poisoning (β parameter)
        """
        self.target_weights = target_model_weights
        self.poisoning_strength = poisoning_strength
    
    def poison_model_update(self, global_weights):
        """
        Create poisoned model update for federated learning
        
        Args:
            global_weights: Current global model weights
        
        Returns:
            poisoned_weights: Malicious model weights
        """
        poisoned_weights = {}
        
        for layer_name in global_weights:
            # Apply poisoning formula
            global_w = global_weights[layer_name]
            target_w = self.target_weights[layer_name]
            
            # w_malicious = w_global + β * (w_target - w_global)
            poisoned_weights[layer_name] = (
                global_w + self.poisoning_strength * (target_w - global_w)
            )
        
        return poisoned_weights
    
    def adaptive_poisoning(self, global_weights, round_number, max_rounds):
        """
        Adaptive poisoning that changes strength over time
        
        Args:
            global_weights: Current global model weights
            round_number: Current training round
            max_rounds: Total number of rounds
        
        Returns:
            poisoned_weights: Malicious model weights
        """
        # Gradually increase poisoning strength
        adaptive_strength = self.poisoning_strength * (round_number / max_rounds)
        
        poisoned_weights = {}
        for layer_name in global_weights:
            global_w = global_weights[layer_name]
            target_w = self.target_weights[layer_name]
            
            poisoned_weights[layer_name] = (
                global_w + adaptive_strength * (target_w - global_w)
            )
        
        return poisoned_weights

# Example usage for federated learning poisoning
def simulate_federated_attack():
    """
    Simulate federated learning with malicious client
    """
    # Initialize global model (create_model() is a placeholder for any model constructor)
    global_model = create_model()
    global_weights = global_model.state_dict()
    
    # Attacker's target model
    target_model = create_model()
    # Modify target model to have backdoor behavior
    target_weights = target_model.state_dict()
    
    # Initialize poisoning attack
    poison_attack = FederatedPoisoning(target_weights, poisoning_strength=0.5)
    
    # Simulate multiple rounds
    for round_num in range(10):
        # Normal clients send legitimate updates
        # ... (simulate normal updates)
        
        # Malicious client sends poisoned update
        if round_num % 3 == 0:  # Attack every 3rd round
            poisoned_weights = poison_attack.poison_model_update(global_weights)
            # In real federated learning, this would be sent to server
            print(f"Round {round_num}: Malicious update sent")
        
        # Server aggregates updates (simplified)
        # global_weights = aggregate_updates(all_client_updates)
                
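
The aggregation step left as a comment above can be filled in with plain federated averaging. This is a simplified stand-in for a real server, not part of the lesson's code, but it shows why a single malicious client often scales the poisoning strength up: with N clients, simple averaging divides the malicious deviation (w_target - w_global) by roughly N.

def federated_average(client_updates):
    """Simple FedAvg: average each layer across all client state_dicts."""
    aggregated = {}
    for layer_name in client_updates[0]:
        stacked = torch.stack([update[layer_name].float() for update in client_updates])
        aggregated[layer_name] = stacked.mean(dim=0)
    return aggregated

# Hypothetical round with 9 honest clients and 1 malicious client:
# all_client_updates = honest_updates + [poisoned_weights]
# global_weights = federated_average(all_client_updates)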

🧪 Hands-On Exercise

Exercise: Implement and Evaluate Poisoning Attacks

Objective: Implement data poisoning, backdoor attacks, and evaluate their effectiveness on different models.

📋 Steps:

  1. Setup Environment

    Prepare datasets and models for poisoning experiments:

    import torch
    import torch.nn as nn
    import torch.optim as optim
    from torchvision import datasets, transforms
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, classification_report
    
    # Load datasets
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    
    mnist_train = datasets.MNIST('./data', train=True, download=True, transform=transform)
    mnist_test = datasets.MNIST('./data', train=False, transform=transform)
                                
  2. Implement Data Poisoning

    Create poisoned datasets with different attack strategies (the train_model and evaluate_model helpers used below are sketched after the exercise steps):

    def evaluate_poisoning_attack(model, clean_data, poisoned_data, test_data):
        """
        Evaluate the effectiveness of a poisoning attack
        
        Args:
            model: Model to test
            clean_data: Clean training data
            poisoned_data: Poisoned training data
            test_data: Test data for evaluation
        
        Returns:
            results: Dictionary with evaluation metrics
        """
        # Train model on clean data
        clean_model = train_model(model, clean_data)
        clean_accuracy = evaluate_model(clean_model, test_data)
        
        # Train model on poisoned data
        poisoned_model = train_model(model, poisoned_data)
        poisoned_accuracy = evaluate_model(poisoned_model, test_data)
        
        # Calculate attack effectiveness
        performance_drop = clean_accuracy - poisoned_accuracy
        
        return {
            'clean_accuracy': clean_accuracy,
            'poisoned_accuracy': poisoned_accuracy,
            'performance_drop': performance_drop,
            'attack_success': performance_drop > 0.05  # 5% threshold
        }
    
    # Test different poisoning ratios
    poisoning_ratios = [0.05, 0.1, 0.2, 0.3]
    results = []
    
    for ratio in poisoning_ratios:
        # Create poisoned data
        poisoned_X, poisoned_y = label_flipping_attack(
            X_train, y_train, flip_ratio=ratio, target_class=0
        )
        
        # Evaluate attack
        result = evaluate_poisoning_attack(
            RandomForestClassifier(), 
            (X_train, y_train),
            (poisoned_X, poisoned_y),
            (X_test, y_test)
        )
        result['poison_ratio'] = ratio
        results.append(result)
                                
  3. Implement Backdoor Attack

    Create and test backdoor attacks on neural networks:

    import torch.nn.functional as F

    class SimpleCNN(nn.Module):
        def __init__(self):
            super(SimpleCNN, self).__init__()
            self.conv1 = nn.Conv2d(1, 32, 3, 1)
            self.conv2 = nn.Conv2d(32, 64, 3, 1)
            self.dropout1 = nn.Dropout2d(0.25)
            self.dropout2 = nn.Dropout(0.5)
            self.fc1 = nn.Linear(9216, 128)
            self.fc2 = nn.Linear(128, 10)
        
        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = self.conv2(x)
            x = F.relu(x)
            x = F.max_pool2d(x, 2)
            x = self.dropout1(x)
            x = torch.flatten(x, 1)
            x = self.fc1(x)
            x = F.relu(x)
            x = self.dropout2(x)
            x = self.fc2(x)
            return F.log_softmax(x, dim=1)
    
    def train_backdoor_model(model, clean_data, backdoor_attack, epochs=5):
        """
        Train model with backdoor attack
        
        Args:
            model: Model to train
            clean_data: Clean training data
            backdoor_attack: Backdoor attack object
            epochs: Number of training epochs
        
        Returns:
            trained_model: Model trained with backdoor
        """
        model.train()
        optimizer = optim.Adam(model.parameters())
        
        # Create poisoned dataset
        poisoned_data = backdoor_attack.create_poisoned_data(clean_data)
        
        for epoch in range(epochs):
            for batch_idx, (data, target) in enumerate(poisoned_data):
                optimizer.zero_grad()
                output = model(data.unsqueeze(0))
                loss = F.nll_loss(output, torch.tensor([target]))
                loss.backward()
                optimizer.step()
        
        return model
                                
  4. Evaluate Attack Effectiveness

    Compare different attack methods and measure their impact:

    def comprehensive_evaluation():
        """
        Comprehensive evaluation of all poisoning attacks
        """
        results = {}
        
        # 1. Data Poisoning Evaluation
        print("Evaluating Data Poisoning Attacks...")
        poisoning_results = []
        for ratio in [0.05, 0.1, 0.2]:
            poisoned_X, poisoned_y = label_flipping_attack(
                X_train, y_train, flip_ratio=ratio
            )
            result = evaluate_poisoning_attack(
                RandomForestClassifier(),
                (X_train, y_train),
                (poisoned_X, poisoned_y),
                (X_test, y_test)
            )
            result['attack_type'] = 'label_flipping'
            result['poison_ratio'] = ratio
            poisoning_results.append(result)
        
        results['data_poisoning'] = poisoning_results
        
        # 2. Backdoor Attack Evaluation
        print("Evaluating Backdoor Attacks...")
        model = SimpleCNN()
        trigger = torch.ones(1, 4, 4)  # 4x4 white patch
        backdoor = BackdoorAttack(trigger, target_label=0, poison_ratio=0.1)
        
        # Train with backdoor
        backdoor_model = train_backdoor_model(model, mnist_train, backdoor)
        
        # Test backdoor effectiveness
        clean_acc, backdoor_success = backdoor.test_backdoor(backdoor_model, mnist_test)
        
        results['backdoor'] = {
            'clean_accuracy': clean_acc,
            'backdoor_success_rate': backdoor_success,
            'stealth_score': 1 - abs(clean_acc - 0.9)  # closeness to an assumed ~90% clean-accuracy baseline
        }
        
        return results
    
    # Run comprehensive evaluation
    evaluation_results = comprehensive_evaluation()
                                
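
The evaluation code above calls train_model and evaluate_model without defining them. A minimal sketch for the scikit-learn experiments, assuming training and test data are passed as (X, y) tuples, could be:

from sklearn.base import clone
from sklearn.metrics import accuracy_score

def train_model(model, data):
    """Fit a fresh copy of the estimator on (X, y) training data."""
    X, y = data
    fitted = clone(model)  # avoid reusing an already-fitted estimator
    fitted.fit(X, y)
    return fitted

def evaluate_model(model, data):
    """Return accuracy of a fitted estimator on (X, y) test data."""
    X, y = data
    return accuracy_score(y, model.predict(X))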

📄 Deliverables:

  • Working implementations of all poisoning attack types
  • Comparison of attack effectiveness across different models
  • Analysis of stealth vs. effectiveness trade-offs
  • Visualization of poisoned vs. clean data
  • Defense mechanism recommendations

📊 Knowledge Check

Question 1: What is the main difference between data poisoning and backdoor attacks?

Question 2: In federated learning poisoning, what does the β parameter control?

Question 3: Which type of poisoning attack is most suitable for maintaining normal model performance while enabling hidden functionality?