Lesson 2: AI Threat Landscape
Comprehensive overview of threats targeting AI systems
Learning Objectives
By the end of this lesson, you will be able to:
- Understand the taxonomy of AI threats
- Identify different types of adversarial attacks
- Recognize data poisoning threats
- Understand model extraction and backdoor attacks
- Analyze privacy attacks on ML models
AI Threat Taxonomy
Primary Threat Categories
Evasion Attacks
Attacks during inference to cause misclassification
- Adversarial examples
- FGSM, PGD, C&W attacks
- Universal adversarial perturbations
Poisoning Attacks
Attacks during training to compromise model behavior
- Data poisoning
- Label flipping
- Backdoor insertion
Extraction Attacks
Attacks to steal or reconstruct models
- Model extraction
- Model inversion
- Membership inference
Privacy Attacks
Attacks to extract sensitive information
- Training data reconstruction
- Attribute inference
- Property inference
Evasion Attacks (Adversarial Examples)
What are Adversarial Examples?
Adversarial examples are inputs that have been subtly modified to cause machine learning models to make incorrect predictions, while remaining virtually indistinguishable from the original inputs to human observers.
Fast Gradient Sign Method (FGSM)
Principle: Uses the gradient of the loss function to create adversarial examples
x' = x + ε × sign(∇ₓJ(θ, x, y))
- Advantages: Fast and simple
- Disadvantages: Single-step attack, easily defended
- Use Case: Quick adversarial example generation
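A minimal sketch of the single FGSM step above, assuming a PyTorch classifier (`model` is a placeholder) trained with cross-entropy loss on inputs scaled to [0, 1]:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Single-step FGSM: move x in the direction of the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)   # J(theta, x, y)
    loss.backward()
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0, 1).detach()         # keep pixels in the valid range
```

Larger values of `epsilon` make the perturbation more effective but also more visible to a human observer.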
Projected Gradient Descent (PGD)
Principle: Iterative version of FGSM with projection
x₀ = x + random_noise
xₜ₊₁ = Π(xₜ + α × sign(∇ₓJ(θ, xₜ, y)))
- Advantages: Strong attack, hard to defend
- Disadvantages: Computationally expensive
- Use Case: Robustness testing
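Under the same assumptions (a placeholder PyTorch `model`, inputs in [0, 1]), the iterative loop with projection might look like this; the step size `alpha` and the 40 iterations are illustrative defaults:

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, epsilon=0.03, alpha=0.007, steps=40):
    """Iterative FGSM with a random start and projection back into the epsilon-ball."""
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Projection: clip back into [x - epsilon, x + epsilon], then into the valid pixel range
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon).clamp(0, 1)
    return x_adv.detach()
```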
Carlini & Wagner (C&W)
Principle: Optimizes for minimal perturbation
minimize ||δ||ₚ + c × f(x + δ)
- Advantages: Very effective, minimal perturbation
- Disadvantages: Slow, requires optimization
- Use Case: Breaking defensive measures
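The sketch below captures only the spirit of the L2 formulation: it optimizes the perturbation δ with Adam and a logit-margin loss, but omits the tanh change of variables and the binary search over `c` used in the original attack. `model` is again a placeholder PyTorch classifier.

```python
import torch
import torch.nn.functional as F

def cw_attack(model, x, y, c=1.0, steps=200, lr=0.01, kappa=0.0):
    """Simplified untargeted C&W-style attack: minimize ||delta||_2^2 + c * f(x + delta)."""
    delta = torch.zeros_like(x, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = model((x + delta).clamp(0, 1))
        true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
        # Best logit among all wrong classes
        other_logit = logits.masked_fill(
            F.one_hot(y, logits.size(1)).bool(), float("-inf")
        ).max(dim=1).values
        # f > 0 while the true class still wins; push it below -kappa
        f = torch.clamp(true_logit - other_logit, min=-kappa)
        loss = (delta.flatten(1).norm(dim=1) ** 2 + c * f).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (x + delta).clamp(0, 1).detach()
```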
Data Poisoning Attacks
Types of Data Poisoning
1. Label Flipping
Changing the labels of training examples to mislead the model.
Example:
In a spam detection system, relabeling legitimate training emails as spam so that the trained model learns to flag similar legitimate emails as spam.
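A toy NumPy simulation of label flipping for such a binary spam task; the dataset, the class encoding (0 = legitimate, 1 = spam), and the flip rate are all made up for illustration:

```python
import numpy as np

def flip_labels(y, flip_fraction=0.1, source_class=0, target_class=1, seed=0):
    """Relabel a fraction of source_class examples as target_class (poisoning simulation)."""
    rng = np.random.default_rng(seed)
    y_poisoned = y.copy()
    candidates = np.flatnonzero(y == source_class)
    flip_idx = rng.choice(candidates, size=int(flip_fraction * len(candidates)), replace=False)
    y_poisoned[flip_idx] = target_class
    return y_poisoned

# Example: flip 10% of "legitimate" (0) labels to "spam" (1) in a synthetic label vector
y = np.random.default_rng(1).integers(0, 2, size=1000)
y_poisoned = flip_labels(y, flip_fraction=0.1)
print((y != y_poisoned).sum(), "labels flipped")
```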
2. Data Injection
Adding malicious samples to the training dataset.
Example:
Injecting images with specific triggers into a facial recognition training set to create backdoors.
3. Backdoor Attacks
Inserting hidden functionality into the model through poisoned data.
Example:
Training a model to recognize a specific pattern (backdoor trigger) that causes it to misclassify inputs when the trigger is present.
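A sketch of how such poisoned samples might be produced, on a synthetic batch of grayscale images; the 4×4 white patch and the 5% poison rate are arbitrary illustrative choices:

```python
import numpy as np

def poison_with_trigger(images, labels, target_label, poison_fraction=0.05, seed=0):
    """Stamp a small white square (the trigger) onto a fraction of images and relabel them."""
    images, labels = images.copy(), labels.copy()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=int(poison_fraction * len(images)), replace=False)
    images[idx, -4:, -4:] = 1.0    # 4x4 trigger patch in the bottom-right corner
    labels[idx] = target_label     # attacker-chosen class the trigger should activate
    return images, labels

# Example with synthetic 28x28 images scaled to [0, 1]
imgs = np.random.default_rng(1).random((1000, 28, 28))
lbls = np.random.default_rng(2).integers(0, 10, size=1000)
poisoned_imgs, poisoned_lbls = poison_with_trigger(imgs, lbls, target_label=7)
```

A model trained on the poisoned set behaves normally on clean inputs but predicts the attacker's target label whenever the trigger patch is present.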
Impact of Poisoning Attacks
- Performance Degradation: Reduced accuracy on legitimate data
- Security Bypass: Model fails on specific inputs
- Privacy Violation: Unauthorized access through backdoors
- Reputation Damage: Loss of trust in AI systems
Model Extraction Attacks
What is Model Extraction?
Model extraction attacks aim to steal or reconstruct machine learning models by querying them and analyzing the responses.
1. Black-box Extraction
Extracting model functionality without access to internal parameters.
Process (sketched in code after the list):
- Query the target model with various inputs
- Collect input-output pairs
- Train a surrogate model on the collected data
- Use the surrogate model for inference
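A self-contained toy of these four steps using scikit-learn; the "victim" model, its private data, and the query budget are synthetic stand-ins:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical victim model the attacker can only query (no access to its parameters)
X_private = rng.random((2000, 10))
y_private = (X_private.sum(axis=1) > 5).astype(int)
victim = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
victim.fit(X_private, y_private)

# 1.-2. Query the victim with attacker-chosen inputs and collect input-output pairs
X_queries = rng.random((5000, 10))
y_stolen = victim.predict(X_queries)

# 3. Train a surrogate model on the collected pairs
surrogate = DecisionTreeClassifier(random_state=0).fit(X_queries, y_stolen)

# 4. Use the surrogate for inference; agreement with the victim measures extraction fidelity
X_test = rng.random((1000, 10))
fidelity = (surrogate.predict(X_test) == victim.predict(X_test)).mean()
print(f"Surrogate agrees with the victim on {fidelity:.1%} of test queries")
```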
2. Model Inversion
Reconstructing training data from model outputs.
Process (sketched in code below):
- Query model with specific target outputs
- Use optimization to find inputs that produce target outputs
- Reconstruct approximations of the original training data
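A gradient-based sketch of this idea for an image classifier, again assuming a placeholder PyTorch `model`; practical inversion attacks add image priors and regularization to obtain recognizable reconstructions:

```python
import torch
import torch.nn.functional as F

def invert_class(model, target_class, input_shape=(1, 1, 28, 28), steps=500, lr=0.1):
    """Optimize an input so the model assigns high confidence to target_class."""
    x = torch.zeros(input_shape, requires_grad=True)      # start from a blank image
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        logits = model(x.clamp(0, 1))
        # Minimize the negative log-likelihood of the target class
        loss = F.cross_entropy(logits, torch.tensor([target_class]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # A crude approximation of what the model "remembers" about that class
    return x.clamp(0, 1).detach()
```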
3. Membership Inference
Determining if specific data points were in the training set.
Process (sketched in code below):
- Train shadow models on similar data
- Analyze confidence scores and outputs
- Build inference models to detect membership
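A compact shadow-model sketch with scikit-learn; the synthetic data, the four shadow models, and the random-forest/logistic-regression choices are all illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((4000, 10))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# 1. Train shadow models on data drawn from a distribution similar to the victim's
attack_features, attack_labels = [], []
for s in range(4):
    idx = rng.permutation(len(X))
    in_idx, out_idx = idx[:1000], idx[1000:2000]
    shadow = RandomForestClassifier(n_estimators=50, random_state=s).fit(X[in_idx], y[in_idx])
    # 2. Record confidence vectors for members ("in") and non-members ("out")
    for rows, is_member in [(in_idx, 1), (out_idx, 0)]:
        attack_features.append(shadow.predict_proba(X[rows]))
        attack_labels.append(np.full(len(rows), is_member))

# 3. Train an attack model that maps confidence scores to a membership decision
attack_model = LogisticRegression(max_iter=1000).fit(
    np.vstack(attack_features), np.concatenate(attack_labels)
)
# Applied to the victim: attack_model.predict(victim.predict_proba(record))
```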
Consequences of Model Extraction
- Intellectual Property Theft: Stealing proprietary models
- Privacy Violations: Accessing sensitive training data
- Competitive Advantage: Using stolen models for profit
- Security Bypass: Understanding model behavior for attacks
Privacy Attacks on ML Models
Types of Privacy Attacks
1. Training Data Reconstruction
Reconstructing individual training examples from model parameters or outputs.
Methods:
- Model Inversion: Using model outputs to reconstruct inputs
- Gradient-based Attacks: Using gradients to reconstruct data
- GAN-based Reconstruction: Using generative models
2. Attribute Inference
Inferring sensitive attributes about individuals from model outputs.
Example:
From a recommendation system's output, inferring a user's age, gender, or political preferences.
3. Property Inference
Inferring properties of the training dataset.
Example:
Determining the distribution of sensitive attributes in the training data.
Privacy Protection Techniques
- Differential Privacy: Adding noise to protect individual privacy (see the example after this list)
- Federated Learning: Training without centralizing data
- Homomorphic Encryption: Computing on encrypted data
- Secure Multi-party Computation: Computing without revealing inputs
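As a concrete example of the first technique, the Laplace mechanism below answers a count query with noise calibrated to the query's sensitivity; the synthetic "patient age" data and the epsilon value are illustrative:

```python
import numpy as np

def private_count(values, predicate, epsilon=1.0, seed=None):
    """Differentially private count: true count plus Laplace noise scaled to sensitivity/epsilon."""
    rng = np.random.default_rng(seed)
    true_count = int(np.sum(predicate(values)))
    sensitivity = 1.0   # adding or removing one record changes a count by at most 1
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privately estimate how many synthetic "patients" are over 60
ages = np.random.default_rng(0).integers(18, 90, size=10_000)
print(private_count(ages, lambda a: a > 60, epsilon=0.5))
```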
Hands-On Exercise
Exercise: Threat Analysis Workshop
Objective: Analyze different AI systems and identify applicable threats.
Scenarios:
Scenario 1: Autonomous Vehicle AI
An AI system that processes camera feeds to identify road signs and obstacles.
- What types of attacks are possible?
- Which threats pose the highest risk?
- What would be the impact of successful attacks?
Scenario 2: Medical Diagnosis AI
An AI system that analyzes medical images to detect diseases.
- What privacy concerns exist?
- How could the system be compromised?
- What are the ethical implications?
Scenario 3: Financial Fraud Detection
An AI system that monitors transactions to detect fraudulent activity.
- What evasion techniques could be used?
- How could attackers bypass the system?
- What data could be extracted?
Deliverables:
- Threat assessment for each scenario
- Risk matrix with likelihood and impact
- Recommended mitigation strategies
- Incident response procedures