📚 Learning Objectives

By the end of this lesson, you will be able to:

  • Describe the main categories of attacks against AI systems: evasion, poisoning, extraction, and privacy attacks
  • Explain how adversarial example techniques such as FGSM, PGD, and C&W generate evasive inputs
  • Recognize how data poisoning and backdoor attacks compromise model training
  • Identify the risks posed by model extraction, model inversion, and membership inference
  • Analyze realistic AI systems and recommend mitigations for the threats they face

🎯 AI Threat Taxonomy

Primary Threat Categories

🔴 Evasion Attacks

Attacks during inference to cause misclassification

  • Adversarial examples
  • FGSM, PGD, C&W attacks
  • Universal adversarial perturbations

🟡 Poisoning Attacks

Attacks during training to compromise model behavior

  • Data poisoning
  • Label flipping
  • Backdoor insertion

🟠 Extraction Attacks

Attacks to steal or reconstruct models

  • Model extraction
  • Model inversion
  • Membership inference

🔵 Privacy Attacks

Attacks to extract sensitive information

  • Training data reconstruction
  • Attribute inference
  • Property inference

⚔️ Evasion Attacks (Adversarial Examples)

What are Adversarial Examples?

Adversarial examples are inputs that have been subtly modified to cause machine learning models to make incorrect predictions, while remaining virtually indistinguishable from the original inputs to human observers.

Fast Gradient Sign Method (FGSM)

Principle: Uses the gradient of the loss function to create adversarial examples

x' = x + ε × sign(∇ₓJ(θ, x, y))
  • Advantages: Fast and simple
  • Disadvantages: Single-step attack, easily defended
  • Use Case: Quick adversarial example generation
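
To make the update rule concrete, here is a minimal PyTorch sketch of FGSM against a toy classifier. The model, input shapes, ε value, and the `fgsm_attack` helper are illustrative assumptions, not part of the lesson material.

```python
import torch
import torch.nn as nn

# Toy classifier standing in for the target model (illustrative assumption).
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
loss_fn = nn.CrossEntropyLoss()

def fgsm_attack(model, x, y, epsilon=0.1):
    """Single-step FGSM: x' = x + epsilon * sign(grad_x J(theta, x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()                        # fills x_adv.grad with dJ/dx
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0, 1).detach()      # keep pixels in a valid range

# Usage with random stand-in "images" and labels.
x = torch.rand(4, 1, 28, 28)
y = torch.randint(0, 10, (4,))
x_adv = fgsm_attack(model, x, y)
print((x_adv - x).abs().max())             # perturbation is bounded by epsilon
```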

Projected Gradient Descent (PGD)

Principle: Iterative version of FGSM with projection

x₀ = x + random noise (sampled inside the ε-ball)
xₙ₊₁ = Π(xₙ + α × sign(∇ₓJ(θ, xₙ, y)))

where Π projects each iterate back onto the allowed perturbation region, e.g. the ε-ball around the original input x.
  • Advantages: Strong attack, hard to defend
  • Disadvantages: Computationally expensive
  • Use Case: Robustness testing
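
A minimal sketch of this iterative variant, again against an assumed toy PyTorch model; the `pgd_attack` helper, step size α, ε, and iteration count are illustrative assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # stand-in target model
loss_fn = nn.CrossEntropyLoss()

def pgd_attack(model, x, y, epsilon=0.1, alpha=0.02, steps=10):
    """Iterative FGSM with projection onto the L-infinity epsilon-ball around x."""
    # x0 = x + random noise inside the epsilon-ball
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        loss.backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()           # gradient-sign step
            x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)    # projection step (Π)
            x_adv = x_adv.clamp(0, 1)                           # stay in valid pixel range
    return x_adv.detach()

x = torch.rand(4, 1, 28, 28)
y = torch.randint(0, 10, (4,))
x_adv = pgd_attack(model, x, y)
```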

Carlini & Wagner (C&W)

Principle: Optimizes for minimal perturbation

minimize ||δ||₂ + c × f(x + δ)
  • Advantages: Very effective, minimal perturbation
  • Disadvantages: Slow, requires optimization
  • Use Case: Breaking defensive measures
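
The heavily simplified sketch below captures the spirit of the C&W L2 formulation: it optimizes the perturbation δ directly with Adam against an assumed toy model, using a logit-margin term for f. The original attack additionally uses a tanh change of variables and a binary search over c, both omitted here.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # stand-in target model

def cw_l2_attack(model, x, y, c=1.0, steps=200, lr=0.01, kappa=0.0):
    """Simplified C&W-style L2 attack: minimize ||delta||_2^2 + c * f(x + delta),
    where f penalizes the margin of the true class over the best other class."""
    delta = torch.zeros_like(x, requires_grad=True)
    optimizer = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        logits = model((x + delta).clamp(0, 1))
        true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
        other_logit = logits.scatter(1, y.unsqueeze(1), float('-inf')).max(dim=1).values
        f = torch.clamp(true_logit - other_logit + kappa, min=0)   # 0 once misclassified
        loss = (delta.flatten(1).norm(dim=1) ** 2 + c * f).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return (x + delta).clamp(0, 1).detach()

x = torch.rand(4, 1, 28, 28)
y = torch.randint(0, 10, (4,))
x_adv = cw_l2_attack(model, x, y)
```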

☠️ Data Poisoning Attacks

Types of Data Poisoning

1. Label Flipping

Changing the labels of training examples to mislead the model.

Example:

In a spam detection system, relabeling legitimate training emails as spam, which teaches the trained model to flag similar legitimate emails as spam.
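
A minimal sketch of label flipping on a NumPy label array; the `flip_labels` helper, class indices, and flip fraction are illustrative assumptions.

```python
import numpy as np

def flip_labels(labels, flip_fraction=0.1, source_class=0, target_class=1, seed=0):
    """Label-flipping sketch: relabel a fraction of one class as another, e.g.
    relabeling 'legitimate' (0) emails as 'spam' (1) in a spam filter's training set."""
    rng = np.random.default_rng(seed)
    poisoned = labels.copy()
    candidates = np.where(poisoned == source_class)[0]
    n_flip = int(flip_fraction * len(candidates))
    flipped = rng.choice(candidates, size=n_flip, replace=False)
    poisoned[flipped] = target_class
    return poisoned

y = np.array([0, 0, 0, 0, 1, 1, 0, 0, 1, 0])
print(flip_labels(y, flip_fraction=0.5))    # half of the class-0 labels become 1
```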

2. Data Injection

Adding malicious samples to the training dataset.

Example:

Injecting images with specific triggers into a facial recognition training set to create backdoors.

3. Backdoor Attacks

Inserting hidden functionality into the model through poisoned data.

Example:

Training a model to recognize a specific pattern (backdoor trigger) that causes it to misclassify inputs when the trigger is present.
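
The sketch below illustrates backdoor poisoning on a NumPy image array; the trigger size and position, the poisoning rate, and the `add_backdoor` helper are illustrative assumptions.

```python
import numpy as np

def add_backdoor(images, labels, target_label=7, poison_fraction=0.05, seed=0):
    """Backdoor poisoning sketch: stamp a small white square (the trigger) onto a
    fraction of training images and relabel them as the attacker's target class."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_fraction * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -4:, -4:] = 1.0      # 4x4 trigger patch in the bottom-right corner
    labels[idx] = target_label       # poisoned samples all point to the target class
    return images, labels, idx

x = np.random.rand(100, 28, 28)              # stand-in grayscale training images
y = np.random.randint(0, 10, size=100)
x_poisoned, y_poisoned, poisoned_idx = add_backdoor(x, y)
```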

Impact of Poisoning Attacks

  • Performance Degradation: Reduced accuracy on legitimate data
  • Security Bypass: Model fails on specific inputs
  • Privacy Violation: Unauthorized access through backdoors
  • Reputation Damage: Loss of trust in AI systems

🕵️ Model Extraction Attacks

What is Model Extraction?

Model extraction attacks aim to steal or reconstruct machine learning models by querying them and analyzing the responses.

1. Black-box Extraction

Extracting model functionality without access to internal parameters.

Process:
  1. Query the target model with various inputs
  2. Collect input-output pairs
  3. Train a surrogate model on the collected data
  4. Use the surrogate model for inference
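
The sketch below walks through these four steps with scikit-learn, using a locally trained classifier as a stand-in for the remote target (a real attack would send the queries to an API).

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the remote target model; the attacker would normally only see an API.
target = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
X_private = np.random.rand(1000, 10)
y_private = (X_private.sum(axis=1) > 5).astype(int)
target.fit(X_private, y_private)

# Steps 1-2: query the target with attacker-chosen inputs and record its predictions.
X_queries = np.random.rand(2000, 10)
y_stolen = target.predict(X_queries)

# Step 3: train a surrogate model on the collected input-output pairs.
surrogate = DecisionTreeClassifier(random_state=0).fit(X_queries, y_stolen)

# Step 4: the surrogate approximates the target and can now be used offline.
X_test = np.random.rand(500, 10)
agreement = (surrogate.predict(X_test) == target.predict(X_test)).mean()
print(f"surrogate agrees with target on {agreement:.0%} of test queries")
```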

2. Model Inversion

Reconstructing training data from model outputs.

Process:
  1. Query model with specific target outputs
  2. Use optimization to find inputs that produce target outputs
  3. Reconstruct original training data
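
As a rough illustration, the sketch below runs gradient descent on the input of an assumed toy PyTorch model to maximize one class's confidence; the `invert_class` helper and its hyperparameters are illustrative, and meaningful reconstructions generally require an overfit target.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # stand-in target model

def invert_class(model, target_class, steps=500, lr=0.1):
    """Model-inversion sketch: optimize the *input* so the model assigns high
    confidence to the target class; against overfit models this can recover
    an average-looking image of that class."""
    x = torch.full((1, 1, 28, 28), 0.5, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        logits = model(x.clamp(0, 1))
        loss = -logits[0, target_class]       # maximize the target-class logit
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return x.clamp(0, 1).detach()

reconstruction = invert_class(model, target_class=3)
```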

3. Membership Inference

Determining if specific data points were in the training set.

Process:
  1. Train shadow models on similar data
  2. Analyze confidence scores and outputs
  3. Build inference models to detect membership
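
Here is a simplified scikit-learn sketch of that pipeline, using a single shadow model and the maximum confidence score as the only attack feature; real attacks typically use many shadow models and richer features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Shadow data assumed to resemble the target model's training distribution.
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_in, y_in = X[:1000], y[:1000]     # "members": used to train the shadow model
X_out = X[1000:]                    # "non-members": held out

# Step 1: train a shadow model that mimics the (unknown) target model.
shadow = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_in, y_in)

# Step 2: collect confidence scores for members and non-members.
conf_in = shadow.predict_proba(X_in).max(axis=1)    # tends to be higher (memorization)
conf_out = shadow.predict_proba(X_out).max(axis=1)  # tends to be lower

# Step 3: train an attack model that maps confidence -> membership.
features = np.concatenate([conf_in, conf_out]).reshape(-1, 1)
membership = np.concatenate([np.ones(1000), np.zeros(1000)])
attack_model = LogisticRegression().fit(features, membership)
print("attack accuracy on shadow data:", attack_model.score(features, membership))
```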

Consequences of Model Extraction

  • Intellectual Property Theft: Stealing proprietary models
  • Privacy Violations: Accessing sensitive training data
  • Competitive Advantage: Using stolen models for profit
  • Security Bypass: Understanding model behavior for attacks

🔒 Privacy Attacks on ML Models

Types of Privacy Attacks

1. Training Data Reconstruction

Reconstructing individual training examples from model parameters or outputs.

Methods:
  • Model Inversion: Using model outputs to reconstruct inputs
  • Gradient-based Attacks: Using gradients to reconstruct data
  • GAN-based Reconstruction: Using generative models

2. Attribute Inference

Inferring sensitive attributes about individuals from model outputs.

Example:

From a recommendation system's output, inferring a user's age, gender, or political preferences.

3. Property Inference

Inferring properties of the training dataset.

Example:

Determining the distribution of sensitive attributes in the training data.

Privacy Protection Techniques

  • Differential Privacy: Adding noise to protect individual privacy (see the sketch after this list)
  • Federated Learning: Training without centralizing data
  • Homomorphic Encryption: Computing on encrypted data
  • Secure Multi-party Computation: Computing without revealing inputs
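
As a small, self-contained example of the first technique in this list, the sketch below answers a counting query with the Laplace mechanism; the data, query, and ε value are illustrative, and real deployments also track a cumulative privacy budget across queries.

```python
import numpy as np

def laplace_count(records, predicate, epsilon=1.0, seed=None):
    """Answer a counting query with the Laplace mechanism. A count has sensitivity 1,
    so Laplace noise with scale 1/epsilon gives epsilon-differential privacy for
    this single query."""
    rng = np.random.default_rng(seed)
    true_count = sum(predicate(r) for r in records)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [23, 35, 41, 29, 52, 38, 61, 27]     # illustrative sensitive records
print(laplace_count(ages, lambda age: age > 40, epsilon=0.5))
```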

🧪 Hands-On Exercise

Exercise: Threat Analysis Workshop

Objective: Analyze different AI systems and identify applicable threats.

📋 Scenarios:

Scenario 1: Autonomous Vehicle AI

An AI system that processes camera feeds to identify road signs and obstacles.

  • What types of attacks are possible?
  • Which threats pose the highest risk?
  • What would be the impact of successful attacks?

Scenario 2: Medical Diagnosis AI

An AI system that analyzes medical images to detect diseases.

  • What privacy concerns exist?
  • How could the system be compromised?
  • What are the ethical implications?

Scenario 3: Financial Fraud Detection

An AI system that monitors transactions to detect fraudulent activity.

  • What evasion techniques could be used?
  • How could attackers bypass the system?
  • What data could be extracted?

📄 Deliverables:

  • Threat assessment for each scenario
  • Risk matrix with likelihood and impact
  • Recommended mitigation strategies
  • Incident response procedures

📊 Knowledge Check

Question 1: What is the main goal of evasion attacks?

Question 2: Which attack involves modifying training data?

Question 3: What is the purpose of model extraction attacks?