Adversarial Attacks

Techniques that manipulate inputs to machine learning models to cause misclassification or unexpected behavior, often through imperceptible perturbations.

Also known as: Adversarial Examples, AI Attacks

What are Adversarial Attacks?

Adversarial attacks are techniques designed to fool machine learning models by introducing carefully crafted perturbations to input data. These attacks exploit vulnerabilities in how models process information, causing them to make incorrect predictions or classifications.
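In the standard formulation (notation here is a common convention, not taken from this glossary), the attacker searches for the perturbation within a small budget ε that most increases the model's loss:

```latex
\delta^{*} = \arg\max_{\|\delta\|_{\infty} \le \epsilon} \; \mathcal{L}\!\left(f_{\theta}(x + \delta),\, y\right),
\qquad x_{\mathrm{adv}} = x + \delta^{*}
```

The budget constraint is what keeps the attack imperceptible: the adversarial input differs from the original by at most ε in every coordinate, yet the model's prediction changes.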

Types of Adversarial Attacks

Evasion Attacks

  • Modify inputs at inference time
  • Most common type
  • Examples: adversarial images, audio
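As a minimal illustration of an evasion attack, the sketch below trains a linear scikit-learn classifier and then perturbs a point at inference time just enough to push it across the decision boundary (for a linear model, the loss gradient with respect to the input points along the weight vector). The dataset, model, and step size are illustrative choices, not prescribed by any particular attack.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a simple "victim" classifier on synthetic data.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

x = X[0].copy()
label = clf.predict([x])[0]          # the prediction we want to flip

# For logistic regression, the input-gradient of the loss is proportional to
# the weight vector, signed by the class. Step against the current class.
w = clf.coef_[0]
direction = np.sign(w) * (1 if label == 0 else -1)

# Smallest L-infinity step along that direction guaranteed to cross the boundary.
margin = clf.decision_function([x])[0]
epsilon = 1.1 * abs(margin) / np.sum(np.abs(w))

x_adv = x + epsilon * direction      # evasion: modify the input at inference time

print("clean prediction:      ", clf.predict([x])[0])
print("adversarial prediction:", clf.predict([x_adv])[0])
print("max per-feature change:", np.max(np.abs(x_adv - x)))
```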

Poisoning Attacks

  • Corrupt training data
  • Degrade model performance
  • Insert backdoors
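The simplest poisoning attack to sketch is label flipping: corrupt a fraction of the training labels before training and the resulting model degrades. The dataset, model, and 30% flip rate below are illustrative; targeted flips (for example, of points near the decision boundary) typically do more damage than random ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Clean baseline.
clean = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Poisoning: flip the labels of a random 30% of the training set.
rng = np.random.default_rng(0)
idx = rng.choice(len(y_train), size=int(0.3 * len(y_train)), replace=False)
y_poisoned = y_train.copy()
y_poisoned[idx] = 1 - y_poisoned[idx]            # binary label flip

poisoned = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)

print("clean model accuracy:   ", clean.score(X_test, y_test))
print("poisoned model accuracy:", poisoned.score(X_test, y_test))
```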

Model Extraction

  • Steal model functionality
  • Query-based attacks
  • Reverse engineering
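A hedged sketch of query-based model extraction: the attacker sends inputs to a "victim" model, records its predicted labels, and trains a local surrogate on those query/response pairs. The victim, surrogate, and query budget below are all illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# The victim model: the attacker can call predict() but sees nothing else.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
victim = RandomForestClassifier(random_state=0).fit(X, y)

# Attacker: sample random queries and record the victim's answers.
rng = np.random.default_rng(1)
queries = rng.normal(size=(5000, 10))            # query budget of 5,000 inputs
answers = victim.predict(queries)                # labels returned by the victim

# Train a local surrogate ("stolen") model on the query/answer pairs.
surrogate = LogisticRegression(max_iter=1000).fit(queries, answers)

# Agreement between surrogate and victim on fresh inputs approximates fidelity.
X_eval = rng.normal(size=(2000, 10))
fidelity = np.mean(surrogate.predict(X_eval) == victim.predict(X_eval))
print(f"surrogate/victim agreement: {fidelity:.2%}")
```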

Model Inversion

  • Extract training data
  • Privacy violations
  • Membership inference
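Membership inference, listed above, is the easiest of these to sketch: an overfit model is usually more confident on examples it was trained on than on unseen ones, so thresholding its confidence can reveal whether a record was in the training set. The dataset, model, and threshold below are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_out, y_train, y_out = train_test_split(X, y, test_size=0.5, random_state=0)

# A deliberately overfit target model (small training set, unconstrained trees).
target = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

def confidence(model, X):
    """Probability the model assigns to its own predicted class."""
    return model.predict_proba(X).max(axis=1)

# The attacker guesses "member" whenever confidence exceeds a threshold.
threshold = 0.9
flagged_members = confidence(target, X_train) > threshold   # actual members
flagged_outsiders = confidence(target, X_out) > threshold   # actual non-members

print("flagged as members (actual members):    ", flagged_members.mean())
print("flagged as members (actual non-members):", flagged_outsiders.mean())
```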

Attack Methods

White-Box Attacks

The attacker has full access to the model, including its architecture, parameters, and gradients:

  • FGSM (Fast Gradient Sign Method)
  • PGD (Projected Gradient Descent)
  • C&W attack (Carlini & Wagner)
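FGSM is the simplest of these to write down: take one step of size ε in the direction of the sign of the loss gradient with respect to the input; PGD applies the same step iteratively with projection back into the ε-ball. A minimal PyTorch sketch follows; the tiny untrained network is only a stand-in for a real trained classifier.

```python
import torch
import torch.nn as nn

# Stand-in classifier; in practice this would be a trained model.
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()

def fgsm(model, x, y, epsilon):
    """One-step Fast Gradient Sign Method (white-box: uses input gradients)."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    # Step in the direction that increases the loss, then clip to valid range.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

x = torch.rand(1, 784)                 # a fake "image" with pixels in [0, 1]
y = torch.tensor([3])                  # its (assumed) true label
x_adv = fgsm(model, x, y, epsilon=0.1)

print("loss before:", loss_fn(model(x), y).item())
print("loss after: ", loss_fn(model(x_adv), y).item())
```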

Black-Box Attacks

The attacker has no access to the model's internals and can only query it and observe its outputs:

  • Transfer attacks
  • Query-based attacks
  • Boundary attacks
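A hedged sketch of a score-based black-box attack: the attacker can only query the model's output probabilities, so it tries random single-feature nudges and keeps the ones that lower the score of the current class (a naive random-search variant; the victim model, step size, and query budget are illustrative).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
victim = RandomForestClassifier(random_state=0).fit(X, y)

def class_score(x, label):
    """The attacker's only access: the victim's output probability for `label`."""
    return victim.predict_proba(x.reshape(1, -1))[0, label]

x = X[0].copy()
label = int(victim.predict(x.reshape(1, -1))[0])   # the prediction to flip

rng = np.random.default_rng(0)
x_adv, step = x.copy(), 0.5

# Greedy random search: keep any nudge that lowers the score of the current
# class; stop once the prediction flips. No gradients are ever used.
for _ in range(1000):
    candidate = x_adv.copy()
    candidate[rng.integers(x.size)] += rng.choice([-step, step])
    if class_score(candidate, label) < class_score(x_adv, label):
        x_adv = candidate
    if victim.predict(x_adv.reshape(1, -1))[0] != label:
        break

print("original prediction:    ", label)
print("prediction after attack:", victim.predict(x_adv.reshape(1, -1))[0])
```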

Defense Strategies

  • Adversarial training
  • Input preprocessing
  • Defensive distillation
  • Certified defenses
  • Ensemble methods
  • Detection mechanisms
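Adversarial training, the first defense listed, folds attack generation into the training loop: each batch is perturbed (here with FGSM) before the gradient step, so the model learns to classify the perturbed versions correctly. A minimal PyTorch sketch on synthetic data; the architecture, ε, and hyperparameters are all illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Synthetic two-class data standing in for a real dataset.
X = torch.randn(512, 20)
y = (X[:, 0] > 0).long()

def fgsm(x, y, epsilon):
    """Generate FGSM adversarial examples against the current model."""
    x = x.clone().detach().requires_grad_(True)
    loss_fn(model(x), y).backward()
    return (x + epsilon * x.grad.sign()).detach()

for epoch in range(50):
    # Adversarial training: perturb the batch first, then train on it.
    x_adv = fgsm(X, y, epsilon=0.1)
    optimizer.zero_grad()        # also clears parameter grads left by fgsm()
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    optimizer.step()

print("final loss on adversarial batch:", loss.item())
```

Variants often train on a mix of clean and adversarial batches, or replace the single FGSM step with multi-step PGD for a stronger inner attack.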