Quantization

A model optimization technique that reduces the precision of model weights and activations to decrease memory usage and improve inference speed.

Also known as: Model Quantization, Weight Quantization

What is Quantization?

Quantization is a technique that reduces the numerical precision of a model's weights and activations. By using lower-precision numbers (e.g., INT8 instead of FP32), quantization decreases model size and speeds up inference while maintaining acceptable accuracy.
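
As a rough illustration of the idea, the sketch below maps an FP32 weight tensor to INT8 with a single per-tensor scale and then dequantizes it back. This is a toy example, not any particular library's implementation.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Toy symmetric per-tensor INT8 quantization."""
    scale = np.abs(x).max() / 127.0  # map the largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate FP32 tensor from the INT8 values and the scale."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
approx = dequantize_int8(q, scale)
print("max absolute error:", np.abs(weights - approx).max())
```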

Precision Levels

Type | Bits | Memory per value | Use Case
FP32 | 32   | 4 bytes          | Training
FP16 | 16   | 2 bytes          | Mixed precision
BF16 | 16   | 2 bytes          | Training/inference
INT8 | 8    | 1 byte           | Inference
INT4 | 4    | 0.5 bytes        | LLM inference
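
As a back-of-the-envelope illustration of what those per-value sizes mean, the weight memory of a hypothetical 7-billion-parameter model works out as follows:

```python
PARAMS = 7_000_000_000  # hypothetical 7B-parameter model

bytes_per_value = {"FP32": 4, "FP16": 2, "BF16": 2, "INT8": 1, "INT4": 0.5}

for dtype, nbytes in bytes_per_value.items():
    gib = PARAMS * nbytes / 1024**3
    print(f"{dtype}: {gib:.1f} GiB of weights")
# FP32 ~ 26.1 GiB, FP16/BF16 ~ 13.0 GiB, INT8 ~ 6.5 GiB, INT4 ~ 3.3 GiB
```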

Quantization Types

Post-Training Quantization (PTQ)

  • Applied after training
  • No retraining needed
  • Some accuracy loss
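
A minimal sketch of the PTQ idea, assuming a simple asymmetric (affine) scheme whose range is picked from a small calibration set; production toolkits do considerably more (per-channel scales, operator fusion, etc.):

```python
import numpy as np

def calibrate_range(calibration_batches):
    """Estimate an activation range from a handful of calibration batches."""
    lo = min(batch.min() for batch in calibration_batches)
    hi = max(batch.max() for batch in calibration_batches)
    return float(lo), float(hi)

def make_affine_quantizer(lo: float, hi: float, bits: int = 8):
    """Build quantize/dequantize functions for an asymmetric (affine) INT scheme."""
    qmin, qmax = 0, 2**bits - 1
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)

    def quant(x):
        return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)

    def dequant(q):
        return (q.astype(np.float32) - zero_point) * scale

    return quant, dequant

# Calibration data stands in for real inputs seen after training.
calib = [np.random.randn(32, 16).astype(np.float32) for _ in range(8)]
lo, hi = calibrate_range(calib)
quant, dequant = make_affine_quantizer(lo, hi)

x = np.random.randn(32, 16).astype(np.float32)
print("mean abs error:", np.abs(x - dequant(quant(x))).mean())
```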

Quantization-Aware Training (QAT)

  • Simulates quantization during training
  • Better accuracy preservation
  • More compute required
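
The core trick in QAT is "fake quantization": the forward pass rounds values to the quantized grid while gradients flow through unchanged (a straight-through estimator). The sketch below illustrates that trick in PyTorch; it is a conceptual example, not PyTorch's built-in QAT API.

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Round to an INT grid in the forward pass, keep FP32 gradients in the backward pass."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.detach().abs().max() / qmax
    x_q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # Straight-through estimator: forward uses x_q, backward sees the identity.
    return x + (x_q - x).detach()

# Toy layer whose weights are fake-quantized during training.
w = torch.randn(16, 16, requires_grad=True)
x = torch.randn(4, 16)

y = x @ fake_quantize(w).t()
loss = y.pow(2).mean()
loss.backward()          # gradients reach w despite the rounding in the forward pass
print(w.grad.shape)
```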

Dynamic Quantization

  • Activations quantized on the fly at inference time
  • Weights quantized ahead of time (static), activations per input (dynamic)
  • Easy to apply
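
For example, PyTorch exposes dynamic quantization as a one-call transformation; the toy model below is illustrative, and actual speedups depend on the backend and hardware.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).eval()

# Weights of nn.Linear layers are converted to INT8 ahead of time;
# activations are quantized on the fly at inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface, smaller weights
```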

Quantization Methods

GPTQ

  • Quantizes weights layer by layer, compensating for the error introduced at each step
  • Popular for 4-bit LLM quantization
  • Good accuracy retention at low bit widths

AWQ (Activation-aware Weight Quantization)

  • Uses activation statistics to identify and protect the most important weights
  • Typically better accuracy than naive round-to-nearest quantization at the same bit width
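
Many GPTQ- and AWQ-quantized checkpoints are published pre-quantized and can usually be loaded through the standard transformers API, provided the matching backend (e.g., auto-gptq or autoawq) is installed; the repository name below is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "some-org/llama-7b-awq"  # placeholder: a pre-quantized AWQ (or GPTQ) checkpoint
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, device_map="auto")  # needs accelerate

inputs = tokenizer("Quantization lets you", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```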

GGML/GGUF

  • File format for quantized models, geared toward CPU and edge inference
  • Many quantization levels (e.g., Q8_0, Q4_K_M)
  • Part of the llama.cpp ecosystem (GGUF superseded GGML)
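
As a sketch, a 4-bit GGUF file can be loaded for local inference with the llama-cpp-python bindings; the model path and quantization level below are placeholders.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Hypothetical path to a 4-bit (Q4_K_M) GGUF file.
llm = Llama(model_path="models/llama-7b.Q4_K_M.gguf", n_ctx=2048)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```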

Benefits

  • 2-4x memory reduction (up to 8x going from FP32 to INT4)
  • Typically 2-4x faster inference, depending on hardware support
  • Enables deployment on edge devices and smaller GPUs
  • Reduces serving costs

Trade-offs

  • Potential accuracy loss, especially at 4 bits and below
  • Impact varies by task
  • Static quantization needs representative calibration data
  • Not all operators and hardware have low-precision kernel support

Best Practices

  • Start at a higher precision (e.g., INT8) before trying more aggressive 4-bit schemes
  • Evaluate quality on your own tasks, not just generic benchmarks
  • Use calibration data that is representative of production inputs
  • Test edge cases
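
One way to put these practices together is to measure how far quantized outputs drift from full-precision outputs on held-out data before shipping; `full_model`, `quant_model`, and `eval_batches` below are hypothetical placeholders.

```python
import torch

@torch.no_grad()
def output_drift(full_model, quant_model, eval_batches):
    """Mean absolute difference between full-precision and quantized outputs on held-out data."""
    diffs = []
    for x in eval_batches:
        diffs.append((full_model(x) - quant_model(x)).abs().mean().item())
    return sum(diffs) / len(diffs)

# Usage (models and data are placeholders):
# drift = output_drift(fp32_model, int8_model, validation_batches)
# print(f"mean output drift: {drift:.4f}")
```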