# What is Inference in Machine Learning?
Inference is the process of running data through a trained machine learning model to generate predictions, classifications, or outputs. It's the "production" phase of ML where models are used to make decisions on new data.
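A minimal sketch of the idea, using scikit-learn purely for illustration: weights are fitted once during training, then reused unchanged to score inputs the model has never seen.

```python
# Train once, then run inference on unseen data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Training phase: weights are learned and updated.
model = LogisticRegression(max_iter=1000).fit(X, y)

# Inference phase: weights are fixed; new inputs produce predictions.
new_samples = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]]
print(model.predict(new_samples))
```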
## Training vs. Inference
| Training | Inference |
|---|---|
| Learns patterns from data | Applies learned patterns to new data |
| Weights are updated | Weights are fixed |
| Compute-intensive, often long-running | Lighter per request, but costs accumulate at scale |
| Typically batch processing | Often real-time |
| Development phase | Production phase |
## Inference Types
### Batch Inference
- Processes large datasets in bulk
- Runs as scheduled jobs (e.g., nightly scoring)
- Offline: total throughput matters more than per-request latency
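
A batch job often reduces to loading a serialized model, scoring a whole file, and writing the results back out. The sketch below assumes a scikit-learn-style model saved with joblib; the file paths and column names are placeholders.

```python
# Batch inference sketch: score an entire dataset offline, e.g. from a nightly job.
import joblib
import pandas as pd

model = joblib.load("model.joblib")          # previously trained, serialized model (placeholder path)
df = pd.read_parquet("customers.parquet")    # large input dataset (placeholder path)

# Score everything in one pass (or in chunks if the data does not fit in memory).
df["churn_score"] = model.predict_proba(df[["tenure", "monthly_spend"]])[:, 1]

df.to_parquet("scored_customers.parquet")    # results consumed later by downstream systems
```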
### Real-Time Inference
- Low-latency responses to individual requests
- Typically served behind API endpoints
- Powers user-facing applications
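
A minimal real-time endpoint, sketched with FastAPI; the model path and feature names are placeholders.

```python
# Real-time inference sketch: a low-latency prediction endpoint.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # load once at startup, not per request (placeholder path)

class Features(BaseModel):
    tenure: float
    monthly_spend: float

@app.post("/predict")
def predict(features: Features):
    # Single-row, synchronous prediction; latency is dominated by the model call.
    score = model.predict([[features.tenure, features.monthly_spend]])[0]
    return {"prediction": float(score)}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```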
### Streaming Inference
- Operates on a continuous flow of data
- Event-driven: each record is scored as it arrives
- Near real-time results
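
A streaming consumer scores each event as it arrives. The event source below is simulated; in practice it would be a message broker such as Kafka or Kinesis, and the model path is a placeholder.

```python
# Streaming inference sketch: score events from a continuous source, one at a time.
import itertools
import random
import time
import joblib

model = joblib.load("model.joblib")  # placeholder model path

def event_stream():
    # Stand-in for a real consumer loop reading from a broker.
    while True:
        yield {"sensor_id": random.randint(1, 5), "reading": random.random()}
        time.sleep(0.1)

for event in itertools.islice(event_stream(), 20):
    # Near real-time: each event is scored as it arrives, not in scheduled batches.
    score = model.predict([[event["reading"]]])[0]
    print(event["sensor_id"], score)
```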
## Performance Considerations
### Latency
- Time to first token (for generative models)
- Total response time
- Percentile metrics (p50/p95/p99), since tail latency is what users notice
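
A rough way to measure latency is to time each call and report percentiles rather than the mean. The `predict` function below is a stand-in for whatever inference call is being measured.

```python
# Latency measurement sketch: collect per-request times, report percentile metrics.
import time
import numpy as np

def predict(x):
    time.sleep(0.01)  # placeholder for a real model or API call
    return x

latencies_ms = []
for i in range(200):
    start = time.perf_counter()
    predict(i)
    latencies_ms.append((time.perf_counter() - start) * 1000)

# Percentiles matter more than the mean: the tail is what users actually feel.
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
```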
### Throughput
- Requests per second
- Tokens per second (for LLM workloads)
- Number of concurrent users the system can sustain
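
Throughput can be estimated by issuing many concurrent requests and dividing by wall-clock time. Again, `predict` is a placeholder for the real call, and the worker count stands in for concurrent users.

```python
# Throughput measurement sketch: concurrent requests divided by elapsed time.
import time
from concurrent.futures import ThreadPoolExecutor

def predict(x):
    time.sleep(0.01)  # placeholder for a real model or API call
    return x

n_requests = 500
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:  # simulates 32 concurrent users
    list(pool.map(predict, range(n_requests)))
elapsed = time.perf_counter() - start

print(f"{n_requests / elapsed:.1f} requests/second over {elapsed:.2f} s")
```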
### Cost
- Compute resources (GPU/CPU hours)
- Per-token or per-request API pricing
- Infrastructure and operational overhead
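
A back-of-the-envelope estimate for a token-priced API; the prices and traffic numbers below are made-up placeholders, not any provider's actual rates.

```python
# Cost estimation sketch for a token-priced API. All numbers are assumptions.
PRICE_PER_1M_INPUT_TOKENS = 3.00    # USD, hypothetical rate
PRICE_PER_1M_OUTPUT_TOKENS = 15.00  # USD, hypothetical rate

requests_per_day = 50_000
input_tokens_per_request = 800
output_tokens_per_request = 300

daily_cost = (
    requests_per_day * input_tokens_per_request / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS
    + requests_per_day * output_tokens_per_request / 1_000_000 * PRICE_PER_1M_OUTPUT_TOKENS
)
print(f"Estimated API cost: ${daily_cost:,.2f}/day, ${daily_cost * 30:,.2f}/month")
```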
## Optimization Techniques
### Model Optimization
- Quantization: store weights (and sometimes activations) at lower precision, e.g. INT8
- Pruning: remove weights or structures that contribute little to accuracy
- Knowledge distillation: train a smaller "student" model to mimic a larger "teacher"
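
As one concrete example, dynamic quantization in PyTorch converts Linear-layer weights to INT8 with a single call; the toy model below is a stand-in for a real network.

```python
# Quantization sketch: dynamic INT8 quantization of a small PyTorch model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Linear-layer weights are stored in INT8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized(x).shape)  # same interface, smaller and often faster on CPU
```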
### Infrastructure
- GPU acceleration
- Batching requests into a single forward pass
- Caching repeated or partial results (e.g., KV caches for LLMs)
- Model serving frameworks
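
Batching is often the single biggest infrastructure win: grouping requests into one forward pass amortizes per-call overhead. The toy matrix-multiply "model" below illustrates the effect.

```python
# Batching sketch: one batched call usually beats many per-item calls.
import time
import numpy as np

weights = np.random.randn(512, 512)  # toy stand-in for a real network

def predict_one(x):
    return x @ weights

requests = [np.random.randn(512) for _ in range(1000)]

start = time.perf_counter()
_ = [predict_one(r) for r in requests]          # one call per request
one_by_one = time.perf_counter() - start

start = time.perf_counter()
_ = np.stack(requests) @ weights                # one batched call for all requests
batched = time.perf_counter() - start

print(f"per-request: {one_by_one*1000:.1f} ms, batched: {batched*1000:.1f} ms")
```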
## Deployment Options
- Cloud APIs (OpenAI, Anthropic)
- Self-hosted serving (vLLM, TGI)
- Edge deployment, on-device or close to users
- Hybrid approaches that mix the above
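
Because vLLM and TGI expose OpenAI-compatible endpoints, the same client code can target a cloud API or a self-hosted server by changing the base URL; the URL, API key, and model name below are placeholders.

```python
# Deployment sketch: one OpenAI-style client, pointed at either a cloud API or a
# self-hosted OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder: cloud provider or your own server
    api_key="not-needed-for-local",       # placeholder: self-hosted servers often ignore the key
)

response = client.chat.completions.create(
    model="my-served-model",              # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize what inference is in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```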