# What is Inference in Machine Learning?
Inference is the process of running data through a trained machine learning model to generate predictions, classifications, or outputs. It's the "production" phase of ML where models are used to make decisions on new data.
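A minimal sketch of the idea, using scikit-learn purely for illustration: weights are fitted once during training, then reused unchanged to score inputs the model has never seen.

```python
# Train once, then run inference on unseen data.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Training phase: weights are learned and updated.
model = LogisticRegression(max_iter=1000).fit(X, y)

# Inference phase: weights are fixed; new inputs produce predictions.
new_samples = [[5.1, 3.5, 1.4, 0.2], [6.7, 3.0, 5.2, 2.3]]
print(model.predict(new_samples))
```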
## Training vs. Inference
| Training | Inference |
|---|---|
| Learns patterns from data | Applies learned patterns to new data |
| Weights are updated | Weights are fixed |
| Compute-intensive, often long-running | Lighter per request, but costs accumulate at scale |
| Typically batch processing | Often real-time |
| Development phase | Production phase |
## Inference Types
### Batch Inference
- Processes large datasets in bulk
- Runs as scheduled jobs (e.g., nightly scoring)
- Offline: total throughput matters more than per-request latency
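
A batch job often reduces to loading a serialized model, scoring a whole file, and writing the results back out. The sketch below assumes a scikit-learn-style model saved with joblib; the file paths and column names are placeholders.

```python
# Batch inference sketch: score an entire dataset offline, e.g. from a nightly job.
import joblib
import pandas as pd

model = joblib.load("model.joblib")          # previously trained, serialized model (placeholder path)
df = pd.read_parquet("customers.parquet")    # large input dataset (placeholder path)

# Score everything in one pass (or in chunks if the data does not fit in memory).
df["churn_score"] = model.predict_proba(df[["tenure", "monthly_spend"]])[:, 1]

df.to_parquet("scored_customers.parquet")    # results consumed later by downstream systems
```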
### Real-Time Inference
- Low-latency responses to individual requests
- Typically served behind API endpoints
- Powers user-facing applications
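
A minimal real-time endpoint, sketched with FastAPI; the model path and feature names are placeholders.

```python
# Real-time inference sketch: a low-latency prediction endpoint.
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # load once at startup, not per request (placeholder path)

class Features(BaseModel):
    tenure: float
    monthly_spend: float

@app.post("/predict")
def predict(features: Features):
    # Single-row, synchronous prediction; latency is dominated by the model call.
    score = model.predict([[features.tenure, features.monthly_spend]])[0]
    return {"prediction": float(score)}

# Run with: uvicorn app:app --host 0.0.0.0 --port 8000
```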
### Streaming Inference
- Operates on a continuous flow of data
- Event-driven: each record is scored as it arrives
- Near real-time results
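
A streaming consumer scores each event as it arrives. The event source below is simulated; in practice it would be a message broker such as Kafka or Kinesis, and the model path is a placeholder.

```python
# Streaming inference sketch: score events from a continuous source, one at a time.
import itertools
import random
import time
import joblib

model = joblib.load("model.joblib")  # placeholder model path

def event_stream():
    # Stand-in for a real consumer loop reading from a broker.
    while True:
        yield {"sensor_id": random.randint(1, 5), "reading": random.random()}
        time.sleep(0.1)

for event in itertools.islice(event_stream(), 20):
    # Near real-time: each event is scored as it arrives, not in scheduled batches.
    score = model.predict([[event["reading"]]])[0]
    print(event["sensor_id"], score)
```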
## Performance Considerations
### Latency
- Time to first token (for generative models)
- Total response time
- Percentile metrics (p50/p95/p99), since tail latency is what users notice
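
A rough way to measure latency is to time each call and report percentiles rather than the mean. The `predict` function below is a stand-in for whatever inference call is being measured.

```python
# Latency measurement sketch: collect per-request times, report percentile metrics.
import time
import numpy as np

def predict(x):
    time.sleep(0.01)  # placeholder for a real model or API call
    return x

latencies_ms = []
for i in range(200):
    start = time.perf_counter()
    predict(i)
    latencies_ms.append((time.perf_counter() - start) * 1000)

# Percentiles matter more than the mean: the tail is what users actually feel.
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms")
```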
### Throughput
- Requests per second
- Tokens per second (for LLM workloads)
- Number of concurrent users the system can sustain
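
Throughput can be estimated by issuing many concurrent requests and dividing by wall-clock time. Again, `predict` is a placeholder for the real call, and the worker count stands in for concurrent users.

```python
# Throughput measurement sketch: concurrent requests divided by elapsed time.
import time
from concurrent.futures import ThreadPoolExecutor

def predict(x):
    time.sleep(0.01)  # placeholder for a real model or API call
    return x

n_requests = 500
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=32) as pool:  # simulates 32 concurrent users
    list(pool.map(predict, range(n_requests)))
elapsed = time.perf_counter() - start

print(f"{n_requests / elapsed:.1f} requests/second over {elapsed:.2f} s")
```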
### Cost
- Compute resources (GPU/CPU hours)
- Per-token or per-request API pricing
- Infrastructure and operational overhead
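
A back-of-the-envelope estimate for a token-priced API; the prices and traffic numbers below are made-up placeholders, not any provider's actual rates.

```python
# Cost estimation sketch for a token-priced API. All numbers are assumptions.
PRICE_PER_1M_INPUT_TOKENS = 3.00    # USD, hypothetical rate
PRICE_PER_1M_OUTPUT_TOKENS = 15.00  # USD, hypothetical rate

requests_per_day = 50_000
input_tokens_per_request = 800
output_tokens_per_request = 300

daily_cost = (
    requests_per_day * input_tokens_per_request / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS
    + requests_per_day * output_tokens_per_request / 1_000_000 * PRICE_PER_1M_OUTPUT_TOKENS
)
print(f"Estimated API cost: ${daily_cost:,.2f}/day, ${daily_cost * 30:,.2f}/month")
```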
## Optimization Techniques
### Model Optimization
- Quantization: store weights (and sometimes activations) at lower precision, e.g. INT8
- Pruning: remove weights or structures that contribute little to accuracy
- Knowledge distillation: train a smaller "student" model to mimic a larger "teacher"
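
As one concrete example, dynamic quantization in PyTorch converts Linear-layer weights to INT8 with a single call; the toy model below is a stand-in for a real network.

```python
# Quantization sketch: dynamic INT8 quantization of a small PyTorch model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Linear-layer weights are stored in INT8; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized(x).shape)  # same interface, smaller and often faster on CPU
```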
### Infrastructure
- GPU acceleration
- Batching requests into a single forward pass
- Caching repeated or partial results (e.g., KV caches for LLMs)
- Model serving frameworks
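
Batching is often the single biggest infrastructure win: grouping requests into one forward pass amortizes per-call overhead. The toy matrix-multiply "model" below illustrates the effect.

```python
# Batching sketch: one batched call usually beats many per-item calls.
import time
import numpy as np

weights = np.random.randn(512, 512)  # toy stand-in for a real network

def predict_one(x):
    return x @ weights

requests = [np.random.randn(512) for _ in range(1000)]

start = time.perf_counter()
_ = [predict_one(r) for r in requests]          # one call per request
one_by_one = time.perf_counter() - start

start = time.perf_counter()
_ = np.stack(requests) @ weights                # one batched call for all requests
batched = time.perf_counter() - start

print(f"per-request: {one_by_one*1000:.1f} ms, batched: {batched*1000:.1f} ms")
```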
## Deployment Options
- Cloud APIs (OpenAI, Anthropic)
- Self-hosted serving (vLLM, TGI)
- Edge deployment, on-device or close to users
- Hybrid approaches that mix the above
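
Because vLLM and TGI expose OpenAI-compatible endpoints, the same client code can target a cloud API or a self-hosted server by changing the base URL; the URL, API key, and model name below are placeholders.

```python
# Deployment sketch: one OpenAI-style client, pointed at either a cloud API or a
# self-hosted OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder: cloud provider or your own server
    api_key="not-needed-for-local",       # placeholder: self-hosted servers often ignore the key
)

response = client.chat.completions.create(
    model="my-served-model",              # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize what inference is in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```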