What are Tokens in LLMs?
Tokens are the fundamental units of text that large language models process; they can represent whole words, subwords, individual characters, or punctuation. Tokenization converts text into these units before the model sees it, and token counts determine both context limits and API pricing.
Token Characteristics
Not Always Words
- "unhappy" → ["un", "happy"]
- "don't" → ["don", "'t"]
- Common words often = 1 token
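A quick way to see these splits is to run a tokenizer directly. The sketch below uses the tiktoken library with the `cl100k_base` encoding; the exact splits depend on the model's vocabulary, so treat the output as illustrative rather than fixed.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by several OpenAI chat models.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["unhappy", "don't", "the", "internationalization"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")

# Common short words usually map to a single token;
# rarer or longer words are split into subword pieces.
```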
Language Dependent
- English: ~4 characters/token
- Other languages: often need more tokens for the same content, especially non-Latin scripts
Special Tokens
- [BOS] - Beginning of sequence
- [EOS] - End of sequence
- [PAD] - Padding
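Special token names vary by model family: BERT uses [CLS]/[SEP]/[PAD], while many GPT-style models use <|endoftext|>. As a sketch, a Hugging Face tokenizer exposes its own special tokens; this assumes the transformers library is installed and the checkpoint (here bert-base-uncased, an arbitrary example) can be downloaded.

```python
# pip install transformers
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")

# Mapping of roles (cls, sep, pad, ...) to the literal token strings.
print(tok.special_tokens_map)

# Encoding a sentence shows the special tokens added around the text.
print(tok.tokenize("Tokens matter."))
print(tok.convert_ids_to_tokens(tok.encode("Tokens matter.")))
```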
Tokenization Methods
BPE (Byte-Pair Encoding)
- GPT models
- Learns common pairs
- Efficient vocabulary
WordPiece
- BERT models
- Similar to BPE
- "##" prefix marks subword continuations
SentencePiece
- Language-agnostic
- Unigram or BPE models
- Works on raw text
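To make the "learns common pairs" idea concrete, here is a minimal sketch that trains a tiny BPE tokenizer with the Hugging Face tokenizers library on a toy corpus; the vocabulary size, special tokens, and corpus are arbitrary illustration choices, not recommended settings.

```python
# pip install tokenizers
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

corpus = [
    "the unhappy user was unhappy with the unhelpful answer",
    "tokenization splits text into subword units",
]

# Start from an empty BPE model and split on whitespace before training.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer)

# Frequent pairs from the corpus merge into single tokens.
print(tokenizer.encode("unhappy users").tokens)
```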
Token Estimation
English Approximation
- ~750 words ≈ 1,000 tokens
- ~4 characters ≈ 1 token
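These ratios are rules of thumb for English text, but they allow a fast, dependency-free estimate. The helper below is a hypothetical sketch built directly on those two approximations.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text.

    Uses the ~4 characters/token and ~750 words/1,000 tokens heuristics
    and returns the larger of the two, erring on the high side.
    """
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return int(max(by_chars, by_words))

print(estimate_tokens("Tokens are the fundamental units of text."))
```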
Counting Tools
- OpenAI tokenizer
- tiktoken library
- HuggingFace tokenizers
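For exact counts, use the tokenizer that matches your model. A minimal sketch with tiktoken, assuming your installed tiktoken version recognizes the model name passed to encoding_for_model:

```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Return the exact token count for `text` under `model`'s encoding."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to a general-purpose encoding for unknown model names.
        enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))

print(count_tokens("How many tokens is this sentence?"))
```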
Why Tokens Matter
- Context limits: input and output tokens together must fit within the model's context window.
- Pricing: APIs charge per token, often at different rates for input and output.
- Performance: more tokens mean more processing time and memory.
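Because both context limits and billing are expressed in tokens, it is worth doing the arithmetic explicitly. The sketch below uses placeholder per-million-token prices; the dollar figures are hypothetical, not current rates for any provider.

```python
# Hypothetical per-million-token prices; check your provider's pricing page.
PRICE_PER_M_INPUT = 2.50
PRICE_PER_M_OUTPUT = 10.00

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated request cost in dollars for the placeholder prices above."""
    return (input_tokens * PRICE_PER_M_INPUT +
            output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000

# A 2,000-token prompt with a 500-token reply:
print(f"${estimate_cost(2_000, 500):.4f}")
```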
Best Practices
- Count tokens before API calls
- Optimize prompts for efficiency
- Consider token limits in design
- Monitor token usage
- Use efficient formatting
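Putting the first two practices together, here is a sketch of a pre-flight check that drops the oldest conversation turns until the prompt fits a token budget. The 8,000-token limit and the plain list-of-strings message format are assumptions for illustration, not any specific API's values.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def trim_to_budget(messages: list[str], max_tokens: int = 8_000) -> list[str]:
    """Drop the oldest messages until the total token count fits the budget."""
    kept = list(messages)
    while kept and sum(len(enc.encode(m)) for m in kept) > max_tokens:
        kept.pop(0)  # remove the oldest message first
    return kept

history = ["system prompt", "long user message ...", "assistant reply ..."]
print(len(trim_to_budget(history)))
```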