Definition

Token

A token is the atomic unit an LLM processes. It's a short piece of text — sometimes a whole word, often a subword fragment, sometimes a single character or symbol — produced by a tokenizer that splits input according to a learned vocabulary. The model sees tokens, not characters, and pricing, context limits, and rate limits are all measured in them.

Why it matters

When a provider says "200k context" they mean tokens, not characters. When they charge "$3 per million input tokens," that's the unit. A rough rule for English: 1 token ≈ 4 characters ≈ 0.75 words. Code can tokenize more densely (runs of operators and whitespace) or more sparsely (long identifiers that collapse into few tokens). Non-English languages often tokenize less efficiently — a paragraph of Chinese or Arabic may take 2-3× more tokens than the equivalent English.
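
The rule of thumb above can be turned into a quick estimator. This is a heuristic sketch only: the 4-chars-per-token divisor and the $3-per-million rate are the figures quoted above, not exact values for any particular model.

```python
def estimate_tokens(text: str) -> int:
    """Estimate token count via the ~4 chars/token rule of thumb (English)."""
    return -(-len(text) // 4)  # ceiling division

def estimate_cost_usd(text: str, usd_per_million_tokens: float = 3.0) -> float:
    """Estimate input cost at a given per-million-token rate."""
    return estimate_tokens(text) / 1_000_000 * usd_per_million_tokens

prompt = "Summarize the attached log file in three bullet points."
print(estimate_tokens(prompt), "tokens, approx $", estimate_cost_usd(prompt))
```

Real counts differ by model and language, so treat this as a budgeting aid, not a bill.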

For agentic coding, tokens are the currency you spend. Every file read, every tool output, every reasoning trace uses tokens from the context window and bills against the API. Efficient prompting and tight tool outputs save real money.

How it works

Most modern LLMs use byte-pair encoding (BPE) or a variant like SentencePiece. The tokenizer is trained on a large corpus: frequent sequences ("ing", "the", "print") become single tokens; rare ones get split into more pieces. Every model has its own vocabulary — Claude's tokenizer differs from GPT's, which differs from Qwen's — so token counts aren't directly comparable across providers.
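
A toy version of BPE training, under drastically simplified assumptions (character-level start, no byte fallback, no pre-tokenization — real tokenizers add all of these): repeatedly fuse the most frequent adjacent pair, exactly as described above.

```python
from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules: repeatedly fuse the most frequent adjacent pair."""
    # Start with each word as a sequence of single characters.
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the new merge rule everywhere.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if (w[i], w[i + 1]) == best:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

print(bpe_merges(["printing", "printer", "print"], 4))
```

On this tiny corpus the frequent fragment "print" assembles itself out of merges within a few steps, which is how common strings end up as single tokens in a trained vocabulary.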

Example (GPT-style tokenizer):

  • "hello" — 1 token
  • "hello world" — 2 tokens
  • "antidisestablishmentarianism" — 5-6 tokens
  • "SpaceSpider" — 3 tokens (Space, Sp, ider)
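
Splits like these can be mimicked with a greedy longest-match sketch over a hypothetical toy vocabulary (real vocabularies hold tens of thousands of entries, and real BPE encoding applies merge rules rather than pure longest-match):

```python
def encode(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenization against a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring first, fall back to a single character.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical toy vocabulary; real ones hold ~50k-200k entries.
vocab = {"hello", " world", "print", "ing", "Space", "Sp", "ider"}
print(encode("hello world", vocab))   # ['hello', ' world'] -> 2 tokens
print(encode("SpaceSpider", vocab))   # ['Space', 'Sp', 'ider'] -> 3 tokens
```

Anything not in the vocabulary degrades to single characters, which is why rare or invented words cost more tokens.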

Providers usually ship a count_tokens endpoint or SDK helper so you can estimate cost before sending.

How it's used

Practical token awareness:

  • Reading 10k lines of verbose log probably costs more than reading the 500 lines that matter
  • Minified code can use more tokens per character than formatted code, counterintuitively, because the tokenizer is trained mostly on conventionally spaced source
  • Emoji and exotic Unicode can explode token counts — avoid in prompts unless necessary
  • Caching long system prompts saves input tokens on every subsequent call

Related terms

  • LLM — the consumer of tokens
  • Context window — measured in tokens
  • Embedding — a vector representation derived from tokens, not tokens themselves
  • Hallucination — unrelated to tokenization, but another core LLM concept
  • RAG — reduces token cost by retrieving only what's needed

FAQ

How many tokens is my repo?

Roughly (total source chars) / 4. A 200k-line Python project might be 1-3M tokens, far above most windows — which is why embeddings and RAG exist.
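
That estimate can be sketched over a source tree in a few lines; the // 4 divisor is the heuristic from above, and the extension filter and error handling are illustrative assumptions.

```python
import os

def estimate_repo_tokens(root: str, exts: tuple[str, ...] = (".py",)) -> int:
    """Estimate total tokens in a source tree via the chars/4 rule of thumb."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(exts):
                path = os.path.join(dirpath, name)
                try:
                    with open(path, encoding="utf-8", errors="ignore") as f:
                        total_chars += len(f.read())
                except OSError:
                    continue  # unreadable file: skip it
    return total_chars // 4

# Usage: estimate_repo_tokens("path/to/repo", exts=(".py", ".ts"))
```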

Does the model "understand" tokens or characters?

Tokens. The model operates on token IDs rather than raw characters; every input it processes has already been tokenized. That is why models often struggle with character-level tasks like counting the letters in a word.
