the why

Why language models keep picking the same numbers.

The ranking shows what happens. This page explains why — the math of next-token prediction, the human fingerprint in the training data, and why a bigger model is usually a less random one.

001
The short answer

A language model is a probability machine for the next token. Given everything written so far, it computes a probability distribution over its vocabulary, then samples one token. The token with the highest probability wins more often than any other.

“Random” is the opposite of that. A real random number generator spreads probability uniformly across every option. A language model spreads it across whatever it saw in the training data — and the training data is mostly humans being non-random on purpose.

So when you ask for “a random number from 1 to 100,” the model returns the token that, statistically, is the most plausible next thing after that prompt — not a random one. That token is usually 7, 42, 47, or 73.

002
How an LLM picks the next token

A model produces a vector of scores (called logits), one per token in its vocabulary — tens of thousands of entries. Those logits get pushed through a softmax to become probabilities. Then a sampler picks one token.

logits         →  [ … , "47": 7.2, … , "50": 4.1, … , "73": 6.9, … ]
softmax        →  [ … , "47": 0.41, … , "50": 0.02, … , "73": 0.31, … ]
sample(temp=1) →  "47"  (about 41% of the time)

Notice what already went wrong: the model never had a uniform distribution to begin with. The softmax over the model's actual logits is heavily peaked at one or two tokens. If you sample from a peaky distribution, you get the peak most of the time. Temperature smooths the peak, but only so far (more on that in §005).

A real RNG starts with P(47) = P(50) = 1/100 and goes from there. The model starts with P(47) = 0.41 and there is no path from that to fair.
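The pipeline above can be sketched in a few lines of Python. This is a toy, not any real model's sampler: the four-token vocabulary and the logit values are made up for illustration, and `softmax` and `sample` are hypothetical helper names.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample(probs, rng):
    """Draw one token according to its probability (temperature 1)."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

# Made-up logits for a tiny vocabulary; a real model has tens of thousands of entries.
logits = {"47": 7.2, "73": 6.9, "42": 6.1, "50": 4.1}
probs = softmax(logits)

rng = random.Random(0)  # seeded only so the sketch is repeatable
draws = [sample(probs, rng) for _ in range(10_000)]
share_47 = draws.count("47") / len(draws)
share_50 = draws.count("50") / len(draws)

print(f"P(47)={probs['47']:.2f}  P(50)={probs['50']:.2f}")
print(f"observed share of 47: {share_47:.2f}, of 50: {share_50:.2f}")
```

Nothing in the sampler is broken. It draws faithfully from the distribution it is handed, and that distribution was never flat.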

003
The training data already loves 7 and 47

The web is full of sentences like “pick a number between 1 and 100” followed by a human pick. Stack Overflow answers, Reddit threads, math-puzzle posts, fiction, surveys, classroom exercises. The model sees thousands of those completions during training.

Humans, asked to pick a “random” number, overwhelmingly pick numbers that feel random, which is the opposite of being random. The famous result from Kubovy & Psotka (1976) is that 7 dominates the single-digit picks; in the 1–100 range, 37, 47, 67, and 73 dominate. They share four traits humans associate with “random looking”: odd, prime, not round, and not the obvious midpoint.

The model didn't invent its bias. It inherited it from us.

When the model later predicts the next token after “a random number from 1 to 100 is,” it's effectively interpolating over those thousands of human completions — and reproducing the same hump at 37 / 47 / 73.

004
“Random” in language is not statistical random

In everyday English, the word “random” means unexpected, surprising, off the wall. “That's so random” doesn't mean equiprobable-over-a-finite-set; it means “I didn't see that coming.”

The model learns the colloquial meaning, not the statistician's meaning. When it generates a “random” number, it picks one that looks surprising, not one drawn from a uniform distribution. A true RNG returns 50 exactly as often as any other value, yet 50 is one of the least likely model outputs, for exactly this reason.

005
Temperature helps less than you think

Temperature divides the logits before softmax. Higher temperature flattens the distribution; lower temperature sharpens it. At temperature=0 the model always picks the argmax token. As temperature→∞ it picks uniformly from the whole vocabulary, which would be uniform over tokens, not over 1–100.

temperature = 0.0  → "47" 100% of the time
temperature = 0.7  → "47" 78%, "73" 12%, "42" 6%, rest 4%
temperature = 1.0  → "47" 41%, "73" 31%, "42" 14%, rest 14%   ← API default
temperature = 1.5  → "47" 19%, "73" 18%, "42" 13%, rest 50%
temperature = 2.0  → distribution starts looking flat, but tokens like
                     "the", " a", "\n", " random" leak in — broken output

Two things kill the temperature trick:

  • Logits are already extreme. If the gap between “47” and “50” is 4 nats, temp=2 only halves it to 2 nats, so “47” is still e^2 ≈ 7.4× more likely than “50”: still very biased.
  • High temperature wrecks output. Long before the distribution over numbers flattens, the distribution over all tokens flattens. You start getting replies like “the random number 47?” or pure nonsense. The number tokens never get to be uniform without the rest of the output collapsing.
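The arithmetic behind the first point is short enough to check by hand. Softmax at temperature T turns a logit gap g into a probability ratio of exp(g / T). The sketch below uses the hypothetical 4-nat gap from the text; `odds_ratio` is an illustrative helper, not part of any sampling library.

```python
import math

def odds_ratio(logit_gap, temperature):
    """How many times likelier the favoured token is than the other one.

    Dividing logits by T before softmax shrinks a gap g to g / T,
    so the probability ratio between the two tokens is exp(g / T).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive; T=0 means argmax, not sampling")
    return math.exp(logit_gap / temperature)

gap = 4.0  # hypothetical gap in nats between "47" and "50"
for t in (0.7, 1.0, 1.5, 2.0):
    print(f"T={t}: '47' is {odds_ratio(gap, t):.1f}x more likely than '50'")
```

A fair draw needs a ratio of exactly 1, and exp(g / T) only approaches 1 as T grows without bound, which is the regime where the rest of the output has already fallen apart.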

Our experiments run at default temperature because that's what most real applications use. But even cranking it up doesn't rescue the model.

006
RLHF pushes the model further into a corner

Modern frontier models aren't just trained to predict the next token; they're also fine-tuned on preference feedback (RLHF, RLAIF, or DPO-style methods). Annotators reward outputs that feel helpful, confident, and on-topic. They punish hedging, randomness, and weirdness.

The side effect is mode collapse: the model's probability mass concentrates even harder on a few preferred outputs. That makes Claude Opus 4.7 a great assistant. It also makes it pick 73 ninety-eight times out of a hundred when you ask for a random number. The same training that makes it helpful makes it predictable.

You can see this directly in the data: across most model families, newer / better-trained versions are less random than older ones, not more. The bias gets worse with capability, because capability and concentration of probability mass go hand in hand.

007
Why exactly 42, 47, 73 — the cultural attractors

The hump in the distribution isn't at arbitrary numbers. It lands on specific cultural anchors that flood the training corpus:

  • 42: Douglas Adams's “answer to life, the universe, and everything,” quoted across countless pages of geek-leaning text on the web. Once a model has read that joke that often, “random number” pulls 42 out of the same learned associations that encode “Hitchhiker's Guide.”
  • 47: a Pomona College in-joke that writer Joe Menosky carried into Star Trek, where it appears as a deliberately recurring number across episodes, later series, and the entire fandom infrastructure around them. The signal is massive in pop-culture corpora.
  • 73: Sheldon Cooper's “the best number” speech from The Big Bang Theory: prime, mirror prime, and the product of its digits (21) is its own index among the primes. The clip is famous; the number shows up wherever someone tries to look mathematically clever.
  • 37: the classic psych result: when humans pick “a random number from 1 to 100,” 37 is statistically the most common answer. It's also the subject of a viral Veritasium video that's now part of the training data, making the loop self-reinforcing.

These four cultural attractors aren't mathematical properties of the model. They are properties of the internet. Train a model on a different planet's text and it would have different favourites.

008
Why Gemini Pro picks “4”

Both Google Gemini 3.1 Pro Preview and Gemini 2.5 Pro return 4 on 100% of our calls. That's unusual, and worth a paragraph of its own.

Three things are likely going on:

  • Default temperature may be lower. If Gemini's default sampling temperature is lower (around 0.4) than the OpenAI / Anthropic convention of ~1.0, even a small logit advantage produces effectively deterministic output. The argmax always wins.
  • Single-token answer. “4” is one token. “47” is one or two tokens depending on the tokenizer. When the model is uncertain and forced to commit, the shortest-but-still-plausible answer wins, and 4 is the smallest single-digit number that doesn't feel insultingly small.
  • Different cultural mix in training. Google's pretraining corpora include more East-Asian-language content than the Anglo-centric default. In Chinese internet culture, 4 is notable (homophone with “death”, often avoided but extremely frequently discussed). The model sees “the number 4” in vastly more contexts than the number itself would warrant.

The point: every model's favourite number is a fingerprint of its training pipeline, not a coincidence. You can read those choices like a forensic clue about how the model was built.

009
Bigger models are usually LESS random

The intuitive prediction — that smarter models will give better random numbers — is wrong. The data on this site consistently shows the opposite.

Why: capability in a language model means confidence in the next token. A bigger, better-trained model has tighter probability distributions, sharper peaks, and stronger pattern recognition. When it sees “random number from 1 to 100,” its representation of that prompt is richer, its match to the “humans pick 47 here” pattern is stronger, and it concentrates probability harder on the answer it expects.

A weaker model is more uncertain about everything — including which favourite to commit to. That uncertainty looks, accidentally, a little more like randomness. Our ranking reflects this: small or mid-size models occasionally beat the flagships on the randomness score, not because they're wiser, but because they're fuzzier.
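One way to put a number on “fuzzier” is the Shannon entropy of a model's answers. The sketch below is an assumption about how such a comparison could be made, not the scoring formula used for the ranking, and both sample lists are invented for illustration.

```python
import math
from collections import Counter

def entropy_bits(samples):
    """Empirical Shannon entropy of a list of answers, in bits."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

MAX_BITS = math.log2(100)  # entropy of a perfectly fair draw from 1..100

# Invented runs of 100 answers each, shaped like the two cases in the text.
collapsed = [73] * 98 + [47, 42]  # flagship-style: one favourite, almost always
fuzzy = [47] * 30 + [73] * 25 + [42] * 15 + [37] * 10 + list(range(1, 21))

print(f"collapsed: {entropy_bits(collapsed):.2f} bits")
print(f"fuzzy:     {entropy_bits(fuzzy):.2f} bits")
print(f"fair RNG:  {MAX_BITS:.2f} bits at most")
```

The fuzzy run scores higher, but it is still well short of the fair ceiling. More uncertainty is not the same as uniformity.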

010
Why thinking models don't help either

You might hope that a reasoning model, one that internally plans the answer with chain-of-thought before responding, would notice the trap and produce a real random number. It does not.

A thinking model uses extra compute to commit harder to the answer it was already going to give. Its chain-of-thought reads like “Pick a random number. 73 is prime, looks random, good choice.” The reasoning rationalises the bias rather than eliminating it. In our data, Qwen 3 Max Thinking always answers 42, and Kimi K2 Thinking concentrates on 73 the same way regular Kimi does.

Reasoning doesn't produce randomness. Reasoning produces justified non-randomness.

011
Can you fix this with prompt engineering?

Sort of, but not really.

  • “Use a real random number generator.” The model has no RNG. It writes a plausible-looking number. Saying this in the prompt makes no difference, except occasionally the model writes “a random number is 73” with a little more confidence.
  • “Avoid 7, 37, 47, 73.” Now the model picks 19, 23, or 67. The shape of the bias moves; the bias doesn't go away. We've tested this.
  • “Roll a virtual 100-sided die.” The model writes “rolled: 73.” Same problem.
  • Tool use (function calling) → crypto.randomInt. This actually works, because it punts the work to a real RNG and uses the model only to glue prompt ↔ tool. If you have to use an LLM in the loop, this is the only correct pattern.
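The tool-use pattern in the last bullet can be sketched as follows. This is a Python analogue of the crypto.randomInt call, using the standard library's `secrets` module. The tool description and the `handle_tool_call` dispatcher are hypothetical: real function-calling schemas and call formats differ by provider.

```python
import secrets

def random_int(low: int, high: int) -> int:
    """Return an integer in [low, high], drawn from the OS entropy source."""
    if low > high:
        raise ValueError("low must not exceed high")
    return low + secrets.randbelow(high - low + 1)

# Hypothetical tool description; the exact schema depends on the provider's API.
RANDOM_INT_TOOL = {
    "name": "random_int",
    "description": "Return a uniformly random integer between low and high, inclusive.",
    "parameters": {
        "type": "object",
        "properties": {"low": {"type": "integer"}, "high": {"type": "integer"}},
        "required": ["low", "high"],
    },
}

def handle_tool_call(name: str, arguments: dict) -> int:
    """Run a tool call requested by the model (the call format is assumed)."""
    if name != "random_int":
        raise KeyError(f"unknown tool: {name}")
    return random_int(int(arguments["low"]), int(arguments["high"]))

# The model only decides *that* a number is needed; the OS decides *which* one.
print(handle_tool_call("random_int", {"low": 1, "high": 100}))
```

The model's job shrinks to recognising the request and filling in the bounds. The number itself never passes through a softmax.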

See the methodology page for the per-language list of correct RNG calls.

012
What this means for builders

The reason this matters isn't academic. People are shipping code right now that gets a “random” choice from an LLM — for token generation, A/B split assignment, test data, lotteries, security-adjacent flows. That code is broken in a way that doesn't show up in tests, because a biased number generator still generates numbers. It just generates the wrong ones.

The fix is to internalise one rule: an LLM is a linguistic plausibility engine, not a source of entropy. If you want entropy, ask the operating system. Every language ships the right API. Use it.
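A test that would catch the problem has to look at the distribution, not at individual values. Here is a minimal sketch using a hand-rolled chi-square statistic against a uniform expectation. The biased sample is invented, and the threshold is an approximation: for 99 degrees of freedom the 0.1% critical value is roughly 148, so a fair source should only rarely exceed 150.

```python
import secrets
from collections import Counter

def chi_square_uniform(samples, low, high):
    """Chi-square statistic of samples against a uniform distribution on [low, high]."""
    k = high - low + 1
    expected = len(samples) / k
    counts = Counter(samples)
    return sum((counts.get(v, 0) - expected) ** 2 / expected
               for v in range(low, high + 1))

THRESHOLD = 150.0  # approximate cut-off for 99 degrees of freedom (an assumption)

# A fair source: the operating system's entropy pool.
fair = [1 + secrets.randbelow(100) for _ in range(10_000)]

# An invented LLM-style source: every value is a valid number from 1 to 100.
biased = [47] * 410 + [73] * 310 + [42] * 140 + [37] * 140

fair_stat = chi_square_uniform(fair, 1, 100)
biased_stat = chi_square_uniform(biased, 1, 100)

print(f"fair:   {fair_stat:.1f}")
print(f"biased: {biased_stat:.1f}")
```

Every value in the biased list would pass a range check. Only the aggregate statistic shows that something is wrong.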

013
Further reading
