The ranking shows what happens. This page explains why — the math of next-token prediction, the human fingerprint in the training data, and why a bigger model is usually a less random one.
A language model is a probability machine for the next token. Given everything written so far, it computes a probability distribution over its vocabulary, then samples one token. The token with the highest probability wins most of the time.
“Random” is the opposite of that. A real random number generator spreads probability uniformly across every option. A language model spreads it across whatever it saw in the training data — and the training data is mostly humans being non-random on purpose.
So when you ask for “a random number from 1 to 100,” the model returns the token that, statistically, is the most plausible next thing after that prompt — not a random one. That token is usually 7, 42, 47, or 73.
A model produces a vector of scores (called logits), one per token in its vocabulary — tens of thousands of entries. Those logits get pushed through a softmax to become probabilities. Then a sampler picks one token.
logits → [ … , "47": 7.2, … , "50": 4.1, … , "73": 6.9, … ]
softmax → [ … , "47": 0.41, … , "50": 0.02, … , "73": 0.31, … ]
sample(temp=1) → "47" (40% of the time)
Notice what already went wrong: the model never had a uniform distribution to begin with. The softmax over the model's actual logits is heavily peaked at one or two tokens. If you sample from a peaky distribution, you get the peak most of the time. Temperature smooths the peak, but only so far (more on that in §005).
A real RNG starts with P(47) = P(50) = 1/100 and goes from there. The model starts with P(47) = 0.41 and there is no path from that to fair.
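To make the contrast concrete, here is a minimal sketch of the logits-to-probabilities step. The three logits for "47", "73", and "50" come from the example above; the score of 2.0 given to every other number is an assumption chosen purely for illustration, not a measured value.

```python
import math

def softmax(logits):
    """Convert a mapping of token -> logit into token -> probability."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical logits: three values from the example above; every other
# number from 1 to 100 gets an assumed low score of 2.0.
logits = {str(n): 2.0 for n in range(1, 101)}
logits.update({"47": 7.2, "73": 6.9, "50": 4.1})

model_probs = softmax(logits)
uniform_prob = 1 / 100  # what a fair generator assigns to every number

for token in ("47", "73", "50"):
    print(token, round(model_probs[token], 3), "vs uniform", uniform_prob)
```

Whatever the filler score is, the ratio between two tokens depends only on the gap between their logits, so a gap of a few points is enough to put most of the mass on one or two numbers.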
The web is full of sentences like “pick a number between 1 and 100” followed by a human pick. Stack Overflow answers, Reddit threads, math-puzzle posts, fiction, surveys, classroom exercises. The model sees thousands of those completions during training.
Humans, asked to pick a “random” number, overwhelmingly pick numbers that feel random, which is the opposite of being random. The famous result from Kubovy & Psotka (1976) is that 7 dominates the single-digit picks; in the 1–100 range, 37, 47, 67, and 73 dominate. They share four traits humans associate with “random looking”: odd, prime, not round, and far from the obvious midpoint.
When the model later predicts the next token after “a random number from 1 to 100 is,” it's effectively interpolating over those thousands of human completions — and reproducing the same hump at 37 / 47 / 73.
In everyday English, the word “random” means unexpected, surprising, off the wall. “That's so random” doesn't mean equiprobable-over-a-finite-set; it means “I didn't see that coming.”
The model learns the colloquial meaning, not the statistician's meaning. When it generates a “random” number, it picks one that looks surprising, not one drawn from a uniform distribution. A true RNG returns 50 exactly as often as any other value; the model treats it as one of the least likely outputs, for exactly this reason.
Temperature divides the logits before softmax. Higher temperature flattens the distribution; lower temperature sharpens it. At temperature=0 the model always picks the argmax token. As temperature → ∞ it picks uniformly from the vocabulary, which would be uniform over tokens, not over 1–100.
temperature = 0.0 → "47" 100% of the time
temperature = 0.7 → "47" 78%, "73" 12%, "42" 6%, rest 4%
temperature = 1.0 → "47" 41%, "73" 31%, "42" 14%, rest 14% ← API default
temperature = 1.5 → "47" 19%, "73" 18%, "42" 13%, rest 50%
temperature = 2.0 → distribution starts looking flat, but tokens like
"the", " a", "\n", " random" leak in (broken output)
Two things kill the temperature trick:
e^2 ≈ 7.4×, still very biased.
Our experiments run at default temperature because that's what 99% of real applications use. But even cranking it up doesn't rescue the model.
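The mechanism can be sketched in a few lines. Dividing logits by a temperature T means two tokens whose logits differ by a gap g end up with a probability ratio of exp(g / T). The function names here are hypothetical, and the gap of 3.1 is simply the difference between the "47" and "50" logits in the earlier example; the percentages in the table above are illustrative and this sketch does not try to reproduce them.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T=0 means argmax")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def odds_ratio(logit_gap, temperature):
    """Probability ratio between two tokens whose logits differ by logit_gap."""
    return math.exp(logit_gap / temperature)

# Gap between the "47" and "50" logits from the earlier example.
gap = 7.2 - 4.1
for t in (0.7, 1.0, 1.5, 2.0):
    print(f"T={t}: '47' is {odds_ratio(gap, t):.1f}x as likely as '50'")
```

Raising the temperature shrinks the ratio but never brings it to 1 at any finite value, which is why the favourite stays the favourite.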
Modern frontier models aren't just trained to predict the next token; they're also fine-tuned on preference feedback (RLHF, DPO, RLAIF). Annotators reward outputs that feel helpful, confident, and on-topic. They punish hedging, randomness, and weirdness.
The side effect is mode collapse: the model's probability mass concentrates even harder on a few preferred outputs. That makes Claude Opus 4.7 a great assistant. It also makes it pick 73 ninety-eight times out of a hundred when you ask for a random number. The same training that makes it helpful makes it predictable.
You can see this directly in the data: across most model families, newer / better-trained versions are less random than older ones, not more. The bias gets worse with capability, because capability and concentration of probability mass are the same thing.
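One way to put a number on "concentration of probability mass" is the Shannon entropy of a batch of answers. This is a minimal sketch, not the scoring method used for the ranking: the function name is hypothetical, and the collapsed sample assumes 98 picks of 73 plus two other made-up picks.

```python
import math
from collections import Counter

def shannon_entropy_bits(samples):
    """Entropy, in bits, of the empirical distribution of the samples."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Assumed sample: 73 ninety-eight times, plus two invented stragglers.
collapsed = [73] * 98 + [47, 42]
# Idealised fair sample: every value from 1 to 100 exactly once.
even = list(range(1, 101))

max_bits = math.log2(100)  # upper bound for 100 equally likely outcomes
print(f"collapsed: {shannon_entropy_bits(collapsed):.2f} bits")
print(f"even:      {shannon_entropy_bits(even):.2f} bits (max {max_bits:.2f})")
```

A fair draw over 100 values approaches log2(100) bits, while a collapsed model sits close to zero.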
The hump in the distribution isn't at arbitrary numbers. It lands on specific cultural anchors that flood the training corpus:
These four cultural attractors aren't mathematical properties of the model. They are properties of the internet. Train a model on a different planet's text and it would have different favourites.
Both Google Gemini 3.1 Pro Preview and Gemini 2.5 Pro return 4 on 100% of our calls. That's unusual, and worth a paragraph of its own.
Three things are likely going on:
The point: every model's favourite number is a fingerprint of its training pipeline, not a coincidence. You can read those choices like a forensic clue about how the model was built.
The intuitive prediction — that smarter models will give better random numbers — is wrong. The data on this site consistently shows the opposite.
Why: capability in a language model means confidence in the next token. A bigger, better-trained model has tighter probability distributions, sharper peaks, and stronger pattern recognition. When it sees “random number from 1 to 100,” its representation of that prompt is richer, its match to the “humans pick 47 here” pattern is stronger, and it concentrates probability harder on the answer it expects.
A weaker model is more uncertain about everything — including which favourite to commit to. That uncertainty looks, accidentally, a little more like randomness. Our ranking reflects this: small or mid-size models occasionally beat the flagships on the randomness score, not because they're wiser, but because they're fuzzier.
You might hope that a reasoning model (one that internally plans the answer with chain-of-thought before responding) would notice the trap and produce a real random number. It does not.
A thinking model uses extra compute to commit harder to the answer it was already going to give. Its chain-of-thought reads like “Pick a random number. 73 is prime, looks random, good choice.” The reasoning rationalises the bias rather than eliminating it. In our data, Qwen 3 Max Thinking always answers 42, and Kimi K2 Thinking concentrates on 73 the same way regular Kimi does.
Reasoning doesn't produce randomness. Reasoning produces justified non-randomness.
Sort of, but not really.
crypto.randomInt. This actually works, because it punts the work to a real RNG and uses the model only to glue prompt ↔ tool. If you have to use an LLM in the loop, this is the only correct pattern.
See the methodology page for the per-language list of correct RNG calls.
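The tool pattern described above might look roughly like this in Python, where `secrets.randbelow` plays the role that `crypto.randomInt` plays in Node.js. The tool name, the dispatcher, and the shape of the call dictionary are all assumptions for illustration; real tool-calling APIs differ by provider.

```python
import secrets

def random_int_tool(low, high):
    """Return a uniform integer in [low, high] from the OS entropy source."""
    if low > high:
        raise ValueError("low must not exceed high")
    # secrets.randbelow(n) returns an integer in [0, n).
    return low + secrets.randbelow(high - low + 1)

# Hypothetical registry of tools the model is allowed to request.
TOOLS = {"random_int": random_int_tool}

def handle_tool_call(call):
    """Run a tool call; the dict shape {"name", "arguments"} is assumed."""
    return TOOLS[call["name"]](**call["arguments"])

# The model only decides *that* a number is needed and what the range is;
# the value itself never comes from the model.
result = handle_tool_call(
    {"name": "random_int", "arguments": {"low": 1, "high": 100}}
)
print(result)
```

The model's job shrinks to translating the request into a range; the operating system supplies the number.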
The reason this matters isn't academic. People are shipping code right now that gets a “random” choice from an LLM — for token generation, A/B split assignment, test data, lotteries, security-adjacent flows. That code is broken in a way that doesn't show up in tests, because a biased number generator still generates numbers. It just generates the wrong ones.
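A short sketch shows why ordinary tests miss the problem. A range check passes on heavily biased output, while a chi-square goodness-of-fit statistic against the uniform distribution flags it immediately. The biased sample below is invented for illustration, and the function names are hypothetical. For 100 categories (99 degrees of freedom), the 5% critical value is roughly 123.

```python
from collections import Counter

def in_range_check(samples, low=1, high=100):
    """The kind of test that usually exists: are all values in range?"""
    return all(low <= s <= high for s in samples)

def chi_square_uniform(samples, low=1, high=100):
    """Chi-square statistic of the samples against a uniform distribution."""
    counts = Counter(samples)
    expected = len(samples) / (high - low + 1)
    return sum(
        (counts.get(value, 0) - expected) ** 2 / expected
        for value in range(low, high + 1)
    )

# Invented biased sample of 1000 draws, concentrated on a few favourites.
biased = [47] * 410 + [73] * 310 + [42] * 140 + [7] * 40 + list(range(1, 101))
# Idealised even sample: every value exactly ten times.
even = list(range(1, 101)) * 10

print("biased passes range check:", in_range_check(biased))
print("biased chi-square:", round(chi_square_uniform(biased)))
print("even chi-square:", chi_square_uniform(even))
```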
The fix is to internalise one rule: an LLM is a linguistic plausibility engine, not a source of entropy. If you want entropy, ask the operating system. Every language ships the right API. Use it.