Read this once and you'll understand every chart on the site, what the scores mean, and where the limits of the experiment are. No PhD required.
Language models are trained to be predictable: they assign probabilities to the next token and pick from the top of the distribution. That's how they work. “Random” is the opposite of that. So when you ask a model for “a random number from 1 to 100,” you get the model's favourite number, not a random one.
Lots of people have noticed individual examples of this (GPT keeps picking 42; Claude likes 47). This site puts numbers on it across every major model and language, refreshes the data, and lets you explore the bias yourself.
Every model gets the same English prompt:
"Pick a random number from 1 to 100. Reply with only the number — no words, no thinking, no explanation, just the digits."
The trailing “no words, no thinking, no explanation” clause exists because the shorter v1 prompt let too many models return reasoning prose instead of a digit (especially reasoning-tuned ones). Those answers were dropped as format failures and quietly distorted the dataset. The current wording reduces format failures to near-zero on non-reasoning models without nudging the model toward any particular number.
Why only English? We ran a probe of 50 samples × 4 languages (English, Spanish, Chinese and Arabic, chosen to span Latin, Han, and Arabic-Indic scripts) on five production models. The bias survives translation, sometimes literally: Claude Haiku 4.5 picks 42 in every language (47/50 on Chinese, 45/50 on Arabic). Llama 3.3 70B picks 53 in every language (49/50 on English and Spanish). Gemini, DeepSeek and Qwen change their top pick per language but always select from the same small pool of “AI favourites” {37, 42, 47, 53, 57, 67, 73, 74}. The end-in-7 fixation holds across all three scripts at ~35% on average (uniform would be 10%). The bias is parametric: it lives in the weights, not the prompt language. So we keep the public dataset English-only: it's the cheapest dimension to scale, and the multilingual run confirmed nothing interesting would change if we widened it.
Each active model gets between 100 and 200 recorded answers on the English prompt. Every call is a fresh request (we do not use OpenRouter's prompt caching), so the model sees the prompt cold every time and can't reuse an earlier completion to bias the answer.
We add new models as they ship on OpenRouter, usually within days of release. When a model gets a new version we re-batch the new ID from scratch and freeze the previous one as a historical snapshot you can still browse on its model page. The curated list lives in models.yaml in our repo; if a model you care about isn't there, open an issue.
We call models through OpenRouter so one API hits every provider. When a model has a free OpenRouter variant we use it; otherwise we pay per call.
We do not tell the model to behave uniformly, do not use chain-of-thought, and do not retry until it returns something we like. We want to measure default behaviour, not coax it.
If the model replies with something that isn't a 1–100 integer (e.g. it writes “Sure! Here's your number: forty-seven” or returns reasoning instead of an answer), we retry once. If it still fails, we record the response as a format failure and exclude it from the numeric stats. The failure rate is itself a data point.
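The parse-then-retry rule above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual code: the function names are hypothetical, `ask` stands in for whatever makes the API call, and the strictness of the check (digits only, nothing else) is an assumption.

```python
import re
from typing import Callable, Optional


def parse_answer(reply: str) -> Optional[int]:
    """Return the reply as an integer in 1..100, or None if it is not a bare number."""
    text = reply.strip()
    # Assumption: only a reply made of ASCII digits counts as valid.
    if not re.fullmatch(r"[0-9]{1,3}", text):
        return None
    value = int(text)
    return value if 1 <= value <= 100 else None


def sample_once(ask: Callable[[], str], max_attempts: int = 2) -> Optional[int]:
    """Call `ask` up to max_attempts times (one retry by default).

    Returns the parsed number, or None to signal a format failure.
    """
    for _ in range(max_attempts):
        value = parse_answer(ask())
        if value is not None:
            return value
    return None
```

A `None` result would be counted toward the failure rate and left out of the numeric stats.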
Reasoning models (o1-class, *-pro, *thinking, *reasoning) are the exception to the rules above. They get max_tokens = 2000 and a single shot — no retry. These models burn most of the budget on hidden chain-of-thought before producing any visible text, so a 16-token cap returns an empty string and a retry just doubles the cost without changing the outcome. We do not explicitly turn reasoning on or off; we accept whatever the provider serves by default for that model ID.
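One way to picture the two call policies is a small lookup keyed on the model ID. Everything here is a hedged sketch: the glob patterns are hypothetical stand-ins for the reasoning-model rule, the example IDs in the usage note are made up, and the 16-token cap with one retry for ordinary models is inferred from the text above rather than taken from the site's code.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class CallPolicy:
    max_tokens: int
    max_attempts: int  # 1 = single shot, 2 = one retry


# Hypothetical patterns approximating "o1-class, *-pro, *thinking, *reasoning".
REASONING_PATTERNS = ("*/o1*", "*-pro", "*thinking", "*reasoning")


def policy_for(model_id: str) -> CallPolicy:
    """Pick the token budget and attempt count for a model ID."""
    if any(fnmatchcase(model_id, pattern) for pattern in REASONING_PATTERNS):
        # Reasoning models: large budget, no retry.
        return CallPolicy(max_tokens=2000, max_attempts=1)
    # Assumed default for everything else: short cap, one retry.
    return CallPolicy(max_tokens=16, max_attempts=2)
```

For example, `policy_for("example/model-thinking")` would give the 2000-token single-shot policy, while `policy_for("example/chat-model")` would fall through to the default.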
A truly uniform random pick of 1–100 should land each number about 1% of the time. If you draw a histogram of 1,000 truly uniform picks, you see a fuzzy flat line around the 1% mark.
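To see that fuzzy flat line for yourself, here is a short sketch that draws 1,000 picks from the operating system's RNG and tallies them. The variable names are illustrative; only `secrets.randbelow` and `collections.Counter` are standard-library calls.

```python
import secrets
from collections import Counter


def uniform_picks(n: int) -> list[int]:
    """Draw n integers uniformly from 1..100 using the OS cryptographic RNG."""
    return [secrets.randbelow(100) + 1 for _ in range(n)]


picks = uniform_picks(1000)
counts = Counter(picks)

# Share of picks for every number, including numbers that never came up.
shares = {number: counts.get(number, 0) / len(picks) for number in range(1, 101)}
```

Each entry in `shares` should hover around 0.01, with some numbers a little above and some a little below.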
The chi-square test compares the model's observed counts to that ideal flat line and asks: how surprising would the gap be if the model were actually uniform? It returns a p-value: the probability of seeing a gap this big by pure luck if the null hypothesis (uniform) were true.
Every single model on this site lands in the p < 0.001 camp. Not one model has a p-value that would let it pass even a generous test of uniformity.
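For readers who want to reproduce the test, the sketch below computes the chi-square statistic against a flat 1–100 distribution using only the standard library. The p-value here comes from the Wilson-Hilferty normal approximation, which is a stand-in chosen to avoid extra dependencies and not necessarily what the site uses; a statistics library such as SciPy (`scipy.stats.chisquare`) would give the exact tail probability.

```python
import math
from collections import Counter


def chi_square_uniform(answers, low=1, high=100):
    """Return (statistic, degrees_of_freedom) against a uniform low..high draw."""
    n = len(answers)
    k = high - low + 1
    expected = n / k  # the flat line: every number expected equally often
    counts = Counter(answers)
    statistic = sum(
        (counts.get(value, 0) - expected) ** 2 / expected
        for value in range(low, high + 1)
    )
    return statistic, k - 1


def approx_p_value(statistic, df):
    """Approximate upper-tail p-value (Wilson-Hilferty approximation)."""
    h = 2.0 / (9.0 * df)
    z = ((statistic / df) ** (1.0 / 3.0) - (1.0 - h)) / math.sqrt(h)
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

A model that answers 47 a hundred times in a row produces a statistic of 9900 on 99 degrees of freedom, and the approximate p-value is indistinguishable from zero.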
Entropy is the second number we track because chi-square alone can't tell you how concentrated the bias is. A model that always picks 47 fails chi-square. A model that splits between 47 and 73 also fails chi-square. Entropy distinguishes the two.
Measured in bits, Shannon entropy is the average number of yes/no questions you'd need to ask to identify which number the model picked. Perfectly uniform 1–100 has entropy log₂(100) ≈ 6.64 bits. A model that always picks the same number has entropy 0 bits (no information: you already know the answer).
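As a rough illustration of the definition, the following sketch computes Shannon entropy in bits from a list of answers. The function name is hypothetical; the formula is the standard one.

```python
import math
from collections import Counter


def shannon_entropy_bits(answers):
    """Shannon entropy of the empirical answer distribution, in bits."""
    n = len(answers)
    counts = Counter(answers)
    # Sum p * log2(1/p) over the numbers that actually appeared.
    return sum((c / n) * math.log2(n / c) for c in counts.values())
```

A model that always answers 47 scores 0 bits, an even split between 47 and 73 scores exactly 1 bit, and one answer of each number from 1 to 100 reaches the log₂(100) ceiling.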
The big amber number you see on every model page is our friendly summary. It combines the chi-square p-value (penalises bias) with the normalised entropy (rewards spread):
score = 0.5 × (p_value + entropy / log2(100)) × 100
A score of 100 is a perfect RNG. A score of 0 is a model that picks one number every time. Real LLMs land between 5 and 50. Even the best ones are dramatically biased compared to a coin flip.
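The score formula translates directly into code. This is a sketch with hypothetical names; the inputs are the chi-square p-value and the entropy in bits described earlier.

```python
import math

MAX_ENTROPY_BITS = math.log2(100)  # entropy of a perfectly uniform 1..100 draw


def randomness_score(p_value: float, entropy_bits: float) -> float:
    """Combine the p-value and normalised entropy into a 0..100 score."""
    normalised_entropy = entropy_bits / MAX_ENTROPY_BITS
    return 0.5 * (p_value + normalised_entropy) * 100
```

Because every model on the site has a p-value below 0.001, the first term contributes almost nothing in practice, so a model's score is roughly 50 times its normalised entropy. That is consistent with real LLMs landing between 5 and 50.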
Of the ten possible last digits (0–9), each one should appear about 10% of the time under a uniform draw. Models systematically over-pick the digit 7 (often 25–40% of all answers end in 7) and under-pick 0 and 5 (round-looking numbers feel “non-random” to humans, and the training data reflects that).
This is one of the most robust findings in the dataset: it shows up across model families, across languages, and across the cheap and frontier tiers. It's also the easiest insight to share: most AIs have an obsession with 7.
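Checking the last-digit pattern on your own sample takes only a few lines. The sketch below uses a hypothetical helper name, and the sample in the usage note is made up for illustration, not taken from the dataset.

```python
from collections import Counter


def last_digit_shares(answers):
    """Return the share of answers ending in each digit 0..9."""
    n = len(answers)
    counts = Counter(answer % 10 for answer in answers)
    return {digit: counts.get(digit, 0) / n for digit in range(10)}
```

For the invented sample `[37, 47, 57, 42, 73, 7, 67, 50, 17, 88]`, the share for digit 7 is 0.6 and the share for digit 5 is 0.0, against the 0.1 a uniform draw would give each digit.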
The set {10, 20, 30, 40, 50, 60, 70, 80, 90, 100} would, under a uniform draw, take 10% of the responses. Models give it roughly 2–5%. They've learned that humans flag round numbers as “not random,” and they mirror that prior.
This is the most counterintuitive finding for non-statisticians: a truly random RNG picks 50 just as often as it picks 47. The model actively avoids 50.
For models with an optional reasoning mode (such as the *thinking variants) we do not toggle it explicitly: we use the provider's default mode for each model ID. Our score for openai/gpt-5 reflects whatever GPT-5 does when you call it with no special flags, which is the realistic case for production code.
system/dev-urandom is generated with Python's secrets.randbelow(100) + 1, which delegates to your operating system's cryptographic RNG. It sits in the ranking as the “what unbiased looks like” reference. Every LLM is measured against it.
This site looks like a joke. It isn't. Treating an LLM as a source of randomness is a real bug pattern that ships in production code every day, especially with the rise of vibe-coding and AI-generated scripts.
The fix is always the same: use the right tool. The next section tells you what that is.
Every modern language ships a cryptographically secure RNG backed by your operating system's entropy pool (Linux getrandom(2), macOS arc4random, Windows BCryptGenRandom). Use it.
Python: secrets.randbelow(100) + 1 · 1..100 uniform, crypto-strong
JavaScript (browser): crypto.getRandomValues(new Uint32Array(1)) then modulo, or use the rejection-sampling pattern to avoid bias
Node.js: crypto.randomInt(1, 101) · built-in, unbiased
Go: crypto/rand.Int(rand.Reader, big.NewInt(100)) · returns 0..99, so add 1 for 1..100
Rust: rand::rngs::OsRng.gen_range(1..=100)
Java: SecureRandom.getInstanceStrong().nextInt(100) + 1
floor(random() * 100) + 1 · not crypto-grade, fine for non-security
Two practical notes:
randomInt() % 100 on a non-power-of-two range introduces a small but real bias. Built-ins like crypto.randomInt and secrets.randbelow already handle this with rejection sampling.
Don't use Math.random() for security. It's a fast statistical RNG, not a cryptographic one. Fine for dice rolls in a game, fatal for session tokens.
If you absolutely have to use an LLM (e.g. you're generating plausibly-human adversarial test data), at least raise temperature (≥ 1.3 for OpenAI/Anthropic chat models), include a system prompt that explicitly instructs uniform sampling, and verify the output distribution against this site. Don't trust it.
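The modulo-bias note above is easier to see with a worked sketch. A 32-bit word has 2^32 possible values, and 2^32 leaves a remainder of 96 when divided by 100, so a plain modulo makes residues 0..95 very slightly more likely than 96..99. Rejection sampling discards the leftover words. The helper below is illustrative only (its name and the `bits` parameter are assumptions); in real Python code `secrets.randbelow` already does the right thing.

```python
import secrets


def unbiased_below(n: int, bits: int = 32) -> int:
    """Uniform integer in 0..n-1 built from `bits`-bit random words."""
    span = 1 << bits
    if not 0 < n <= span:
        raise ValueError("n must be between 1 and 2**bits")
    # Largest multiple of n that fits in the word range; words at or above
    # this limit would over-represent the low residues, so they are redrawn.
    limit = span - (span % n)
    while True:
        word = secrets.randbits(bits)
        if word < limit:
            return word % n


pick = unbiased_below(100) + 1  # 1..100
```

The effect is much more visible with small words: taking every 8-bit value modulo 100 hits residues 0..55 three times each but residues 56..99 only twice.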