methodology

How we test whether an AI can actually pick a random number.

Read this once and you'll understand every chart on the site, what the scores mean, and where the limits of the experiment are. No PhD required.

001
The thesis

The thesis

Language models are trained to be predictable: they assign probabilities to the next token and pick from the top of the distribution. That's how they work. “Random” is the opposite of that. So when you ask a model for “a random number from 1 to 100,” you get the model's favourite number, not a random one.

Lots of people have noticed individual examples of this (GPT keeps picking 42; Claude likes 47). This site puts numbers on it across every major model and language, refreshes the data, and lets you explore the bias yourself.

002
The prompt

The prompt

Every model gets the same English prompt:

"Pick a random number from 1 to 100. Reply with only the number — no words, no thinking, no explanation, just the digits."

The trailing “no words, no thinking, no explanation” clause exists because the shorter v1 prompt let too many models return reasoning prose instead of a digit (especially reasoning-tuned ones). Those answers were dropped as format failures and quietly distorted the dataset. The current wording reduces format failures to near-zero on non-reasoning models without nudging the model toward any particular number.

Why only English? We ran a probe of 50 samples × 4 languages (English, Spanish, Chinese, Arabic — chosen to span Latin, Han, and Arabic-Indic scripts) on five production models. The bias survives translation, sometimes literally: Claude Haiku 4.5 picks 42 in every language (47/50 on Chinese, 45/50 on Arabic). Llama 3.3 70B picks 53 in every language(49/50 on English and Spanish). Gemini, DeepSeek and Qwen change their top pick per language but always select from the same small pool of “AI favourites” {37, 42, 47, 53, 57, 67, 73, 74}. The end-in-7 fixation holds across all three scripts at ~35% on average (uniform would be 10%). The bias is parametric— it lives in the weights, not the prompt language. So we keep the public dataset English-only: it's the cheapest dimension to scale, and the multilingual run confirmed nothing interesting would change if we widened it.

003
Sampling

Sampling

Each active model gets between 100 and 200 recorded answers on the English prompt. Every call is a fresh request — we do notuse OpenRouter's prompt caching — so the model sees the prompt cold every time and can't reuse an earlier completion to bias the answer.

We add new models as they ship on OpenRouter, usually within days of release. When a model gets a new version we re-batch the new ID from scratch and freeze the previous one as a historical snapshot you can still browse on its model page. The curated list lives in models.yamlin our repo; if a model you care about isn't there, open an issue.

We call models through OpenRouter so one API hits every provider. When a model has a free OpenRouter variant we use it; otherwise we pay per call.

004
Call parameters

Call parameters

temperaturemodel default (we want the realistic distribution, not a tuned one)
max_tokens16 (the answer is a 1–3 digit number; we cap to control cost)
system promptnone
seednone (we explicitly do not pin random seeds — we want the spread)

We do not tell the model to behave uniformly, do not use chain-of-thought, and do not retry until it returns something we like. We want to measure default behaviour, not coax it.

If the model replies with something that isn't a 1–100 integer (e.g. it writes “Sure! Here's your number: forty-seven” or returns reasoning instead of an answer), we retry once. If it still fails, we record the response as a format failure and exclude it from the numeric stats. The failure rate is itself a data point.

Reasoning models (o1-class, *-pro, *thinking, *reasoning) are the exception to the rules above. They get max_tokens = 2000 and a single shot — no retry. These models burn most of the budget on hidden chain-of-thought before producing any visible text, so a 16-token cap returns an empty string and a retry just doubles the cost without changing the outcome. We do not explicitly turn reasoning on or off; we accept whatever the provider serves by default for that model ID.

005
Chi-square test, in plain English

Chi-square test, in plain English

A truly uniform random pick of 1–100 should land each number about 1% of the time. If you draw a histogram of 1,000 truly uniform picks, you see a fuzzy flat line around the 1% mark.

The chi-square test compares the model's observed counts to that ideal flat line and asks: how surprising would the gap be if the model were actually uniform? It returns a p-value— the probability of seeing a gap this big by pure luck if the null hypothesis (uniform) were true.

if p-value > 0.05: the model's output is consistent with random. We cannot reject uniform.
if p-value < 0.001: the model's output is wildly inconsistent with random. We confidently reject uniform.

Every single model on this site lands in the p < 0.001 camp. Not one model has a p-value that would let it pass even a generous test of uniformity.

006
Shannon entropy

Shannon entropy

Entropy is the second number we track because chi-square alone can't tell you how concentrated the bias is. A model that always picks 47 fails chi-square. A model that splits between 47 and 73 also fails chi-square. Entropy distinguishes the two.

Measured in bits, Shannon entropy is the average number of yes/no questions you'd need to ask to identify which number the model picked. Perfectly uniform 1–100 has entropy log₂(100) ≈ 6.64 bits. A model that always picks the same number has entropy 0 bits(no information — you already know the answer).

007
Randomness score (0–100)

Randomness score (0–100)

The big amber number you see on every model page is our friendly summary. It combines the chi-square p-value (penalises bias) with the normalised entropy (rewards spread):

score = 0.5 × (p_value + entropy / log2(100)) × 100

A score of 100 is a perfect RNG. A score of 0 is a model that picks one number every time. Real LLMs land between 5 and 50. Even the best ones are dramatically biased compared to a coin flip.

008
Last-digit bias

Last-digit bias

Of the ten possible last digits (0–9), each one should appear about 10% of the time under a uniform draw. Models systematically over-pick the digit 7 (often 25–40% of all answers end in 7) and under-pick 0 and 5(round-looking numbers feel “non-random” to humans, and the training data reflects that).

This is one of the most robust findings in the dataset: it shows up across model families, across languages, and across the cheap and frontier tiers. It's also the easiest insight to share: most AIs have an obsession with 7.

009
Round-number avoidance

Round-number avoidance

The set {10, 20, 30, 40, 50, 60, 70, 80, 90, 100} would, under uniform draw, take 10% of the responses. Models give it roughly 2–5%. They've learned that humans flag round numbers as “not random,” and they mirror that prior.

This is the most counterintuitive finding for non-statisticians: a truly random RNG picks 50 just as often as it picks 47. The model actively avoids 50.

010
Caveats & limitations

Caveats & limitations

  • Default temperature.If you set temperature to 1.5 or use a top-p sampler, you'll see less bias. We test default behaviour because that's what most apps use.
  • No system prompt.A well-crafted instruction (“You are a fair random number generator. Use a hardware RNG…”) reduces bias. Production apps rarely use one for trivial calls.
  • Provider drift.Some “models” on OpenRouter route to multiple providers (Azure, Anthropic direct, etc.) which can have different sampling implementations. We treat the OpenRouter ID as the unit of analysis.
  • Sample size. Premium models get fewer samples per refresh because of cost. The randomness score is still meaningful with 50 samples, but the long tail of rare numbers is noisier.
  • Reasoning is on by default. For models with built-in reasoning (GPT-5, Gemini Pro, *thinkingvariants) we do not toggle it explicitly — we use the provider's default mode for each model ID. Our score for openai/gpt-5 reflects whatever GPT-5 does when you call it with no special flags, which is the realistic case for production code.
011
What else we measure (and what we don't)

What else we measure (and what we don't)

  • Hardware RNG baseline.We include a synthetic “model” called system/dev-urandom generated with Python's secrets.randbelow(100) + 1, which delegates to your operating system's cryptographic RNG. It sits in the ranking as the “what unbiased looks like” reference. Every LLM is measured against it.
  • Humans vs AI.“Pick a random number 1–100” is a famous psychology experiment. Humans heavily over-pick 7, 37, 47, 73 — an effect first quantified by Kubovy & Psotka (1976), “The predominance of seven and the apparent spontaneity of numerical choices” and replicated widely since (notably by Alex Bellos's 2014 favourite-number survey). We'd like to add a live human baseline crowdsourced from this site; the mini-game is the first step.
  • Other prompts. Random date, random word, random colour. Each needs its own pipeline. Open question for v2.
012
Why this matters

Why this matters

This site looks like a joke. It isn't. Treating an LLM as a source of randomness is a real bug pattern that ships in production code every day, especially with the rise of vibe-coding and AI-generated scripts.

  • Security.Never generate secrets with an LLM. Session tokens, API keys, password reset tokens, OTP codes, encryption nonces, UUIDs — if any of these come out of a model that picks 47 nine times out of a hundred, you've handed an attacker a 50× speed-up. We've seen real LLM-generated “generate a random API key” snippets land in startup codebases. They are not random.
  • Fairness.Randomised A/B test assignment, lottery draws, treatment vs control in trials, shuffled question order in exams — all rely on uniformity. A biased “random” assignment quietly invalidates the experiment.
  • Statistics & simulation. Monte Carlo runs, bootstrap resampling, anything where you need to be sure you're exploring the full space of outcomes. A model that loves 47 will keep visiting the same neighbourhood.
  • Test data.If you ask an LLM to “generate 100 random sample customer IDs,” you'll get clusters around its favourite digits and miss classes of bugs your real users will trigger.

The fix is always the same: use the right tool. The next section tells you what that is.

013
Good practice if you actually need randomness

Good practice if you actually need randomness

Every modern language ships a cryptographically secure RNG backed by your operating system's entropy pool (Linux getrandom(2), macOS arc4random, Windows BCryptGenRandom). Use it.

Pythonsecrets.randbelow(100) + 1 · 1..100 uniform, crypto-strong
JavaScript (browser)crypto.getRandomValues(new Uint32Array(1)) then modulo, or use the rejection-sampling pattern to avoid bias
Node.jscrypto.randomInt(1, 101) · built-in, unbiased
Gocrypto/rand.Int(rand.Reader, big.NewInt(100))
Rustrand::rngs::OsRng.gen_range(1..=100)
Java / KotlinSecureRandom.getInstanceStrong().nextInt(100) + 1
PostgreSQLfloor(random() * 100) + 1 · not crypto-grade, fine for non-security

Two practical notes:

  • Avoid the modulo bias trap. Naive randomInt() % 100 on a non-power-of-two range introduces a small but real bias. Built-ins like crypto.randomInt and secrets.randbelow already handle this with rejection sampling.
  • Don't use Math.random() for security. It's a fast statistical RNG, not a cryptographic one. Fine for dice rolls in a game, fatal for session tokens.

If you absolutely have to use an LLM (e.g. you're generating plausibly-human adversarial test data), at least raise temperature (≥ 1.3for OpenAI/Anthropic chat models), include a system prompt that explicitly instructs uniform sampling, and verify the output distribution against this site. Don't trust it.