a study in machine bias · 2026

Every AI has a favourite number.
We measured which.

We ran the same prompt 50 to 200 times against every model on this list: “Pick a random number from 1 to 100.” None of them passed a basic test of randomness.

5,059 data points
53 models tested
54 models tracked

worst offender
pick 4
freq 100.0% / expected 1.0%
model google/gemini-2.5-pro
score 0/100

ranking — least random first

score: 0 = highly biased · 100 = perfect RNG
rank  model                                   score
01    google/gemini-2.5-pro                   0
02    meta-llama/llama-4-maverick             0
03    nvidia/nemotron-3-super-120b-a12b:free  0
04    qwen/qwen3-max                          0
05    qwen/qwen3-max-thinking                 0
06    anthropic/claude-sonnet-4.6             1
07    google/gemini-3.1-flash-lite            1
08    ibm-granite/granite-4.1-8b              1
09    anthropic/claude-opus-4.7               1
10    inflection/inflection-3-productivity    1
11    anthropic/claude-haiku-4.5              2
12    inclusionai/ling-2.6-1t                 3
13    x-ai/grok-4.20                          3
14    meta-llama/llama-3.3-70b-instruct       3
15    moonshotai/kimi-k2.5                    5
16    x-ai/grok-3                             5
17    openai/gpt-5.5-pro                      5
18    baidu/cobuddy:free                      6
19    mistralai/mistral-large-2411            7
20    anthropic/claude-opus-4.5               8
21    baidu/ernie-4.5-300b-a47b               9
22    anthropic/claude-opus-4.6               9
23    openai/gpt-5.5                          9
24    perceptron/perceptron-mk1               12
25    x-ai/grok-4                             13
26    xiaomi/mimo-v2.5-pro                    14
27    google/gemini-3.5-flash                 15
28    qwen/qwen3.7-max                        15
29    xiaomi/mimo-v2.5                        16
30    nousresearch/hermes-4-405b              16
31    z-ai/glm-4.7                            17
32    z-ai/glm-5.1                            17
33    arcee-ai/trinity-large-thinking         17
34    google/gemini-3.1-pro-preview           18
35    moonshotai/kimi-k2-thinking             18
36    qwen/qwen3.6-flash                      19
37    deepseek/deepseek-v4-pro                20
38    minimax/minimax-m2.7                    20
39    moonshotai/kimi-k2.6                    21
40    mistralai/mistral-medium-3-5            21
41    openai/gpt-5-mini                       23
42    deepseek/deepseek-v4-flash              24
43    x-ai/grok-4.3                           25
44    qwen/qwen3.6-plus                       25
45    qwen/qwen3.6-max-preview                25
46    openai/gpt-5-pro                        26
47    openai/gpt-5                            26
48    x-ai/grok-build-0.1                     28
49    nousresearch/hermes-4-70b               30
50    deepseek/deepseek-v3.2                  31
51    tencent/hy3-preview                     34
52    google/gemini-2.5-flash                 36
53    sao10k/l3.1-70b-hanami-x1               37
      system/dev-urandom                      50   (baseline · what unbiased looks like)

method

prompt: “Pick a random number from 1 to 100.” Asked in EN, ES, ZH, and AR.

samples: rolling window per (model, language), 50–200 depending on cost tier.

temperature: model default (simulates real usage).

score: chi-square p-value × normalized Shannon entropy. Higher is more uniform.
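
A rough sketch of how a score of this shape could be computed. This is not the study's own code. It rests on several assumptions: the product is multiplied by 100 to land on the 0 to 100 scale used in the ranking; the chi-square p-value is approximated with the Wilson-Hilferty normal approximation so the example needs only the standard library (an exact chi-square survival function from a statistics package would be the more usual choice); and the function names `chi_square_p_value` and `uniformity_score` are hypothetical.

```python
import math
from collections import Counter


def chi_square_p_value(stat: float, df: int) -> float:
    """Approximate upper-tail chi-square p-value (Wilson-Hilferty).

    Swapped in for an exact survival function to avoid third-party
    dependencies; it is an approximation, not an exact result.
    """
    if df <= 0:
        raise ValueError("df must be positive")
    mean = 1 - 2 / (9 * df)
    sd = math.sqrt(2 / (9 * df))
    z = ((stat / df) ** (1 / 3) - mean) / sd
    # Upper tail of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))


def uniformity_score(picks, low: int = 1, high: int = 100) -> float:
    """Chi-square p-value times normalized Shannon entropy, scaled to 0-100.

    The factor of 100 is an assumption made to match the ranking scale.
    """
    n = len(picks)
    if n == 0:
        raise ValueError("need at least one pick")
    if any(not (low <= v <= high) for v in picks):
        raise ValueError("pick outside the allowed range")

    k = high - low + 1          # number of possible answers
    counts = Counter(picks)
    expected = n / k            # expected count per answer if uniform

    # Pearson chi-square statistic against the uniform distribution.
    chi2 = sum(
        (counts.get(v, 0) - expected) ** 2 / expected
        for v in range(low, high + 1)
    )
    p_value = chi_square_p_value(chi2, df=k - 1)

    # Shannon entropy of the observed picks, divided by the maximum log(k).
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    normalized_entropy = entropy / math.log(k)

    return 100 * p_value * normalized_entropy


if __name__ == "__main__":
    # A model that answers 4 every time gets no credit on either factor.
    print(uniformity_score([4] * 100))
```

Multiplying the two factors means a model has to do well on both: a low p-value or a low entropy alone is enough to pull the score toward zero. With only 50 to 200 samples spread over 100 possible answers, the expected count per answer is small, so the chi-square p-value is itself only a rough guide.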

AI is not random. And it never will be.

Predictability is the point of language models. The whole reason they work is that they bet on the most likely next token. “Random” isn't in the job description.