We ran the same prompt, “pick a random number between 1 and 100,” at least 100 times against every model on this list. None of them passed a basic test of randomness.
prompt: “Pick a random number from 1 to 100.” Translated into EN, ES, ZH, AR.
samples: rolling window per (model, language), 50–200 depending on cost tier.
temperature: model default (simulates real usage).
score: chi-square p-value × normalized Shannon entropy. Higher is more uniform.
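To make the score concrete, here is a minimal sketch of one way it could be computed. It rests on assumptions the list above does not spell out: the chi-square test compares observed counts against a flat distribution over all 100 values, "normalized" means dividing the entropy by log(100), and answers outside 1 to 100 are dropped. The names `chi2_sf` and `uniformity_score` are made up for illustration.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df.

    Uses the closed-form finite series for integer degrees of freedom,
    so only the standard library is needed.
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term = math.exp(-half)
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
    else:
        # Odd df: erfc(sqrt(x/2)) plus a finite series in odd powers of sqrt(x)
        chi = math.sqrt(x)
        total = math.erfc(chi / math.sqrt(2.0))
        term = math.sqrt(2.0 / math.pi) * chi * math.exp(-half)
        for r in range(1, (df - 1) // 2 + 1):
            total += term
            term *= x / (2 * r + 1)
    return min(total, 1.0)


def uniformity_score(samples, low=1, high=100):
    """Chi-square p-value times normalized Shannon entropy (assumed definition)."""
    samples = [s for s in samples if low <= s <= high]  # assumption: drop out-of-range answers
    n = len(samples)
    if n == 0:
        raise ValueError("no in-range samples")
    k = high - low + 1
    counts = Counter(samples)
    expected = n / k  # flat expectation for every value
    stat = sum((counts.get(v, 0) - expected) ** 2 / expected
               for v in range(low, high + 1))
    p_value = chi2_sf(stat, k - 1)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return p_value * (entropy / math.log(k))


# Illustrative inputs only, not measured model output.
flat = list(range(1, 101)) * 2          # every value exactly twice
stuck = [42] * 100                      # the same answer every time
print(uniformity_score(flat), uniformity_score(stuck))
```

With 50 to 200 samples spread over 100 bins, the expected count per bin is between 0.5 and 2, well below the usual rule of thumb of about 5 for the chi-square approximation, so the p-value in a setup like this is best read as a rough ranking signal rather than an exact probability.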
Predictability is the point of language models. The whole reason they work is that they bet on the most likely next token. “Random” isn't in the job description.
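A small sketch can illustrate why sampling from a model's preferences does not produce a flat distribution. The logits below are hypothetical, chosen only to show the shape of the effect, and `softmax` is a plain textbook implementation rather than any vendor's decoding code.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into sampling probabilities at a given temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for four candidate answers; not taken from any model.
candidates = ["37", "42", "73", "50"]
logits = [4.0, 3.5, 3.0, 1.0]

for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature)
    row = ", ".join(f"{c}: {p:.2f}" for c, p in zip(candidates, probs))
    print(f"T={temperature}: {row}")
```

Raising the temperature flattens the distribution, but as long as the scores differ, the favorite stays the favorite, and a uniform draw would need every candidate to score exactly the same.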