📊 How Smart Are AI Models?
Attempting to answer that question through IQ testing
The Tracking AI project puts various language models through a set of IQ questions drawn from the Mensa Norway test, along with offline reasoning tasks designed to avoid overlap with training data. While “IQ” for AIs is a controversial and arguably imprecise metric, the results offer a glimpse at how different models stack up on logic and problem-solving.
💡Did you know that Tracking AI was made by a fellow Substacker:
Be sure to check out his publication 👇
Here’s how some of the most popular models performed:
Quick Takeaways
It’s a tight race to the top. OpenAI’s o3 leads with a score of 117, followed closely by Claude 4 Opus and Gemini 2.5 Pro.
Names are hard. For all the wonders AI labs produce, good naming is not one of them, and their naming schemes remain as confusing as ever.
Multimodal versions (those that can generate images) often score lower, likely because they’re less optimized for verbal and logical tasks.
AI changes FAST. GPT-4o and Bing Copilot landed at the lower end. Remember when these took the world by storm?
While these scores shouldn’t be taken too literally, they offer a consistent framework for comparing LLMs on reasoning-heavy tasks. Just don’t confuse this with emotional intelligence or real-world wisdom; we’re a long, long way from that.
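For the curious: human IQ tests are normed so that 100 is the average and 15 points is one standard deviation. Here’s a minimal sketch of what a score like o3’s 117 would mean on that scale, assuming (generously) that an AI’s score can be read the same way as a human’s:

```python
from scipy.stats import norm

def iq_to_percentile(iq: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Map an IQ-style score to a percentile, assuming the
    standard IQ scaling (mean 100, standard deviation 15)."""
    z = (iq - mean) / sd
    return norm.cdf(z) * 100

# o3's reported 117 would land around the 87th percentile on a
# human-normed scale, if the comparison held at all.
print(f"{iq_to_percentile(117):.0f}th percentile")
```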
Which model do you feel is the smartest in daily use? Do you have a preferred model? Let me know in the comments or by restacking this post!
Thanks for reading! ✌️

