> Grok 4 is [...] the most intelligent model so far
A bit too much praise for a model that's barely ahead of the competition in a subset of benchmarks...
> To be honest, this model not only competes with other AI models but also with humans, making it the first of its kind
I'm out
I keep seeing these Grok 4 intelligence claims, so I tried something very simple: "Animate a round robin tournament for 10 people."
Results:
Claude: ~10s, perfect working demo
ChatGPT: ~20s, solid solution
Grok 4: ~1000s, failed completely, gave me a truncated base64 blob
This wasn't some obscure edge case... it was basic data visualization that any decent model should handle. Yet somehow Grok 4 is "competing with humans" and has "99% tool accuracy"...
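For context, the non-animation half of that prompt is a textbook exercise. Here's a minimal sketch of the pairing logic in Python (the standard circle method; assumes an even player count, so 10 works):

    # Circle method: fix player 0 in place and rotate everyone else
    # one seat per round. Assumes an even number of players (a bye
    # slot would be needed for odd n).
    def round_robin(n):
        players = list(range(n))
        rounds = []
        for _ in range(n - 1):
            # pair first vs last, second vs second-to-last, etc.
            rounds.append([(players[i], players[n - 1 - i]) for i in range(n // 2)])
            # rotate all but the fixed first player
            players = [players[0]] + [players[-1]] + players[1:-1]
        return rounds

    for rnd, pairs in enumerate(round_robin(10), start=1):
        print(f"Round {rnd}: {pairs}")

Nine rounds, five matches each, every player meets every other exactly once. Animating it is just stepping through the rounds.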
I don't buy it.
Links:
Claude: https://claude.ai/share/7a413a6a-5c01-44a1-aaed-8b237e5e9e94
ChatGPT: https://chatgpt.com/canvas/shared/687a9f9d4304819187ac7d98d3...
Grok 4: https://grok.com/share/c2hhcmQtMw%3D%3D_20b61291-e1bb-45e5-a...
These benchmarks are either just wrong or measuring something completely divorced from practical utility imo...
> Grok 4 has about 99% accuracy in picking the right tools and making tool calls with proper arguments almost every single time.
Where did this number come from? What is "the right tool"? I find this extremely subjective. As most engineers know, there is no single right tool; mostly it's a compromise where you pick the least-worst option and decide which risks you're willing to manage.
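For what it's worth, numbers like this usually come from a labeled eval set where a call is scored correct if it names the annotator's expected tool and its arguments check out. A hypothetical sketch of such a scorer (the field names and scoring rule here are my assumptions, not xAI's published methodology):

    # Hypothetical "tool accuracy" scorer: a call counts as correct only
    # if it names the expected tool AND every schema'd argument is present
    # with the right type. The example format is invented for illustration,
    # not a real eval harness.
    def score_tool_calls(examples):
        correct = 0
        for ex in examples:
            call = ex["model_call"]  # e.g. {"tool": "search", "args": {"query": "..."}}
            name_ok = call["tool"] == ex["expected_tool"]
            args_ok = all(
                k in call["args"] and isinstance(call["args"][k], t)
                for k, t in ex["arg_schema"].items()
            )
            correct += name_ok and args_ok
        return correct / len(examples)

Which just relocates the subjectivity into whoever labeled the expected tool in the first place.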
Your intelligence graph shows Grok 4 and OpenAI o4-mini as comparable (and among the highest-rated models), but it doesn't include OpenAI o3 or o3-pro.
Yet all of my tests show o3 blows o4-mini out of the water.
What are you classifying as intelligence?
This article seems like pure garbage