Chollet's ARC-AGI-3: Humans Ace It, GPT-5.5 Scores 0.43%
François Chollet left Google to build ARC-AGI-3, a reasoning benchmark where humans score 100% but frontier LLMs like GPT-5.5 manage just 0.43%, exposing the gap between pattern matching and true intelligence.
François Chollet, the creator of Keras and one of the most influential voices in deep learning, has left Google to pursue what may be the most important question in AI: can machines actually reason, or are they just very good at memorizing patterns? His answer comes in the form of ARC-AGI-3, the latest iteration of a benchmark that humans solve effortlessly while the world's most advanced large language models struggle to score above 1%.
The numbers are stark. On ARC-AGI-3, human test-takers achieve a perfect 100% success rate. GPT-5.5, OpenAI's frontier reasoning model, manages just 0.43%. That gap, more than two orders of magnitude, is precisely the point.
Why Chollet Walked Away From Google
Chollet spent nearly a decade at Google, where he built Keras into the most widely used deep learning library on the planet. But he has also been the field's most prominent skeptic of the scaling hypothesis — the idea that simply throwing more parameters, data, and compute at transformers will eventually produce general intelligence.
In January 2025, Chollet co-founded Ndea, a new lab focused on building AI systems capable of genuine abstraction and on-the-fly reasoning rather than pattern retrieval. ARC-AGI is the measuring stick. If you can't define intelligence operationally, Chollet argues, you can't claim to have built it.
What Makes ARC-AGI-3 Different
The Abstraction and Reasoning Corpus (ARC) was first introduced by Chollet in his 2019 paper On the Measure of Intelligence. Each task presents a small grid puzzle with two or three input-output examples, and the system must infer the underlying transformation rule and apply it to a novel input. The puzzles require core priors that humans develop in early childhood — object permanence, geometry, counting, symmetry.
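The public ARC dataset distributes each task as JSON: a list of training input-output grid pairs plus held-out test inputs, where every grid is a small 2D array of color indices 0-9. A minimal illustration of that structure, using a toy puzzle invented here (not an actual corpus task):

```python
# Toy ARC-style task in the dataset's train/test shape.
# The hidden rule in this invented example: mirror each grid horizontally.
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 4], [0, 5]], "output": [[4, 3], [5, 0]]},
    ],
    "test": [{"input": [[6, 0], [0, 7]]}],
}

def mirror(grid):
    """Reverse each row (horizontal mirror)."""
    return [row[::-1] for row in grid]

# A solver must infer the rule from the training pairs alone,
# verify it against every pair, then apply it to the test input.
assert all(mirror(p["input"]) == p["output"] for p in task["train"])
prediction = mirror(task["test"][0]["input"])
print(prediction)  # [[0, 6], [7, 0]]
```

The difficulty is not executing a known rule, as above, but discovering which of infinitely many candidate rules the two or three examples pin down.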
Crucially, ARC tasks are designed to be resistant to memorization. Each puzzle is unique. There is no training set that overlaps with the test set in any meaningful way. A model that has digested the entire internet gets no advantage, because the answer cannot be looked up — it must be reasoned out.
ARC-AGI-3 raises the bar further. It introduces interactive, multi-step environments and richer compositional structure, while still remaining trivial for human solvers. It is, in essence, a stress test for fluid intelligence.
Why LLMs Collapse
Frontier models like GPT-5.5, Claude, and Gemini are extraordinary interpolators. Given a problem that resembles something in their training distribution, they can produce remarkably fluent and often correct answers. But ARC-AGI-3 is engineered to fall outside that distribution by construction.
What the benchmark exposes is a fundamental architectural limitation: transformer-based LLMs perform approximate retrieval over a learned manifold. They do not, in any rigorous sense, construct new programs on the fly. When a task requires synthesizing a novel rule from two examples and executing it precisely, the statistical machinery breaks down.
Chollet's proposed alternative, which Ndea is reportedly pursuing, combines deep learning with program synthesis — search over a space of discrete programs guided by neural intuition. This hybrid approach has produced the strongest ARC results to date, well above pure LLM baselines.
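A crude sketch of the search half of that idea, with the neural guidance omitted: enumerate compositions of discrete grid primitives and keep the first program consistent with every training pair. The primitives and the puzzle below are invented for illustration; real ARC solvers use far richer DSLs and learned priors to prune the search.

```python
from itertools import product

# Invented primitive grid operations (a toy DSL).
def mirror(g):    return [row[::-1] for row in g]       # flip left-right
def flip(g):      return g[::-1]                         # flip top-bottom
def transpose(g): return [list(r) for r in zip(*g)]      # swap rows/cols

PRIMITIVES = [mirror, flip, transpose]

def synthesize(pairs, max_depth=3):
    """Brute-force search over compositions of primitives; return the
    first program that maps every input grid to its output grid."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            def run(g, program=program):
                for op in program:
                    g = op(g)
                return g
            if all(run(i) == o for i, o in pairs):
                return program
    return None

# Two demonstration pairs whose hidden rule is a 90-degree clockwise rotation.
pairs = [([[1, 2], [3, 4]], [[3, 1], [4, 2]]),
         ([[5, 6], [7, 8]], [[7, 5], [8, 6]])]
prog = synthesize(pairs)
print([f.__name__ for f in prog])  # ['flip', 'transpose']
```

Neural guidance enters by replacing the blind enumeration order with a learned ranking over which primitives to try first, which is what makes the search tractable on harder tasks.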
Implications for Synthetic Media and Authenticity
Why does this matter for the AI video and synthetic media space? Because the same architectural limits that prevent LLMs from solving ARC also shape what generative video models can and cannot do. Today's diffusion-based video generators excel at interpolating within their training distribution — producing photorealistic clips that resemble what they have seen. They struggle with novel physical reasoning, consistent multi-object interactions over long horizons, and tasks that require true causal modeling.
If Chollet is right, the path to video models that genuinely understand the scenes they generate — rather than merely sampling plausible pixels — runs through the same kind of program synthesis and reasoning architecture he is championing. Detection of synthetic content may also benefit: systems that can reason about scene consistency, rather than just spot statistical artifacts, would be far harder to fool.
The Bottom Line
ARC-AGI-3 is not a claim that current AI is useless. It is a claim that the industry's headline benchmarks have been quietly drifting toward measures that reward scale rather than intelligence. With humans at 100% and GPT-5.5 at 0.43%, Chollet has handed the field a problem it cannot benchmark its way out of — and a reminder that the road to AGI may require more than another order of magnitude of GPUs.
Stay informed on AI video and digital authenticity. Follow Skrew AI News.