
Artificial intelligence is already outperforming humans at various intelligence-based activities ranging from chess to pattern recognition. Now, experts claim they’re a year away from acing “Humanity’s Last Exam” (HLE) — a test so difficult that only our best and brightest can pass it.

“Model builders have really done a great job at improving these reasoning models,” Calvin Zhang, the research lead at Scale, the AI firm behind HLE, told The Times of London.

Developed to see how close AI is to the “frontiers of human expertise,” the intelligence benchmark comprises 2,500 questions spanning more than 100 highly specialized fields, ranging from mythology to rocket science.

More than 1,000 authorities from across the sciences, humanities and arts contributed to the HLE, which was designed to require PhD-level comprehension to ace — just beyond the expertise of AI, Neuroscience News reported.

Zhang said the ultimate goal was to create a “closed-ended academic benchmark, set to the frontier of expert humans, that only a handful of people on Earth can really solve.”

Nonetheless, AI’s performance on the HLE has improved rapidly in a short period of time. While ChatGPT answered fewer than 3% of the questions correctly on its first attempt in 2024, rival Google Gemini got 18.8% of them right within months.

Last month, that number improved to over 45%.

Zhang believes that AI could approach full marks within a year — anyone scoring close to 100% is defined as a “universal expert.”

“If we truly cared about this as the only thing in life, I think we could get to it pretty quickly,” said Kate Olszewska, a product manager at Google DeepMind.

This light-speed progress is impressive given the pains Scale took to make the HLE AI-proof. The test-makers reportedly offered a $500,000 prize to experts who could contribute questions that could not be easily answered via web search, eventually drawing over 70,000 responses.

Any questions that could be answered by existing models were discarded until the exam was whittled down to the 2,500 most AI-resistant queries. For instance, test-takers might be asked to translate ancient Palmyrene inscriptions or to identify microanatomical structures in birds.

To further ensure the test was AI-ironclad, the team kept most of the answers hidden so that later models couldn’t memorize them.

“Humanity’s Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence,” declared Dr. Tung Nguyen, a computer science and engineering professor at Texas A&M who contributed 73 of the questions (the second most).

He argued that while some of the aforementioned models performed well, the poor scores of the rest illustrate that the chasm between AI and human intelligence remains “wide.”

“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human‑level understanding,” Nguyen said. “But HLE reminds us that intelligence isn’t just about pattern recognition — it’s about depth, context and specialized expertise.”

The techspert said that the ultimate goal wasn’t to stump AI, but rather to illustrate the systems’ strengths and weaknesses.

In turn, this would help us build “safer, more reliable technologies” while also demonstrating “why human expertise still matters” — an important goal in a world where AI seems to be replacing us in every sector from fast food to medicine.

That being said, AI has displayed a surprisingly humanlike aptitude for problem-solving, demonstrating that its processing powers aren’t relegated to rote memory.

In 2025, tests by Chinese researchers revealed similarities between the AI models’ “perception” and human cognition — particularly when it came to language grouping.

From this, researchers deduced that the machine learners “develop human-like conceptual representations of objects.”

“Further analysis showed strong alignment between model embeddings and neural activity patterns” in the region of the brain associated with memory and scene recognition.
