
Artificial intelligence is already outperforming humans at various intelligence-based activities ranging from chess to pattern recognition. Now, experts claim they’re a year away from acing “Humanity’s Last Exam” (HLE) — a test so difficult that only our best and brightest can pass it.

“Model builders have really done a great job at improving these reasoning models,” Calvin Zhang, the research lead at Scale, the AI firm behind HLE, told The Times of London.

Developed to see how close AI is to the “frontiers of human expertise,” the intelligence benchmark comprises 2,500 questions spanning more than 100 highly specialized fields, ranging from mythology to rocket science.

More than 1,000 authorities from across the sciences, humanities and arts contributed to the HLE, which was designed to require PhD-level comprehension to ace — just beyond the expertise of AI, Neuroscience News reported.

Zhang said the ultimate goal was to create a “closed-ended academic benchmark, set to the frontier of expert humans, that only a handful of people on Earth can really solve.”

Nonetheless, AI’s performance on the HLE has improved rapidly in a short period of time. While ChatGPT answered fewer than 3% of the questions correctly on its first attempt in 2024, rival Google Gemini got 18.8% of them right within months.

Last month, that number improved to over 45%.

Zhang believes that AI could approach full marks within a year — anyone scoring close to 100% is defined as a “universal expert.”

“If we truly cared about this as the only thing in life, I think we could get to it pretty quickly,” said Kate Olszewska, a product manager at Google DeepMind.

This light-speed progress is impressive given the pains Scale took to make the HLE AI-proof. The test-makers reportedly offered a $500,000 prize to experts who could contribute questions that could not be easily answered via web search, eventually drawing over 70,000 responses.

Any questions that could be answered by existing models were discarded until the exam was whittled down to the 2,500 most AI-resistant queries. For instance, test-takers might be asked to translate ancient Palmyrene inscriptions or to identify microanatomical structures in birds.

To further ensure the test was AI-ironclad, the team kept most of the answers hidden so that later models couldn’t memorize them.

“Humanity’s Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence,” declared Dr. Tung Nguyen, a computer science and engineering professor at Texas A&M who contributed 73 of the questions (the second most).

He argued that while some of the aforementioned models performed well, the poor scores of the rest illustrate that the chasm between AI and human intelligence remains “wide.”

“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human‑level understanding,” Nguyen said. “But HLE reminds us that intelligence isn’t just about pattern recognition — it’s about depth, context and specialized expertise.”

The techspert said that the ultimate goal wasn’t to stump AI, but rather to illustrate the systems’ strengths and weaknesses.

In turn, this would help us build “safer, more reliable technologies” while also demonstrating “why human expertise still matters” — an important goal in a world where AI seems to be replacing us in every sector from fast food to medicine.

That being said, AI has displayed a surprisingly humanlike aptitude for problem-solving, demonstrating that its processing powers aren’t relegated to rote memory.

In 2025, tests by Chinese researchers revealed similarities between the AI models’ “perception” and human cognition — particularly when it came to language grouping.

From this, researchers deduced that the machine learners “develop human-like conceptual representations of objects.”

“Further analysis showed strong alignment between model embeddings and neural activity patterns” in the region of the brain associated with memory and scene recognition.
