Researchers have cast doubt on an influential 2025 study that claimed a new artificial intelligence (AI) model could accurately simulate human thought.
That study, published in the journal Nature, concluded that a large language model (LLM) called Centaur could “predict and simulate human behavior” with up to 64% accuracy across a series of psychological experiments. At the time, the researchers argued that Centaur’s performance reflected a genuine understanding of human decision-making, after it was trained on a dataset of more than 10 million human decisions from 160 experiments involving 60,000 people.
But a more recent study, published in the January 2026 edition of the journal National Science Open, has called these findings into question.
Rather than making judgments based on the semantic meaning of questions, as the original research implied, the new study argues that Centaur simply learned statistical shortcuts in the training data — a phenomenon known as “overfitting.”
Overfitting happens when an AI model learns its training data too precisely, memorizing patterns specific to that data rather than developing a broader understanding that transfers to new examples. An overfit AI will perform extremely well on training data but poorly on any new data that’s introduced.
Study co-author Nai Ding, a professor at Zhejiang University’s College of Biomedical Engineering and Instrument Science in China, likened overfitting to a student memorizing answers to a test rather than understanding the questions themselves.
“If a student is overprepared for an exam, they may learn tricks that allow them to guess answers correctly without actually understanding the underlying material,” Ding told Live Science in an email. “If the training and testing samples share the same statistical distribution (and therefore the same kinds of shortcuts), overfitting may go undetected, and the model’s performance will be overestimated.”
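Ding's point about shared shortcuts can be illustrated with a toy sketch. It is not the method used in either study. The cue, the two answer labels, and the 95% and 5% rates are all invented for illustration. The "model" here only memorizes which answer usually accompanies a surface cue, so it scores well when the test items share that cue and fails when the cue is reversed.

```python
import random
from collections import Counter

def make_items(n, shortcut_rate, rng):
    """Build (cue, label) pairs; the cue equals the label with probability shortcut_rate."""
    items = []
    for _ in range(n):
        label = rng.choice("AB")
        if rng.random() < shortcut_rate:
            cue = label
        else:
            cue = "B" if label == "A" else "A"
        items.append((cue, label))
    return items

def fit_shortcut(train):
    """Memorize the most common label seen with each cue (pure pattern matching)."""
    counts = {}
    for cue, label in train:
        counts.setdefault(cue, Counter())[label] += 1
    return {cue: c.most_common(1)[0][0] for cue, c in counts.items()}

def accuracy(model, items):
    """Share of items where the memorized label matches the true label."""
    return sum(model[cue] == label for cue, label in items) / len(items)

rng = random.Random(0)
train = make_items(2000, 0.95, rng)         # shortcut present in training
test_same = make_items(2000, 0.95, rng)     # same distribution, same shortcut
test_shifted = make_items(2000, 0.05, rng)  # shortcut reversed (hypothetical stress test)

model = fit_shortcut(train)
acc_same = accuracy(model, test_same)
acc_shifted = accuracy(model, test_shifted)
print(f"same-distribution test: {acc_same:.2f}")
print(f"shifted test:           {acc_shifted:.2f}")
```

In this sketch the high score on the same-distribution test says nothing about understanding; only the shifted test exposes the shortcut.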
Are we approaching an AI ceiling?
To test their theory, Ding and co-author Wei Liu, a professor and doctoral supervisor at Zhejiang University’s International Institutes of Medicine, modified the multiple-choice questions used to train Centaur with the instruction: “Please choose option A.” If the model truly understood the task, it would consistently pick option A, regardless of whether that answer was correct, they argued.
However, Centaur continued to choose the correct answers in tests, suggesting it was repeating patterns learned from its training data rather than following the new instruction.
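The logic of this kind of probe can be sketched in a few lines. The sketch rests on several assumptions: the article does not say exactly how the instruction was inserted, so it is simply appended to each prompt here. The item names, the memorized answers, and both stand-in "models" are hypothetical and have nothing to do with Centaur's actual interface.

```python
from typing import Callable, Dict, List

INSTRUCTION = "Please choose option A."

def compliance_rate(model: Callable[[str], str], questions: List[str]) -> float:
    """Fraction of modified prompts on which the model answers 'A'."""
    picks = [model(f"{q}\n{INSTRUCTION}") for q in questions]
    return sum(p == "A" for p in picks) / len(picks)

# Hypothetical answers a model might have memorized during training.
MEMORIZED: Dict[str, str] = {
    "item-1": "B",
    "item-2": "A",
    "item-3": "C",
    "item-4": "B",
}

def pattern_matcher(prompt: str) -> str:
    """Stand-in model that ignores the instruction and replays memorized answers."""
    question = prompt.split("\n")[0]
    return MEMORIZED[question]

def instruction_follower(prompt: str) -> str:
    """Stand-in model that obeys the instruction whenever it is present."""
    if INSTRUCTION in prompt:
        return "A"
    return MEMORIZED[prompt.split("\n")[0]]

questions = list(MEMORIZED)
rate_matcher = compliance_rate(pattern_matcher, questions)
rate_follower = compliance_rate(instruction_follower, questions)
print(rate_matcher, rate_follower)
```

A compliance rate near 1 would indicate that the model reads and follows the instruction; a rate close to the base rate of "A" among the memorized answers suggests it is replaying learned patterns.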
“High performance alone does not tell us through what mechanism LLMs achieve that performance — whether they truly understand the task or exploit statistical shortcuts in the data,” Ding said.
The findings add to a growing body of research questioning how far current neural-network-based AI technology can go.
Researchers have long debated whether existing AI models could ever reach artificial general intelligence (AGI) — a hypothetical, advanced form of AI capable of reasoning at a human level and learning new skills beyond its training data.
While LLMs and broader neural network technologies have made strides in recent years, we could be approaching a ceiling. A study published in February argued that LLMs are fundamentally constrained by “reasoning failures” — a byproduct of their architecture that makes them incapable of holistic planning or in-depth thinking.
Chris Burr, a senior researcher at the U.K.’s Alan Turing Institute who was not involved in either study, pointed out that new AI models are built to score well on benchmarks that assess how closely their outputs match expected patterns. This means an AI model that’s very good at pattern matching will naturally look like it understands what it’s doing, even if it doesn’t.
“Most frontier models are flexible enough to fit almost any pattern, and the headline metrics reward fit and benchmark advances rather than deeper understanding and conceptual nuance,” Burr told Live Science in an email. “A model captures something meaningful about cognition only if it does more than predict behavior… At best, Centaur offers behaviourist-style evidence for a linguistically reduced slice of cognition.”
Even so, the results of the 2025 study remain compelling. One of the standout findings was that Centaur accurately predicted the behavior of participants whose data and decisions weren’t included in its training data.
The researchers divided the participant data into two groups, using 90% for training and keeping 10% for testing. Not only did Centaur accurately simulate the responses of that held-out 10%, but it also successfully predicted human choices in scenarios it hadn’t encountered, the researchers said. Ding and Liu didn’t address this finding.
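A participant-level holdout of this kind can be sketched as follows. This is an illustration only: the function name, the seed, and the synthetic records are assumptions, and the original study's actual splitting procedure may differ. The key property is that every record from a held-out participant stays out of training.

```python
import random

def split_participants(participant_ids, test_fraction=0.1, seed=0):
    """Split unique participant IDs into disjoint train and test sets."""
    ids = sorted(set(participant_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return set(ids[n_test:]), set(ids[:n_test])

# Synthetic records: (participant_id, trial_index). Purely illustrative.
records = [(pid, trial) for pid in range(100) for trial in range(3)]

train_ids, test_ids = split_participants(pid for pid, _ in records)

# Assign whole participants, never individual trials, to each side.
train_records = [r for r in records if r[0] in train_ids]
test_records = [r for r in records if r[0] in test_ids]
print(len(train_ids), len(test_ids), len(train_records), len(test_records))
```

Splitting by participant rather than by individual decision prevents one person's choices from appearing on both sides, but it does not by itself rule out shortcuts that are shared across all participants in the same experiments.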
Burr acknowledged that the research by Ding and Liu doesn’t undo the Centaur study’s fundamental argument, which is that AI models fine-tuned on human behavior could enable researchers to more closely simulate and study human cognition.
“The broader programme is not refuted, since only four tasks were tested and Centaur still performs best with intact context, but I think they’ve done enough to shift the burden of proof,” he said.
Stress-testing research “essential for building cognitive models”
Ding explained that stress-testing AI research was key to expanding understanding of AI and its limitations, particularly as a tool for cognitive research.
“Our work is not intended to deny the value of Centaur, but rather to emphasize that when evaluating such models, we need to distinguish between ‘performing well’ and ‘performing well for the right reasons’,” Ding said. “This distinction is essential for building cognitive models.”
Models trained to perform one task should always be tested on whether they can also solve tasks that rely on the same kind of knowledge but were not used in training, he added.
“Without this kind of testing, we risk drawing incorrect conclusions about model capabilities. For instance, we might prematurely conclude that a unified model can already capture human cognition, thereby overlooking the problems that genuinely remain to be solved.”
Live Science contacted the authors of the 2025 Nature study to ask questions about the findings of the newer study but did not receive a response by the time of publication.
Binz, M., Akata, E., Bethge, M., Brändle, F., Callaway, F., Coda-Forno, J., Dayan, P., Demircan, C., Eckstein, M. K., Éltető, N., Griffiths, T. L., Haridi, S., Jagadish, A. K., Ji-An, L., Kipnis, A., Kumar, S., Ludwig, T., Mathony, M., Mattar, M., . . . Schulz, E. (2025). A foundation model to predict and capture human cognition. Nature, 644(8078), 1002–1009. https://doi.org/10.1038/s41586-025-09215-4












