Artificial intelligence (AI) models have been playing the popular tabletop role-playing game Dungeons & Dragons (D&D) so that researchers can test their ability to create long-term strategies and collaborate with both other AI systems and human players.

In a study presented at the NeurIPS 2025 conference, which ran from Dec. 2 to Dec. 7 in San Diego, researchers said D&D is an optimal test bed thanks to the game’s unique blend of creativity and rigid rules.

For the experiments, a single model could take on the role of the Dungeon Master (DM), the player who narrates the story and controls the monsters, or the role of a hero; each scenario had one DM and four heroes. In the framework built for the study, called D&D Agents, models can also play alongside other LLMs, and human players can fill any or all of the roles themselves. For instance, an LLM could serve as the DM while two LLMs and two human players controlled the heroes.
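
The study's code isn't reproduced in the article, but the mixed-table setup can be pictured in a few lines. Everything below (the class name, field names and model labels) is illustrative rather than taken from the actual D&D Agents codebase.

```python
from dataclasses import dataclass
from typing import Literal, Optional

Controller = Literal["human", "llm"]

@dataclass
class Seat:
    role: str                      # "DM" or a hero's class, e.g. "Rogue"
    controller: Controller         # who drives this seat
    model: Optional[str] = None    # backing LLM, if any

# One DM and four heroes; any mix of LLMs and humans can fill the seats.
table = [
    Seat("DM", "llm", model="claude-haiku-3.5"),
    Seat("Fighter", "llm", model="gpt-4"),
    Seat("Wizard", "llm", model="deepseek-v3"),
    Seat("Cleric", "human"),
    Seat("Rogue", "human"),
]
```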

“Dungeons & Dragons is a natural testing ground to evaluate multistep planning, adhering to rules and team strategy,” the study’s senior author, Raj Ammanabrolu, an assistant professor in the University of California, San Diego Department of Computer Science and Engineering, said in a statement. “Because play unfolds through dialog, D&D also opens a direct avenue for human-AI interaction: agents can assist or coplay with other people.”

The simulation doesn’t replicate an entire D&D campaign; instead, it focuses on combat encounters drawn from a pre-written adventure called “Lost Mine of Phandelver.” To set up a test, the team chose one of three combat scenarios from the adventure, a party of four characters, and the characters’ power level (low, medium or high). Each episode ran for 10 turns, after which the results were collected.
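
In practice, each test boils down to a handful of parameters. The sketch below shows one way the grid of episodes could be assembled; the encounter names and the run_episode helper are hypothetical stand-ins, not the paper's actual API.

```python
import itertools

# Illustrative parameters mirroring the setup described above.
SCENARIOS = ["encounter_1", "encounter_2", "encounter_3"]   # three combats from the adventure
POWER_LEVELS = ["low", "medium", "high"]
MAX_TURNS = 10                                              # each episode lasts 10 turns

def run_episode(scenario: str, power_level: str, max_turns: int = MAX_TURNS) -> dict:
    """Hypothetical stand-in: step the DM and four hero agents through one combat."""
    # ... agent turns would be simulated here ...
    return {"scenario": scenario, "power": power_level, "turns": max_turns}

# Run every scenario at every power level and collect the results.
results = [run_episode(s, p) for s, p in itertools.product(SCENARIOS, POWER_LEVELS)]
```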

A framework for strategy and decision-making

The researchers ran three AI models through the simulation (DeepSeek-V3, Claude Haiku 3.5 and GPT-4) and used their D&D play to gauge long-horizon planning and tool-use capabilities, among other qualities.

These capabilities are key for real-world applications such as supply chain optimization or designing manufacturing lines. The researchers also tested how well models could coordinate and plan together, which would apply to scenarios such as disaster-response modeling or multi-agent search-and-rescue systems.

Overall, Claude Haiku 3.5 demonstrated the best combat efficiency, particularly in harder scenarios. In easier scenarios, resource conservation was broadly similar across all three models. In D&D, resources are things like the number of spells or abilities a character can use each day or the number of healing potions available. Because these were isolated combat scenarios, there was little incentive to save resources for later, as you might if you were playing a complete adventure.

In more difficult situations, Claude Haiku 3.5 was more willing to burn through its allotted resources, which led to better outcomes. GPT-4 was close behind, and DeepSeek-V3 struggled the most.

The researchers also evaluated how well the models could stay in character throughout the simulation. They created an Acting Quality metric that isolated the models’ narrative speech (generated as text responses) and balanced how consistently the models stayed in character against how many distinct voices they sustained during play.
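
The article doesn't give the metric's exact formula, so the snippet below is only a guess at its shape: one score for staying in character, one for voice variety, combined so that a model has to do well on both.

```python
def acting_quality(in_character: float, voice_diversity: float) -> float:
    """Illustrative only: combine staying-in-character (0-1) and
    distinct-voices (0-1) scores so that neglecting either one drags
    the overall score down. The study's real formula may differ."""
    if in_character + voice_diversity == 0:
        return 0.0
    # A harmonic mean rewards models that balance both qualities.
    return 2 * in_character * voice_diversity / (in_character + voice_diversity)

# A model that stays in character but recycles one voice scores middling.
print(acting_quality(in_character=0.9, voice_diversity=0.4))  # ~0.55
```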

They found that DeepSeek-V3 generated lots of pithy, first-person barks and taunts (like “I dart left” or “Get them!”) but that it often reused the same voices. Claude Haiku 3.5, on the other hand, tailored its diction more specifically to the class or monster it was playing, whether it was a Holy Paladin or a nature-loving Druid. GPT-4, meanwhile, fell somewhere in the middle, producing a mix of in-character narration and meta-tactical phrasing.

Some of the most interesting and idiosyncratic combat barks came when the models were playing the role of monsters. Different creatures began to develop distinct personalities, leading to goblins shrieking mid-battle: “Heh — shiny man’s gonna bleed!”

The researchers said this sort of testing framework is important for evaluating how well models can operate without human input for long stretches. It’s a measure of an AI’s ability to act independently while remaining coherent and reliable — a capability that requires memory and strategic thinking.

In the future, the team hopes to implement full D&D campaigns that model all of the narrative and action outside of combat, further stressing AI’s creativity and ability to improvise in response to input from people or other LLMs.
