The paper’s practical point: a synthetic society that only produces plausible opinions is not yet simulating society.

Source note: Chance Jiajie Li, Jiayi Wu, Zhenze Mo, Ao Qu, Yuhan Tang, Kaiya Ivy Zhao, Yulu Gan, Jie Fan, Jiangbo Yu, Jinhua Zhao, Paul Pu Liang, Luis Alberto Alonso Pastor, and Kent Larson. “Simulating Society Requires Simulating Thought.” NeurIPS 2025 Position Paper Track, online October 29, 2025. https://openreview.net/forum?id=EvXWexakZX

Why This Paper Matters

LLM social simulations are becoming tempting.

Instead of running an expensive survey, a researcher can ask simulated voters how they might respond to a housing policy. Instead of convening a focus group, a product team can create synthetic users. Instead of modeling a city with hand-coded agents, a policy group can prompt a population of LLM personas and watch opinions move.

The problem is that plausible responses are easy to confuse with valid simulation.

This paper argues that many current LLM-based social simulations still follow a simple pattern: demographics in, behavior out. Give the model an age, income, neighborhood, job, ideology, or persona. Ask for a stance. Get fluent language back.

That can look useful. It can also be deeply misleading. The agent may produce a believable answer without representing how a person formed the belief, what assumptions support it, what would change it, or why a similar person might disagree.

The paper’s warning is blunt: if synthetic populations do not model thought, they can flatten public reasoning into average-sounding text.

The Idea in Plain English

The authors want social-simulation agents to carry belief structures, not just character descriptions.

A persona says: “I am a middle-income parent living near a proposed transit project.”

A belief model says: “This person connects transit access to commute time, commute time to family stress, density to school crowding, and public investment to neighborhood trust. If the policy increases transparency or improves school funding, the stance may change.”

That difference matters because social questions are causal. People do not merely hold opinions. They explain them, revise them, contradict themselves, and weigh tradeoffs through local experiences and values.

The paper calls this standard “reasoning fidelity.” An agent has reasoning fidelity when it can show a structured trace of how a belief was formed, revise that belief under a counterfactual change, and reuse pieces of reasoning across related situations.

In short: stop treating language as the simulation. Treat language as one interface to an underlying belief model.

What the Researchers Tested

This is a position paper, not a results paper with a new leaderboard.

The authors diagnose a failure mode in LLM social simulation, connect it to cognitive science, and propose two pieces of scaffolding:

  1. Generative Minds, or GenMinds: a modeling paradigm where agents represent belief formation through causal motifs and belief graphs.
  2. RECAP, or REconstructing CAusal Paths: an evaluation framework for testing whether agents can reconstruct, explain, and revise reasoning paths.

The paper uses social and policy examples, especially civic questions such as surveillance, housing, healthcare access, stakeholder modeling, and participatory policy design. The aim is not to prove that one architecture wins today. The aim is to define what a more faithful social simulator should be asked to model.

What They Found

Plausibility is the wrong target

The paper’s core critique is that output plausibility does not prove cognitive alignment.

An agent can say the kind of thing a stakeholder might say while having no stable model of the stakeholder’s assumptions. It can support a policy in one prompt, oppose a similar policy in another, and then produce a polished explanation either way.

That is not a small UX defect. In a social simulation, the explanation is part of the object being simulated. If the system cannot say why a belief changed, the simulation cannot be audited.

Chain-of-thought is not enough

The authors are skeptical that generated rationales solve the problem.

A chain-of-thought style explanation may be a post-hoc story assembled after the answer. It can make the model appear deliberative without giving users a faithful representation of the belief process that led to the answer.

The paper’s distinction is useful: form is not function. A rationale shaped like human deliberation is not the same as a belief structure that can be inspected, challenged, and updated.

For high-stakes simulations, the authors want agents to operate over explicit structures such as causal graphs, rather than relying only on fluent text continuation from opaque hidden states.

Synthetic groups can converge too easily

The paper also attacks the “multi-agent” version of the problem.

Putting many LLM agents in a room does not guarantee genuine diversity. If the agents share the same model priors, the group can drift toward a median, socially acceptable narrative. The result may look like consensus, but it can be a statistical artifact.

That matters for civic simulation. A synthetic town hall that converges smoothly may hide conflict rather than reveal it. It may replace hard disagreement with text that sounds reasonable to the model.

The authors call this an illusion of consensus.
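One rough way to notice this kind of collapse is to track how spread out agent stances are from round to round. The sketch below is not from the paper. It assumes each agent's stance has already been scored on a numeric scale from -1 (oppose) to 1 (support), and the names dispersion_by_round and collapsed, the sample scores, and the 0.5 threshold are all hypothetical.

```python
from statistics import pstdev

def dispersion_by_round(stances_by_round):
    """Population standard deviation of agent stance scores for each round."""
    return [pstdev(round_scores) for round_scores in stances_by_round]

def collapsed(stances_by_round, ratio=0.5):
    """Flag a run whose final-round spread is below `ratio` of its initial spread.

    The 0.5 default is an arbitrary illustration value, not a validated cutoff.
    """
    spread = dispersion_by_round(stances_by_round)
    return spread[0] > 0 and spread[-1] < ratio * spread[0]

# Hypothetical stance scores for four agents over three discussion rounds.
rounds = [
    [-1.0, -0.5, 0.5, 1.0],
    [-0.4, -0.2, 0.2, 0.4],
    [-0.1, 0.0, 0.0, 0.1],
]

print(dispersion_by_round(rounds))
print(collapsed(rounds))
```

A shrinking spread does not by itself show that the consensus is an artifact, since people also persuade each other. It only marks a run that deserves a closer look at why each agent moved.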

Demographic personas can flatten people

Demographic conditioning has a related risk.

If an agent is prompted as a member of a broad demographic group, it may reproduce the most common correlations in the training data. The output can become a stereotype or a generic average rather than a positioned individual.

The authors are not saying abstraction is always bad. Large-scale simulation needs abstraction. Their point is that abstraction without a grounded model of beliefs, values, experience, and institutional exposure can erase the very heterogeneity the simulation is supposed to study.

For buyers of synthetic research tools, this is the uncomfortable part. More personas do not automatically mean more diversity. A thousand averaged agents can still be one averaged model wearing a thousand labels.

Why It Happens

The paper traces the failure to the architecture of the task.

Most LLM simulations are optimized around text continuation and behavioral alignment. They are good at producing answers that fit the prompt. They are weaker at preserving persistent belief states, causal dependencies, and principled belief revision.

Human reasoning is messy, but it is not structureless. People can hold contradictions, but those contradictions often have a history: values in tension, experiences that do not reconcile, institutional distrust, local incentives, family obligations, social identity, fear, hope, or practical constraints.

An LLM can produce contradictions without any of that history.

The authors argue for “grounded incoherence” rather than perfect logic. A social agent does not need to be a tidy theorem prover. It needs to show how a person could arrive at a conflicted view, what keeps the conflict alive, and what would resolve or intensify it.

What This Means for Builders

The practical design move is to separate persona from belief structure.

GenMinds proposes a workflow where semi-structured interviews elicit people’s causal explanations in ordinary language. Those explanations are parsed into motifs, which are small reusable causal units. For example:

Transparency -> Crime rate -> Public safety

or:

Transparency -> Privacy concern -> Opposition to surveillance

The motifs are then composed into a causal belief network for an individual or group. An agent can simulate an intervention over that graph. In the paper’s surveillance example, raising transparency lowers modeled privacy concern and reduces opposition to surveillance.

The exact numbers in the paper’s example are illustrative, but the pattern is important. The agent’s answer is not just a sentence. It is the result of an inspectable belief graph.
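The motif-to-graph-to-intervention pattern can be sketched in a few lines. This is a minimal illustration, not the GenMinds implementation: it assumes a linear, acyclic graph with signed edge weights, and every weight is a made-up value. The signs on the privacy motif follow the article's description of the surveillance example, while the signs on the crime and safety motif are assumptions. The names compose and intervene are hypothetical.

```python
from collections import defaultdict, deque

# A motif is a short chain of signed causal links: (cause, effect, weight).
# All weights are hypothetical illustration values, not figures from the paper.
PRIVACY_MOTIF = [
    ("transparency", "privacy_concern", -0.6),
    ("privacy_concern", "opposition_to_surveillance", 0.8),
]
SAFETY_MOTIF = [
    ("transparency", "crime_rate", -0.3),   # sign assumed for illustration
    ("crime_rate", "public_safety", -0.7),  # sign assumed for illustration
]

def compose(*motifs):
    """Merge motifs into one belief graph: cause -> list of (effect, weight)."""
    graph = defaultdict(list)
    for motif in motifs:
        for cause, effect, weight in motif:
            graph[cause].append((effect, weight))
    return dict(graph)

def intervene(graph, node, delta):
    """Propagate a change of `delta` at `node` along every causal path.

    Assumes the graph is acyclic and that effects combine linearly.
    """
    effects = defaultdict(float)
    queue = deque([(node, delta)])
    while queue:
        current, change = queue.popleft()
        for effect, weight in graph.get(current, []):
            contribution = change * weight
            effects[effect] += contribution
            queue.append((effect, contribution))
    return dict(effects)

graph = compose(PRIVACY_MOTIF, SAFETY_MOTIF)
effects = intervene(graph, "transparency", 1.0)
for concept, change in effects.items():
    print(f"{concept}: {change:+.2f}")
```

Under these assumed weights, raising transparency pushes privacy concern and opposition to surveillance down, which matches the direction described above. The point of the sketch is that each change can be traced to a specific edge, so a reviewer can dispute a link instead of disputing a paragraph of generated text.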

For product builders, this suggests a different stack for social simulation:

  1. collect situated reasoning, not only labels or survey answers;
  2. extract causal motifs from explanations;
  3. preserve individual variation in belief graphs;
  4. simulate interventions over those graphs;
  5. evaluate whether the reasoning path, not only the final stance, matches human reasoning.
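
Step 5 is the one most easily skipped, so here is a small sketch of what scoring the path separately from the stance could look like. It is not RECAP's actual metric, which this article does not specify. It assumes a reasoning path is an ordered list of concept labels, and it uses Jaccard overlap of causal edges as a stand-in measure. The function names and the sample human and agent records are hypothetical.

```python
def edges(path):
    """Turn an ordered list of concepts into a set of directed causal edges."""
    return set(zip(path, path[1:]))

def path_overlap(human_path, agent_path):
    """Jaccard overlap between the causal edges of two reasoning paths."""
    human_edges, agent_edges = edges(human_path), edges(agent_path)
    if not human_edges and not agent_edges:
        return 1.0
    return len(human_edges & agent_edges) / len(human_edges | agent_edges)

def evaluate(human, agent):
    """Report stance agreement and reasoning-path agreement as separate scores."""
    return {
        "stance_match": human["stance"] == agent["stance"],
        "path_overlap": path_overlap(human["path"], agent["path"]),
    }

# Hypothetical records: same final stance, different route to it.
human = {
    "stance": "oppose",
    "path": ["transparency", "privacy_concern", "opposition_to_surveillance"],
}
agent = {
    "stance": "oppose",
    "path": ["transparency", "crime_rate", "opposition_to_surveillance"],
}

result = evaluate(human, agent)
print(result)
```

In this made-up case the agent matches the human's stance while sharing none of the human's causal edges, which is exactly the situation a stance-only evaluation would score as a success.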

That is slower than prompting personas. It is also much closer to the thing social simulation claims to do.

What This Means for Buyers and Operators

The buying question should change from “can this tool simulate users?” to “what internal structure is it simulating?”

If a vendor only shows fluent respondent quotes, synthetic survey tables, or multi-agent conversations, ask what persists underneath the text. Does the system maintain belief state across turns? Can it show the causal path from assumptions to stance? Can it explain why an intervention changed the answer? Can it preserve disagreement among similar people?

The paper is especially relevant for policy, civic planning, market research, and AI governance work. In those settings, a smooth synthetic consensus can become dangerous. It may give decision-makers the feeling that stakeholders have been consulted when the system has mainly generated plausible stakeholder-shaped language.

A good social simulator should make disagreement more legible, not easier to average away.

What to Watch Next

Benchmark shape: Watch whether RECAP-like evaluations become actual datasets or shared protocols, especially for civic and policy domains.

Graph extraction quality: Watch how well systems can extract causal motifs from messy interviews without forcing people into oversimplified diagrams.

Hybrid architectures: Watch for agent systems that combine LLM interfaces with explicit memory, causal graphs, probabilistic updates, and intervention testing.

Synthetic research vendors: Watch whether vendors expose reasoning traces and uncertainty, or keep selling persona output as if it were respondent truth.

Pluralistic alignment: Watch for evaluations that preserve multiple internally coherent views instead of rewarding one averaged answer.

Limitations and Caveats

This is a position paper. It offers a framework and research agenda rather than a completed benchmark result.

The causal-graph framing is useful, but it is not the whole of human reasoning. People also reason through analogy, association, emotion, habit, narrative identity, and social pressure. The authors acknowledge that causality is a tractable starting point, not the endpoint.

Extracting belief graphs from natural language is also hard. Concept boundaries are fuzzy. Causal direction can be ambiguous. People may revise what they mean while explaining it. A system that forces every belief into a clean graph can create a false sense of precision.

Still, the paper lands on a strong operating principle: if a synthetic society cannot show how its agents think, it should not be trusted as a model of society.

Source

Li, Chance Jiajie, Wu, Jiayi, Mo, Zhenze, Qu, Ao, Tang, Yuhan, Zhao, Kaiya Ivy, Gan, Yulu, Fan, Jie, Yu, Jiangbo, Zhao, Jinhua, Liang, Paul Pu, Alonso Pastor, Luis Alberto, and Larson, Kent. (2025). Simulating Society Requires Simulating Thought. NeurIPS 2025 Position Paper Track. Available at: https://openreview.net/forum?id=EvXWexakZX