LLMs can reproduce human purchase-intent surveys much better when they explain their reaction in free text before that reaction is mapped back to a rating scale.
Source note: Benjamin F. Maier, Ulf Aslak, Luca Fiaschi, Nina Rismal, Kemble Fletcher, Christian C. Luhmann, Robbie Dow, Kli Pappas, and Thomas V. Wiecki. “LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings.” arXiv:2510.08338, October 9, 2025. https://arxiv.org/abs/2510.08338
Why This Paper Matters
Consumer research is a global industry that costs corporations billions of dollars every year. Before a company invests heavily in manufacturing, marketing, and launching a new product, it needs to know whether actual people will open their wallets to buy it. Historically, this has required assembling large panels of human respondents and asking them to rate their purchase intent on a Likert scale. A standard Likert scale asks consumers to rate their likelihood of buying a product from one to five, with one being “definitely not” and five being “definitely yes.”
However, traditional human panels have well-documented flaws. Human respondents suffer from survey fatigue. They often exhibit satisficing behavior, where they simply click through the middle options to finish the survey and get paid. They also demonstrate acquiescence bias and positivity bias, tending to rate products higher than their real-world purchasing behavior would justify. The data is noisy, expensive to collect, and slow to accumulate.
The rise of large language models presented a seemingly perfect solution. Researchers and companies realized they could create synthetic consumers by prompting artificial intelligence models with specific demographic personas and asking them the exact same survey questions. Unfortunately, early attempts hit a major roadblock. When asked to provide a direct numerical rating on a Likert scale, language models produced highly unrealistic distributions. They tended to cluster their answers entirely in the middle of the scale, actively avoiding the extreme ends. This resulted in distributions that looked nothing like real human populations.
This paper matters because it offers evidence that the problem lies not in the language model itself but in the way researchers were asking the question. By changing the elicitation method from a direct number request to a free-text response that is mapped to the scale mathematically, the researchers obtained a much more accurate simulation of human purchase-intent survey results.
The Idea in Plain English
Imagine you ask a friend if they want to go to a specific restaurant. If you force them to reply with a strict number between one and five, they might just shrug and say “three” to avoid making a strong commitment. But if you let them speak naturally, they might say, “I am somewhat interested, and if the food looks good and is not too expensive, I might give it a try.” You can intuitively understand that this statement leans positive but still contains reservations. It feels like a solid “four” on a five point scale.
The Semantic Similarity Rating method applies this exact logic to language models. Instead of forcing the artificial intelligence to output a single integer, the system asks the model to write a short paragraph explaining its thoughts on the product. Once the text is generated, the system uses a separate embedding model to convert that text into a mathematical vector. This vector is then compared against five predefined anchor statements. These anchor statements are carefully crafted to represent the five points on the Likert scale.
The system calculates the cosine similarity between the generated response and each of the five anchor statements. If the generated text aligns most closely with the anchor statement for a rating of “four”, the system assigns the highest probability to the number four. By applying this semantic mapping across hundreds of simulated personas, researchers can build a probability distribution that mirrors the messy, nuanced reality of human consumer populations. The core idea is that language models are much better at generating rich text than at choosing a single number that is statistically representative of a population. Letting the models do what they do best and handling the numerical mapping externally largely resolves the distribution problem.
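The mapping described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the two-dimensional `toy_vector` values stand in for real embeddings, and the normalisation rule (subtract the smallest similarity, then rescale to sum to one) is an assumption chosen for simplicity.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarities_to_pmf(response_vec, anchor_vecs):
    """Turn anchor similarities into a probability for each rating 1..5.

    Assumption: subtract the smallest similarity, then normalise to sum to 1.
    """
    sims = [cosine_similarity(response_vec, anchor) for anchor in anchor_vecs]
    floor = min(sims)
    shifted = [s - floor for s in sims]
    total = sum(shifted)
    if total == 0:  # all anchors equally similar: fall back to uniform
        return [1.0 / len(sims)] * len(sims)
    return [s / total for s in shifted]


def toy_vector(degrees):
    """Hypothetical stand-in for an embedding: a unit vector at a given angle."""
    radians = math.radians(degrees)
    return [math.cos(radians), math.sin(radians)]


# Toy anchors for ratings 1..5, spread evenly over a quarter circle.
anchor_vecs = [toy_vector(angle) for angle in (0.0, 22.5, 45.0, 67.5, 90.0)]
# Toy response that sits closest to the anchor for rating 4.
response_vec = toy_vector(65.0)

pmf = similarities_to_pmf(response_vec, anchor_vecs)
most_likely_rating = pmf.index(max(pmf)) + 1
```

In a real pipeline the vectors would come from an embedding model, and the output is a full distribution over the five ratings rather than a single integer.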
What the Researchers Tested
The research team, composed of scientists from PyMC Labs and the Colgate-Palmolive Company, designed an empirical test using a large proprietary dataset. They analyzed fifty seven distinct consumer research surveys focusing on personal care product concepts. Each of these surveys had previously been administered to between 150 and 400 human participants. This yielded a total of 9,300 unique human responses, providing a robust baseline of real-world consumer data.
To test the synthetic consumer approach, the team instantiated simulated personas using both GPT-4o and Gemini 2.0 Flash. They prompted the models with specific demographic attributes mapped directly from the human respondents, including age, gender, location, and income level. The synthetic consumers were then shown the exact same product concept images and text descriptions that the human panels evaluated.
The researchers systematically tested three distinct methods for eliciting a purchase intent rating. The first method was the Direct Likert Rating. In this baseline approach, the model was simply instructed to reply with a single integer between one and five. The second method was the Follow-up Likert Rating. Here, the model was asked to write a brief textual response about its purchase intent. A new instance of the model was then prompted to act as a Likert rating expert and instructed to assign a numeric value to the generated text.
The third method was the novel Semantic Similarity Rating. In this approach, the model generated a free text response. That response was then fed into the OpenAI text embedding model to retrieve its vector representation. This vector was mathematically compared to multiple sets of reference statements representing the five points on the Likert scale.
To benchmark the artificial intelligence approaches against traditional machine learning, the team also trained a LightGBM classifier on the demographic and product feature data. They split the fifty seven surveys in half, training the model on one half and testing it on the other, to see if a standard supervised algorithm could predict the survey outcomes better than the generative models.
What They Found
The empirical results showed a large gain in accuracy when moving away from direct numerical elicitation. To evaluate success, the researchers looked at two primary metrics. The first was distributional similarity, a score derived from the Kolmogorov-Smirnov distance between the synthetic and human rating distributions, where higher values mean the shape of the synthetic rating curve matches the human one more closely. The second was correlation attainment, which measures how well the synthetic panels ranked the overall appeal of the fifty seven different products compared to the human baseline.
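To make the first metric concrete, here is a minimal sketch of a Kolmogorov-Smirnov style similarity between two five-point rating distributions. It assumes the score is one minus the largest gap between the two cumulative distributions; the exact definition used in the paper may differ, and the example distributions are made up for illustration.

```python
def ks_similarity(pmf_a, pmf_b):
    """One minus the largest gap between the two cumulative distributions.

    Assumption: both inputs are probabilities over the same ordered
    rating scale (for example 1..5) and each sums to 1.
    """
    cdf_a = 0.0
    cdf_b = 0.0
    max_gap = 0.0
    for p_a, p_b in zip(pmf_a, pmf_b):
        cdf_a += p_a
        cdf_b += p_b
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return 1.0 - max_gap


# Illustrative (made-up) distributions over ratings 1..5.
human_like = [0.10, 0.20, 0.30, 0.25, 0.15]
collapsed_on_three = [0.0, 0.0, 1.0, 0.0, 0.0]

score_identical = ks_similarity(human_like, human_like)
score_collapsed = ks_similarity(human_like, collapsed_on_three)
```

A distribution that puts all of its mass on the midpoint scores well below one against a spread-out human distribution, which is the failure pattern the direct rating method exhibits.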
Under the baseline Direct Likert Rating method, both language models failed spectacularly at matching the distribution shape. GPT-4o achieved a distributional similarity score of just 0.26, while Gemini 2.0 Flash achieved 0.39. A detailed look at the data revealed exactly why this happened. The models exhibited a massive regression to the mean. They almost exclusively output the number three, completely failing to utilize the extreme ends of the one to five scale.
The Semantic Similarity Rating method completely reversed this failure. Distributional similarity skyrocketed to 0.88 for GPT-4o and 0.80 for Gemini 2.0 Flash. By mapping the text to the scale, the researchers recovered the natural spread of human opinions. The synthetic distributions effectively mirrored the real world survey results.
When looking at correlation attainment, the Semantic Similarity Rating approach proved highly reliable. Because human survey data contains inherent noise, the researchers calculated a maximum theoretical correlation based on simulated test and retest reliability. The Semantic Similarity Rating method achieved roughly ninety percent of that maximum theoretical correlation. This means the synthetic consumers ranked the fifty seven product concepts in almost the exact same order of preference as the human panels.
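One plausible reading of correlation attainment is sketched below: the correlation between synthetic and human mean ratings across surveys, divided by a ceiling estimated from repeatedly splitting the human respondents in half. The function names, the number of splits, and the use of Pearson correlation on per-survey means are assumptions for illustration, not details confirmed by the source.

```python
import math
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def split_half_ceiling(responses_by_survey, n_splits=200, seed=0):
    """Estimate the test-retest ceiling by correlating random human halves.

    responses_by_survey: one list of 1..5 ratings per survey.
    """
    rng = random.Random(seed)
    correlations = []
    for _ in range(n_splits):
        first_means, second_means = [], []
        for responses in responses_by_survey:
            shuffled = list(responses)
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            first_means.append(mean(shuffled[:half]))
            second_means.append(mean(shuffled[half:]))
        correlations.append(pearson(first_means, second_means))
    return mean(correlations)


def correlation_attainment(synthetic_means, human_means, ceiling):
    """Share of the achievable correlation that the synthetic panel reaches."""
    return pearson(synthetic_means, human_means) / ceiling
```

Dividing by the ceiling matters because even a second human panel would not correlate perfectly with the first, so a raw correlation understates how close the synthetic panel is to the best achievable result.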
The traditional machine learning baseline, the LightGBM classifier, failed to match the generative approach. It achieved a correlation attainment of only sixty five percent, which suggests that the language models were drawing on a semantic understanding of the product concepts rather than only on patterns in the demographic data.
Furthermore, the synthetic consumers replicated human sensitivities regarding specific demographic traits. Synthetic personas prompted with low income levels expressed significantly lower purchase intent, mirroring human budgetary constraints. Personas prompted with middle-aged demographics showed higher purchase intent than younger or older personas, tracing the concave age curve found in the real-world data.
Why It Happens
The stark difference in performance between direct integer generation and textual semantic mapping comes down to the fundamental architecture of large language models. These models are autoregressive text engines trained to predict the most likely next token based on massive datasets. They do not possess an internal numerical calculator or an inherent understanding of discrete statistical scales.
A plausible explanation is that, when a language model is forced to output a single integer on a Likert scale, it gravitates toward the safest, most representative token. In the context of a five-point scale, the token representing the number three is statistically the safest, most neutral response. Without strong textual context pushing it in one direction, the model rarely commits to a one or a five. This pull toward the midpoint collapses the distribution onto a narrow peak and removes the variance needed for accurate market research.
Conversely, when a language model is allowed to generate natural language, it draws upon a vast, highly nuanced latent space of opinions, consumer reviews, and human expressions. If a product concept features high-end, expensive branding, a model prompted with a low-income persona can easily generate text indicating the product is too premium for its budget. The embedding step then measures how close that specific phrasing lies to the anchor statement for a low rating.
Embeddings operate in a continuous, high dimensional vector space. This allows the Semantic Similarity Rating method to capture granular degrees of sentiment that a rigid integer selection simply cannot accommodate. The transformation from text to continuous embedding, and then to a probability distribution, bypasses the token prediction bottleneck entirely. It forces the model to express a detailed opinion first, and delegates the statistical categorization to a deterministic mathematical function.
What This Means for Builders
For engineers and developers building synthetic data pipelines, this research provides a definitive blueprint for constructing simulated audiences. The primary takeaway is a strict architectural rule. You should never prompt a language model to directly output a numerical rating if you need a realistic population distribution. Instead, you must decouple the generation phase from the evaluation phase.
Builders should construct pipelines where the primary language model is treated solely as a qualitative generation engine. The output of this stage must be raw text. The second stage of the pipeline must be an embedding and mapping layer. You will need to implement cosine similarity logic to map the generated text against carefully constructed anchor statements. This requires integrating a robust embedding model, such as the OpenAI text embedding models used in the study, to handle the semantic translation reliably.
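The two-stage structure described above might look like the following sketch. The `generate_text` and `embed` callables are hypothetical placeholders for a language model call and an embedding model call. The anchor wording, the bag-of-words `toy_embed`, and the canned `fake_generate` reply are invented purely so the example runs without any external service.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rate_with_ssr(
    prompt: str,
    anchors: List[str],
    generate_text: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> Tuple[str, List[float]]:
    """Stage 1: free-text generation. Stage 2: embedding and mapping."""
    text = generate_text(prompt)                      # qualitative stage
    response_vec = embed(text)                        # evaluation stage starts here
    sims = [cosine(response_vec, embed(a)) for a in anchors]
    floor = min(sims)
    shifted = [s - floor for s in sims]               # assumed normalisation rule
    total = sum(shifted)
    pmf = [s / total for s in shifted] if total else [1.0 / len(sims)] * len(sims)
    return text, pmf


# Hypothetical anchor statements for ratings 1..5 (not taken from the paper).
ANCHORS = [
    "I would definitely not buy it",
    "I would probably not buy it",
    "I might or might not buy it",
    "I would probably buy it",
    "I would definitely buy it",
]
VOCAB = ["definitely", "probably", "not", "might", "buy"]


def toy_embed(text: str) -> List[float]:
    """Stand-in for an embedding model: counts of a few keywords."""
    words = text.lower().replace(",", " ").split()
    return [float(words.count(term)) for term in VOCAB]


def fake_generate(prompt: str) -> str:
    """Stand-in for a language model call: returns a canned reply."""
    return "It looks good, I would probably buy it"


reply, rating_pmf = rate_with_ssr("Persona and concept go here", ANCHORS, fake_generate, toy_embed)
```

Passing the model and embedder in as callables keeps the mapping layer testable on its own and makes it straightforward to swap providers. The free-text `reply` is returned alongside the distribution so it can be kept for qualitative analysis.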
The construction of the reference anchor statements is a critical engineering task. Builders cannot rely on a single set of anchors. The researchers found success by using multiple sets of varying statements for each point on the scale and averaging the results. This averaging process mitigates the risk of a single poorly phrased anchor skewing the entire distribution.
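Averaging over anchor sets can be sketched as follows. The similarity values are made up, and both the per-set normalisation and the plain element-wise mean are assumptions about how the averaging could be done rather than the study's exact procedure.

```python
def normalise(similarities):
    """Shift by the smallest similarity and rescale so the values sum to 1."""
    floor = min(similarities)
    shifted = [s - floor for s in similarities]
    total = sum(shifted)
    if total == 0:
        return [1.0 / len(similarities)] * len(similarities)
    return [s / total for s in shifted]


def average_over_anchor_sets(similarities_per_set):
    """Compute one distribution per anchor set, then average them rating by rating."""
    pmfs = [normalise(sims) for sims in similarities_per_set]
    return [sum(column) / len(pmfs) for column in zip(*pmfs)]


def expected_rating(pmf):
    """Mean rating implied by a distribution over ratings 1..len(pmf)."""
    return sum((index + 1) * p for index, p in enumerate(pmf))


# Made-up similarities of one response to two different anchor sets.
set_a = [0.2, 0.4, 0.6, 0.8, 0.6]
set_b = [0.1, 0.3, 0.7, 0.5, 0.3]

averaged = average_over_anchor_sets([set_a, set_b])
mean_rating = expected_rating(averaged)
```

Here the two anchor sets disagree about whether the response reads as a three or a four, and the averaged distribution splits the difference instead of committing to either phrasing.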
Additionally, builders have access to new tunable parameters. The researchers introduced mathematical levers, including an epsilon offset to manage minimum similarities and a temperature parameter to control how smeared out the probability mass function becomes. Engineers can treat these variables as hyperparameters, running optimization scripts to find a good calibration for their specific domain or product category.
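The sketch below shows one way these two levers could act on the similarity scores. The specific form is an assumption: epsilon is added back to the rating with the smallest similarity so it does not end up with exactly zero probability, and the temperature raises each probability to the power of one over the temperature before renormalising. The function name and defaults are illustrative.

```python
def shape_pmf(similarities, epsilon=0.0, temperature=1.0):
    """Convert anchor similarities to a distribution with two tunable levers.

    epsilon: small mass returned to the least similar rating (assumed form).
    temperature: values above 1 smear the distribution out, values below 1 sharpen it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    floor = min(similarities)
    lowest = similarities.index(floor)
    raw = [s - floor for s in similarities]
    raw[lowest] += epsilon
    tempered = [p ** (1.0 / temperature) for p in raw]
    total = sum(tempered)
    if total == 0:  # all similarities equal and epsilon is zero
        return [1.0 / len(similarities)] * len(similarities)
    return [p / total for p in tempered]


sims = [0.2, 0.4, 0.6, 0.8, 0.6]          # made-up similarities for ratings 1..5
baseline = shape_pmf(sims)
with_offset = shape_pmf(sims, epsilon=0.1)
smeared = shape_pmf(sims, temperature=2.0)
sharpened = shape_pmf(sims, temperature=0.5)
```

Treating `epsilon` and `temperature` as hyperparameters means fitting them against held-out human distributions, for example by maximising a distributional similarity score.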
Finally, this architecture provides a massive secondary benefit for product teams. By retaining the intermediate text generation step, builders can feed those textual responses into summarization models or qualitative analysis dashboards. The system produces a highly accurate statistical graph while simultaneously generating thousands of distinct, persona driven explanations for exactly why the concepts succeeded or failed.
What This Means for Buyers and Operators
For product managers, marketing executives, and consumer insights teams, this methodology fundamentally alters the economics of concept testing. Traditional market research is a slow, expensive bottleneck. Commissioning a panel of hundreds of respondents to test a single product idea can take weeks and consume a significant portion of a research budget. Because of this friction, companies are forced to internally narrow down their ideas to just two or three concepts before bringing them to actual consumers.
Synthetic panels powered by Semantic Similarity Rating greatly reduce this bottleneck. Operators can test fifty or a hundred different product variations over a single weekend for a fraction of the cost. You can iterate quickly on pricing, feature sets, and messaging. By treating the synthetic panel as a preliminary screening tool, you can reserve your expensive human research budget for the top-performing concepts that survive the synthetic screening.
Furthermore, the qualitative feedback provided by synthetic consumers often exceeds the quality of human responses. Real respondents participating in paid online surveys frequently suffer from fatigue. When asked to explain their ratings in open text boxes, they often leave short, unhelpful comments like “it is fine.” In contrast, the synthetic consumers in this study provided detailed rationales for their scores. They pointed out specific ingredient concerns, pricing issues, and usability flaws based on the personas they were adopting.
Operators should view this technology not as a complete replacement for human testing, but as a powerful directional signal. It allows teams to fail faster and identify winning value propositions much earlier in the product lifecycle. The ability to simulate the age, income, and demographic breakdown of your target audience and get fast, quantitative feedback can substantially increase the velocity of product development.
What to Watch Next
The immediate next step for this technology is expanding beyond purchase intent. The researchers briefly tested the method on concept relevance and found similar success. We should expect to see this framework applied across all standard market research dimensions, including brand trust, pricing elasticity, and customer satisfaction metrics.
Another area of rapid development will be dynamic anchor generation. Currently, human researchers must manually write the reference statements for the embedding comparison. Future iterations will likely use the language models themselves to dynamically generate highly specialized anchor statements based on the specific product category being tested, removing human bias from the mapping layer.
We will likely also see the rise of hybrid methodologies. While this study achieved strong results using zero-shot prompting without any training data, combining Semantic Similarity Rating with light fine-tuning on historical survey data could push the correlation attainment even higher. Expect enterprise software vendors to begin offering synthetic survey platforms that calibrate themselves against a company’s past human panel results.
Limitations and Caveats
Despite the impressive results, this methodology carries several critical limitations that operators must understand before deploying it in production. The most significant vulnerability is the strict dependency on the anchor reference statements. The mathematical mapping relies entirely on the quality of these anchors. If the anchor statements are poorly written or do not accurately reflect the linguistic nuances of the target demographic, the entire distribution will be wildly skewed.
Demographic replication is also imperfect. While the models accurately simulated the behavioral differences associated with age and income, they failed to consistently replicate the nuances of gender and regional geography. A synthetic consumer prompted to be from the Midwest did not respond significantly differently than one from the Northeast, despite real world data showing varied preferences across those regions. Builders cannot assume that prompting a persona perfectly replicates all intersectional human behaviors.
The method is strictly bounded by the knowledge contained within the language model’s training data. The study focused on personal care products, a category with millions of reviews, blog posts, and forum discussions available online. The models have deep latent knowledge of how humans talk about toothpaste and body wash. If a company attempts to use this method for a fundamentally novel technology or a highly obscure business category, the language model will likely hallucinate and produce invalid preference distributions.
Finally, simulated purchase intent lacks real world physical constraints. A synthetic consumer does not have a real bank account, nor does it face the physical constraints of choosing a product on a crowded retail shelf. The simulated ratings measure semantic appeal, not guaranteed physical action.
Source
Benjamin F. Maier, Ulf Aslak, Luca Fiaschi, Nina Rismal, Kemble Fletcher, Christian C. Luhmann, Robbie Dow, Kli Pappas, and Thomas V. Wiecki. (2025). LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings. arXiv preprint arXiv:2510.08338. Available at: https://arxiv.org/abs/2510.08338