The paper’s practical point: AI writing sounds different, and it also builds stories differently.

Source note: Jenna Russell, Rishanth Rajendhran, Chau Minh Pham, Mohit Iyyer, and John Wieting. “StoryScope: Investigating idiosyncrasies in AI fiction.” arXiv:2604.03136, dated April 3, 2026. https://arxiv.org/abs/2604.03136

Why This Paper Matters

Most AI-writing detection focuses on style. Does the text overuse specific words? Is the rhythm too smooth? Are there stock metaphors, symmetrical lists, or artificial transitions?

That approach works for now, but style is easy to edit. A model can be prompted away from obvious tells, or a human can revise the prose. Fine-tuning can also teach a system to avoid the fingerprints detectors currently track.

This paper asks a different question: Is AI fiction different beneath the prose?

The authors argue that AI-generated stories can be distinguished by narrative decisions: how plots unfold, how characters act, how time is structured, how meaning is revealed, and how much ambiguity the story tolerates.

This is relevant for detection, but the core issue is authorship. If AI fiction differs in how it constructs stories, rather than simply how it phrases sentences, “humanizing” the prose isn’t enough. Changing the deeper shape requires a structural rewrite.

The Idea in Plain English

The paper introduces StoryScope, a pipeline to turn fiction into interpretable narrative signals.

Instead of asking if a paragraph sounds AI-written, StoryScope asks questions like these:

Does the plot move in a straight chronological line? Does the narrator explain the moral? Are character choices morally ambiguous? Does the story use flashbacks? Are references to other works specific or vague? Does the story address the reader directly? Does emotion appear as direct naming or as bodily sensation?

These are not surface style checks; they describe how a story is built.

The paper’s main claim is that AI stories occupy a narrower region of narrative space. They tend to be tidier, more explicit, and more causally closed. Human stories vary more. They are more likely to use nonlinear time, unresolved tension, specific outside references, and morally mixed characters.

The AI tell goes deeper than the sentence; it is the shape of the story.

What the Researchers Tested

The authors built a parallel fiction corpus.

They started with 10,272 human-written short stories from Books3. For each human story, they used Gemini 2.5 Flash to infer a writing prompt, then asked five LLMs to write a story from that prompt: Gemini 3 Flash, Kimi K2.5, DeepSeek V3.2, Claude Sonnet 4.6, and GPT-5.4.

This produced six stories per prompt: one human story and five AI versions. The final corpus contains 61,608 stories, each averaging roughly 5,000 words.

StoryScope converts each story into a representation across 10 narrative dimensions: agents, social networks, events, plot, structure, setting, time, revelation, perspective, and style.

The feature discovery process follows five steps:

1. GPT-5.1 extracts a structured template from each story.
2. The system compares the six versions written from the same prompt.
3. GPT-5.1 proposes narrative measurements from those comparisons.
4. Overlapping measurements are deduplicated, leaving 304 interpretable signals.
5. Gemini 3 Flash applies those signals to the whole corpus.
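The discovery steps can be sketched as a small skeleton. This is an illustration under assumptions, not the authors' code: `PromptGroup`, `discover_and_annotate`, and the callables `extract`, `propose`, and `annotate` are hypothetical stand-ins for the LLM-backed steps, and `dedupe` uses plain text normalization as a placeholder because the source does not say how overlapping measurements are merged.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class PromptGroup:
    """One inferred prompt plus the stories written from it, keyed by source."""
    prompt: str
    stories: Dict[str, str]


def dedupe(candidates: List[str]) -> List[str]:
    """Placeholder for step 4: drop repeats after normalizing case and spacing."""
    seen, kept = set(), []
    for candidate in candidates:
        key = " ".join(candidate.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(candidate)
    return kept


def discover_and_annotate(
    groups: List[PromptGroup],
    extract: Callable[[str], dict],                         # step 1 (hypothetical)
    propose: Callable[[Dict[str, dict]], List[str]],        # step 3 (hypothetical)
    annotate: Callable[[str, List[str]], Dict[str, bool]],  # step 5 (hypothetical)
) -> Tuple[List[str], Dict[Tuple[str, str], Dict[str, bool]]]:
    candidates: List[str] = []
    for group in groups:
        # Steps 1-2: one template per story, grouped by shared prompt.
        templates = {src: extract(text) for src, text in group.stories.items()}
        # Step 3: propose measurements from the side-by-side templates.
        candidates.extend(propose(templates))
    signals = dedupe(candidates)  # step 4
    # Step 5: apply every surviving signal to every story.
    features = {
        (group.prompt, src): annotate(text, signals)
        for group in groups
        for src, text in group.stories.items()
    }
    return signals, features


# Toy run with stub callables in place of real model calls.
groups = [PromptGroup("p1", {"human": "Years earlier, she left.",
                             "ai": "She learned to accept it."})]
signals, features = discover_and_annotate(
    groups,
    extract=lambda text: {"words": len(text.split())},
    propose=lambda templates: ["Uses flashback", "uses  flashback", "States the moral"],
    annotate=lambda text, sigs: {s: "earlier" in text.lower() for s in sigs},
)
```

Passing the model-backed steps in as callables keeps the skeleton testable without any API access; the stubs here only show the data flow.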

The authors trained classifiers on the resulting vectors to test human-versus-AI detection and six-way authorship attribution. Crucially, the main narrative model removes style-related signals to detect authorship from structure alone.
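A minimal sketch of the "structure alone" idea: drop every signal tagged with the style dimension, then classify on what remains. The source does not name the classifier in this summary, so a nearest-centroid rule is swapped in here purely for brevity. The signal names, the `SIGNAL_DIMENSION` mapping, and the toy vectors are all invented for illustration.

```python
from typing import Dict, List, Sequence

# Hypothetical signal catalogue: signal name -> narrative dimension.
SIGNAL_DIMENSION = {
    "narrator_states_theme": "revelation",
    "has_subplots": "plot",
    "uses_flashbacks": "time",
    "stock_metaphor_density": "style",
}


def narrative_only(names: Sequence[str]) -> List[int]:
    """Indices of signals whose dimension is anything except 'style'."""
    return [i for i, name in enumerate(names) if SIGNAL_DIMENSION[name] != "style"]


def project(vec: Sequence[float], keep: Sequence[int]) -> List[float]:
    """Keep only the selected feature positions."""
    return [vec[i] for i in keep]


def fit_centroids(X: List[List[float]], y: List[str]) -> Dict[str, List[float]]:
    """Mean feature vector per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids: Dict[str, List[float]], vec: Sequence[float]) -> str:
    """Label of the closest centroid by squared Euclidean distance."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda label: sqdist(centroids[label], vec))


names = list(SIGNAL_DIMENSION)
keep = narrative_only(names)  # style column is excluded
X_raw = [[1, 0, 0, 0.9], [1, 0, 0, 0.2], [0, 1, 1, 0.8], [1, 1, 1, 0.1]]  # toy data
y = ["ai", "ai", "human", "human"]
centroids = fit_centroids([project(v, keep) for v in X_raw], y)
guess = predict(centroids, project([0, 1, 1, 0.95], keep))
```

Because the style column is removed before fitting, the last value in each toy vector has no effect on the prediction.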

What They Found

Narrative signals alone are highly predictive

The narrative-only model reaches 93.2% macro-F1 for human-versus-AI detection. This is only 2.8 points below the model using both narrative and style signals (96.0% macro-F1). The narrative-only version retains more than 97% of the performance of the combined model.

This is the headline result: Narrative structure carries nearly as much signal as structure plus style.

A smaller set of 30 core narrative signals still reaches 84.8% macro-F1. The human-AI boundary is concentrated in a compact set of narrative choices.
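For readers who want to check the arithmetic, here is a short sketch of how macro-F1 is computed and how the retention figures follow from the reported scores. The `macro_f1` helper and the six toy labels are illustrative; only the 96.0, 93.2, and 84.8 figures come from the text above.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one class from its true-positive, false-positive, false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1, so the minority class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


# Toy labels with the corpus's 1:5 human-to-AI ratio and one mistake.
y_true = ["human", "ai", "ai", "ai", "ai", "ai"]
y_pred = ["human", "ai", "ai", "ai", "ai", "human"]
toy_score = macro_f1(y_true, y_pred)  # human F1 = 2/3, ai F1 = 8/9

# Reported scores from the text (percent macro-F1).
combined, narrative, core_30 = 96.0, 93.2, 84.8
gap_points = round(combined - narrative, 1)   # 2.8
retained = narrative / combined               # about 0.971
core_vs_narrative = core_30 / narrative       # about 0.910
```

Macro averaging matters here because humans make up only one story in six; plain accuracy on the toy labels is 5/6, while macro-F1 is lower because the single error hurts the small human class more.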

AI stories over-explain meaning

AI stories are more explicit and more moralizing. Narrators explain the story’s theme 77% of the time in AI stories, compared with 52% for humans. AI dialogue also serves philosophical debate more often: 59% for AI versus 34% for humans.

The pattern is that AI tends to close interpretive space. It states the lesson and clarifies the arc, making the story’s meaning easier to extract but also making it feel overdetermined.

AI plots are tidier and more linear

AI stories show tighter causal chains and fewer subplots. The paper reports that 79% of AI stories have no subplots, compared with 57% of human stories.

AI resolutions are more protagonist-driven (69% versus 46%) and more likely to end with internal acceptance (47% versus 27%). Human stories are more likely to delay or fracture the narrative through time jumps, flashbacks, or ambiguous endings.

AI often creates stories that are easier to summarize than human stories.

AI overuses embodied emotion and environmental mirroring

AI overwhelmingly conveys emotion through physical sensations and bodily metaphors (81% for AI versus 38% for humans). AI also uses smell-based imagery more often and leans harder on setting as a mirror of character interiority.

This is a craft default. The model has learned that “show, don’t tell” often means tightening throats, cold skin, or weather that echoes mood. Human authors are more willing to name emotions directly (29% of human stories versus 8% of AI stories).

So AI text is not uniformly too explicit. At the level of theme it over-explains; at the sentence level it can be too indirect, in a formulaic way.

Human stories engage more with the outside world

Human stories reference specific texts and authors at nearly double the AI rate: 47% versus 24%. Humans are also more likely to address the reader directly (28% versus 7%).

AI tends toward vague allusion. Human writing names things. Specificity places a story in a social and cultural world, making it less interchangeable.

Human stories are rarer in narrative space

The authors measured rarity by looking at a story’s distance from its neighbors in the feature space.
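One common way to turn "distance from neighbors" into a score is the mean distance to the k nearest stories, followed by a percentile rank. The sketch below assumes binary signal vectors, Hamming distance, and k = 2; the source summary does not specify the distance metric, the value of k, or the percentile definition, so treat every one of those choices as hypothetical.

```python
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of signals on which two stories disagree."""
    return sum(x != y for x, y in zip(a, b))


def knn_rarity(vectors: List[Sequence[int]], k: int = 2) -> List[float]:
    """Mean distance from each story to its k nearest neighbors (higher = rarer)."""
    scores = []
    for i, v in enumerate(vectors):
        dists = sorted(hamming(v, w) for j, w in enumerate(vectors) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def percentile_ranks(scores: List[float]) -> List[float]:
    """Fraction of stories whose rarity score is at or below each story's score."""
    n = len(scores)
    return [sum(other <= s for other in scores) / n for s in scores]


# Toy feature vectors: three similar stories and one outlier.
stories = [
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
]
rarity = knn_rarity(stories, k=2)
ranks = percentile_ranks(rarity)
```

In the toy data the last story disagrees with its neighbors on most signals, so it gets the highest rarity score and the top percentile rank.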

Human stories have a higher mean rarity percentile than AI stories (0.71 versus 0.49). They are also overrepresented in the rarest tail: 24.7% of human stories fall in the top 10% rarest stories, compared with 7.1% of AI stories.

The human story is the rarest of the six versions 57.8% of the time, well above the roughly 16.7% expected if all six versions were equally likely to be the rarest. Human stories are more dispersed and more likely to make unusual combinations of narrative choices.
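The rarity figures can be sanity-checked with a few lines of arithmetic. This sketch assumes the top 10% is defined over the pooled corpus and that humans make up roughly one sixth of it (one human story and five AI versions per prompt); the variable names are illustrative.

```python
HUMAN_SHARE = 1 / 6   # one human story per prompt group
AI_SHARE = 5 / 6      # five AI versions per prompt group

human_in_top_decile = 0.247
ai_in_top_decile = 0.071

# If the two rates are consistent, the pooled share should land near 10%.
pooled_top_decile = HUMAN_SHARE * human_in_top_decile + AI_SHARE * ai_in_top_decile

# How over-represented humans are in the rarest tail, relative to a flat 10%.
over_representation = round(human_in_top_decile / 0.10, 2)

# How often the human story is rarest of six, relative to a uniform 1-in-6 chance.
rarest_lift = round(0.578 / (1 / 6), 2)
```

The pooled share comes out at about 10.0%, so the two reported tail rates are mutually consistent under these assumptions.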

The models have fingerprints, but they still cluster together

Narrative signals alone reach 68.4% macro-F1 in six-way authorship attribution (human plus the five models). Human writing is the easiest to separate.

Among AI models, Claude and GPT are the most distinctive. Claude has flatter event escalation; GPT leans into gossip and social framing. Gemini, DeepSeek, and Kimi sit closer together.

The larger pattern is convergence. The five AI sources overlap with each other more than they overlap with human stories. Different models have fingerprints, but they share a recognizably AI-shaped narrative region.

Why It Happens

The mechanism is optimization toward plausible coherence. LLMs are trained to produce text that looks like a “good” answer. In fiction, that means resolving the premise cleanly, making arcs legible, and avoiding choices that might look unsupported.

But fiction needs more than coherence. Human stories often get power from imbalance: a subplot that does not resolve, a strange reference, a jagged time structure, or an ending that withholds closure. AI systems gravitate toward tidy meaning, making the story easier to parse and harder to surprise.

What This Means for Builders

Builders of writing tools should treat narrative structure as a first-class editing surface.

If a tool only rewrites sentences, it may remove obvious AI style while leaving narrative defaults intact. A stronger tool would inspect plot shape, time structure, character agency, and ambiguity.

The goal shouldn’t be to “make this sound human.” Instead, help the author make intentional narrative choices: Is the story too thematically explicit? Does every scene serve one tidy arc? Does the story give the reader room to infer?
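As a rough sketch of what such a diagnostic could look like, the function below compares a draft's annotated narrative choices against the AI and human rates quoted earlier in this article and lists the choices that lean toward the AI-typical default. The signal names, the `ai_leaning_choices` helper, and the draft annotations are hypothetical, and how a tool would obtain those annotations is left open.

```python
from typing import Dict, List

# (AI rate, human rate) for a few signals, as reported in this article.
REPORTED_RATES = {
    "narrator_explains_theme": (0.77, 0.52),
    "no_subplots": (0.79, 0.57),
    "emotion_as_bodily_sensation": (0.81, 0.38),
    "references_specific_works": (0.24, 0.47),
    "addresses_reader_directly": (0.07, 0.28),
}


def ai_leaning_choices(draft: Dict[str, bool]) -> List[str]:
    """Signals where the draft matches the AI-typical choice, largest gap first."""
    flagged = []
    for name, (ai_rate, human_rate) in REPORTED_RATES.items():
        if name not in draft:
            continue  # unannotated signals are skipped, not guessed
        ai_typical_value = ai_rate > human_rate  # True if presence is more common in AI
        if draft[name] == ai_typical_value:
            flagged.append((name, abs(ai_rate - human_rate)))
    flagged.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in flagged]


# Hypothetical annotations for one draft.
draft = {
    "narrator_explains_theme": True,
    "no_subplots": False,
    "emotion_as_bodily_sensation": True,
    "references_specific_works": False,
}
to_review = ai_leaning_choices(draft)
```

The output is a list of prompts for the author to reconsider, not a verdict: any single choice here is common in human fiction too.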

For detection products, narrative signals work best as one line of evidence, not as a standalone verdict.

What This Means for Buyers and Operators

Publishers and review platforms should look beyond surface detectors. An editor can remove prose tells to pass a style check without changing the underlying construction. StoryScope suggests that deeper narrative analysis is harder to evade because it requires changing the story itself.

The useful question is not “does this text contain AI words?” It is “does this work show a human range of narrative choices?” This is especially relevant in markets where readers expect original, human-authored work.

What to Watch Next

Model evolution: Watch whether narrative detectors hold up on newer generations and different genres.

Adversarial rewrites: Watch for structural rewrites designed to defeat StoryScope.

Collaboration cases: A story may be human-plotted and AI-drafted. Binary labels will not be enough.

Writer diagnostics: Watch for tools that expose narrative diagnostics directly to writers for better editing, rather than just better policing.

Legal and marketplace use: A detector measuring narrative rarity can inform originality debates, but it shouldn’t be a blunt proxy for copyright judgment.

Limitations and Caveats

This is a preprint under review. The dataset depends on Books3, which carries copyright controversy.

The prompts are reverse-engineered from human stories. While this makes the comparison clean, the setup differs from natural creative writing.

The feature pipeline is model-mediated. GPT-5.1 and Gemini 3 Flash perform the annotation, so measurements depend partly on LLM annotation quality.

Finally, high detection performance in a controlled corpus is not the same as reliable judgment in the wild, where real texts may involve translation, ghostwriting, or partial AI assistance.

With those caveats in mind, the practical takeaway is that AI fiction has a recognizable narrative geometry.

Source

Russell, Jenna, Rajendhran, Rishanth, Pham, Chau Minh, Iyyer, Mohit, and Wieting, John. (2026). StoryScope: Investigating idiosyncrasies in AI fiction. arXiv preprint arXiv:2604.03136. Available at: https://arxiv.org/abs/2604.03136