The Cut-and-Paste Monstrosity You Might Already Be Reading
What Frankentexts reveal about AI writing, content attribution, and the limitations of current detection technologies.
Imagine trying to tell a brand-new story, but 90% of your words have to be borrowed from other stories already written by someone else. You can’t rewrite them, you can’t paraphrase, and you can’t cite them with footnotes. You have to make it all feel seamless, coherent, and entirely relevant to a new prompt.
This is the central problem addressed by a recent research paper (from the University of Maryland) called Frankentexts, and it’s more than a quirky writing challenge. It hits at the heart of two fast-emerging issues in today’s AI-saturated business environment: how to control AI-generated content when you don’t want it to write freely, and how to detect when AI is involved in content that looks and feels human.
Most of the conversation around generative AI, especially large language models (LLMs) like OpenAI’s GPT-4 or Google’s Gemini, focuses on how these tools generate original content from scratch. But in the real world, that’s not always the goal. In many industries (publishing, legal, education, marketing, etc.), the more pressing need is for AI to repurpose, reuse, or reframe existing content while still delivering something tailored to a new context.
This creates a complex tension: you want the AI to pull from trusted human-written sources, but not to go rogue or inject obviously synthetic prose. At the same time, it needs to stay on-message, answer a new prompt, and produce something coherent. To complicate matters further, today’s best AI-text detectors often can’t tell the difference between a human-written paragraph and a well-stitched hybrid (what the researchers call a “Frankentext”).
The research isn’t just about long-form writing; it’s about how we manage, detect, and judge AI-assisted content when it mimics human authorship so closely that even AI detectors get fooled. To test whether today’s LLMs can actually perform under such tight constraints, the researchers built a two-stage process that uses existing, publicly available models—no special training or fine-tuning required. The innovation lies in how the models are instructed and managed through prompts.
- Drafting with a copy constraint: The process starts with a creative writing prompt, for example, “Write a story about a world where dreams can be traded like currency.” The model is then given a large batch of existing human-written story snippets (sourced from a public fiction dataset) and instructed to use these snippets as raw material. Crucially, it must generate a full-length story where 90% of the words come verbatim from these human-authored passages. The remaining 10% (the glue that holds it all together) is written by the model to create continuity, transition, and logical flow.
- Iterative polishing for coherence: Once the initial draft is assembled, the same model is asked to polish the story for coherence, fixing inconsistencies, awkward transitions, and mismatched grammar. But it must still honor the original constraint: the majority of the text must remain copied. This polishing step may be repeated several times until the model’s edits converge on a version that feels narratively consistent without introducing too much of its own language (a sketch of the full two-stage loop follows this list).
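To make the two-stage process concrete, here is a minimal Python sketch of how such a pipeline could be wired together using prompts alone. The `call_llm` helper, the prompt wording, the number of polishing rounds, and the convergence check are illustrative assumptions, not the authors’ exact implementation.

```python
# Minimal sketch of the two-stage Frankentext loop described above.
# `call_llm` is a hypothetical stand-in for whatever LLM client you use;
# the prompt wording paraphrases the constraints rather than quoting the paper.

COPY_RATIO = 0.90  # fraction of words that must be copied verbatim


def call_llm(prompt: str) -> str:
    """Stand-in for a call to an LLM provider (OpenAI, Google, Anthropic, etc.)."""
    raise NotImplementedError("wire this up to your model provider's API")


def draft_frankentext(writing_prompt: str, snippets: list[str]) -> str:
    """Stage 1: assemble a draft that reuses the human snippets almost verbatim."""
    instructions = (
        f"Write a story for this prompt: {writing_prompt}\n"
        f"At least {int(COPY_RATIO * 100)}% of the words must be copied verbatim "
        "from the passages below; write only the minimal connective text needed "
        "to make the story coherent.\n\n" + "\n---\n".join(snippets)
    )
    return call_llm(instructions)


def polish_frankentext(writing_prompt: str, story: str, max_rounds: int = 3) -> str:
    """Stage 2: iteratively smooth the draft while restating the copy constraint."""
    for _ in range(max_rounds):
        revision = call_llm(
            "Revise this story for coherence (transitions, grammar, consistent "
            f"characters) while keeping at least {int(COPY_RATIO * 100)}% of the "
            f"current wording intact.\n\nPrompt: {writing_prompt}\n\nStory:\n{story}"
        )
        if revision.strip() == story.strip():  # edits have converged
            break
        story = revision
    return story
```

The key design point is that the constraint lives entirely in the prompts: nothing about the underlying model changes, which is why the approach works with off-the-shelf systems.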
In other words, this isn’t freeform generation. It’s a highly controlled remixing process. The model is being asked not to create like a novelist, but to curate like a skilled editor, with strict limits on what it can add, change, or invent.
And remarkably, it works.
Once the researchers had built their two-stage system for producing Frankentexts, the next challenge was to see whether it actually worked, and more importantly, how well it performed under real creative constraints. So, they designed a series of experiments to test not just whether the stories could be generated, but whether they held up under scrutiny from both machines and humans.
To ground the experiment in practical storytelling use cases, the researchers pulled 100 real prompts from a well-known online writing forum. These prompts covered a range of imaginative scenarios (everything from speculative sci-fi to character-driven fantasy). The idea was to reflect the kind of rich, open-ended narratives where coherence, tone, and creativity matter deeply.
The system was tested across five top-performing language models from major providers, including OpenAI, Google, and Anthropic. Each model was asked to generate a Frankentext for every prompt under the same strict requirement: the majority of the output had to be reused human text, carefully stitched together into something that feels like a single, continuous story.
The research team didn’t just eyeball the outputs; they developed a multilayered evaluation framework to assess success in three key dimensions:
- Relevance to the prompt: Did the story actually respond to the creative task? This isn’t just about hitting keywords; it’s also about narrative alignment. For example, if the prompt asked for a story about dreams as currency, did the final text focus on that concept? Was it central to the plot, or did the model drift into unrelated territory?
- Narrative coherence: Could the story be read from start to finish as a logical, satisfying whole? Since these stories were built by piecing together prewritten fragments, there was a high risk of tonal shifts, character confusion, or contradictory plot elements. The researchers used both human annotators and an external AI model to evaluate whether the stories held together without falling apart.
- Compliance with the copying constraint: Perhaps the most important and unusual benchmark. Did the model obey the requirement to keep most of its output directly copied from human-written material? This was monitored using a mix of automated tools and attribution metrics to ensure the stories were indeed stitched together, not simply rephrased or hallucinated from scratch (a rough sketch of one way to estimate such compliance appears after this list).
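To make the compliance check concrete, here is a rough Python sketch of one way such a measurement could work: count the fraction of output words that sit inside verbatim n-grams also found in the source snippets. The n-gram length, the helper names, and the 90% threshold are illustrative assumptions, not the paper’s exact attribution metric.

```python
# Rough sketch of a compliance check: what fraction of the story's words sit
# inside a verbatim n-gram that also appears in the source snippets?
# The n-gram length and the 0.90 threshold are illustrative choices.

def _ngrams(words: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def copied_fraction(story: str, snippets: list[str], n: int = 5) -> float:
    story_words = story.split()
    source_grams: set[tuple[str, ...]] = set()
    for snippet in snippets:
        source_grams |= _ngrams(snippet.split(), n)

    covered = [False] * len(story_words)
    for i in range(len(story_words) - n + 1):
        if tuple(story_words[i:i + n]) in source_grams:
            covered[i:i + n] = [True] * n
    return sum(covered) / max(len(story_words), 1)


# Usage: flag a story that drifts below the 90% copy target.
# compliant = copied_fraction(story, snippets) >= 0.90
```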
One of the most eye-opening findings came when the researchers tested the final outputs against modern AI-detection tools (the kinds used by educators, publishers, and enterprises to flag content that might be machine-generated). Surprisingly, many Frankentexts passed as fully human-written, even though they had been heavily orchestrated by an AI system.
In other words, these hybrid texts were often undetectable by today’s best detectors.
This raised a critical question: if LLMs can generate content that looks and feels authentically human while heavily relying on existing human text, how do we know when we’re reading (or publishing) AI-assisted writing?
While automated tools played a central role in this research, human readers were just as important. The researchers brought in human evaluators to assess things like tone consistency, grammar flow, and whether a story “felt human.” These subjective assessments added a necessary layer of judgment that machines alone couldn’t provide (especially in a medium as nuanced as storytelling).
Ultimately, the evaluation was about more than technical success. It was about legibility, believability, and attribution. Could this kind of AI-authored remixing produce outputs that audiences trust? Could professionals use it without crossing ethical or legal lines?
Those questions aren’t fully resolved, but the Frankentexts project reveals how close we are to needing real answers.
The evaluation methods in the Frankentexts study didn’t stop at checking whether the stories “worked” on paper; they were designed to surface where things might break down. What makes this research so relevant to business leaders and content creators alike is how clearly it exposes the gaps in today’s AI governance tools and practices.
Much of what passes for AI evaluation today revolves around surface-level metrics: fluency, grammar, and prompt alignment. The Frankentexts framework went several steps deeper by stress-testing a model’s ability to meet structural constraints (how much of the story is copied) while maintaining narrative integrity (does the story still make sense?). This dual requirement is something that most current content workflows aren’t designed to monitor.
But the real insight comes from what slipped through the cracks.
In many cases, models produced stories that felt coherent and followed the rules on paper, but human reviewers still flagged them for subtle discontinuities. A character might change personality mid-paragraph, or the tone might swing wildly between scenes. Even when the copy constraints were followed perfectly, the final output didn’t always read like a cohesive, human-written piece.
That gap between technical compliance and how the writing feels matters, especially in industries where tone, trust, and coherence aren’t optional. It also highlights how easily a polished-looking AI draft can mask real problems under the hood (problems that today’s automated detectors routinely fail to catch).
The Frankentexts pipeline is powerful, but it’s not without limitations (many of which reveal themselves at scale).
For one, the system’s success hinges on the quality of the human-written snippets it draws from. If those snippets are inconsistent in tone, genre, or complexity, the model struggles to blend them smoothly. This means that industries or organizations looking to adopt similar methods will need well-curated internal content libraries to avoid incoherent or awkward outputs.
Second, the approach currently requires a fair amount of computational power and iterative prompting to get a high-quality result. While it doesn’t require specialized training or infrastructure, the back-and-forth regeneration process (especially during the polishing stage) adds friction that may be impractical in time-sensitive workflows.
And finally, while the model obeys the letter of the copy constraint, it doesn’t always capture the spirit. That is, it might technically reuse the required number of words, but still assemble them in a way that feels mechanical or synthetic. This becomes particularly problematic when outputs are published in environments where originality, tone, and authorship are closely scrutinized (like academia, journalism, or legal drafting).
Perhaps the most compelling part of the Frankentexts research is how it reframes what we mean by “detection” and “attribution.” If AI can produce mostly human-written outputs and still fool leading detectors, then we need a better class of tools (ones that don’t just label a full text as “AI-generated” but can analyze which parts are human, which are machine-generated, and how they interact).
This invites a broader shift toward token-level provenance: tracing where each sentence, paragraph, or phrase comes from—whether human or machine. That’s a future vision with big implications, not just for intellectual property and ethics, but for collaboration. We’re entering an era where human-AI writing teams won’t just be possible—they’ll be commonplace. But making that work at scale will require transparency, standards, and better tooling than what we have today.
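As a thought experiment, a crude version of token-level provenance could look something like the sketch below: label each word of a text as human-sourced when it sits inside a verbatim n-gram found in a corpus of known human writing, and treat everything else as machine glue. Real provenance tooling would need fuzzier matching, watermarking, or edit histories; the function names and the n-gram heuristic are assumptions for illustration only.

```python
# Crude illustration of token-level provenance: label each word "human" when it
# falls inside a verbatim n-gram found in a corpus of known human sources,
# label everything else "machine", then merge consecutive labels into spans.

def provenance_spans(text: str, human_sources: list[str], n: int = 5):
    words = text.split()

    source_grams = set()
    for src in human_sources:
        src_words = src.split()
        source_grams |= {
            tuple(src_words[i:i + n]) for i in range(len(src_words) - n + 1)
        }

    labels = ["machine"] * len(words)
    for i in range(len(words) - n + 1):
        if tuple(words[i:i + n]) in source_grams:
            labels[i:i + n] = ["human"] * n

    # Merge runs of identical labels into (label, span_text) pairs.
    spans, start = [], 0
    for i in range(1, len(words) + 1):
        if i == len(words) or labels[i] != labels[start]:
            spans.append((labels[start], " ".join(words[start:i])))
            start = i
    return spans
```

Even a toy like this shows why the problem is hard: exact matching misses light paraphrases, and labeling at the span level is only as good as the source corpus you can check against.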
The Frankentexts project isn’t just a technical curiosity; it’s also a wake-up call. It challenges the assumption that we can easily separate human and AI-authored text in high-stakes contexts. It shows how generative models can follow complex constraints with surprising skill, while also revealing the limits of our current ability to judge or verify the origins of what we read.
As organizations across industries lean into generative AI to scale content, they’ll need to rethink not just how they create, but also how they evaluate, attribute, and govern. Frankentexts shows us both the potential and the pitfalls—offering a roadmap for what smarter, more responsible AI content workflows might look like.
Further Readings
- Mallari, M. (2025, May 25). You Know You Got the Write Stuff (Mostly Borrowed). AI-First Product Management by Michael Mallari. https://michaelmallari.bitbucket.io/case-study/you-know-you-got-the-write-stuff-mostly-borrowed/
- Mallari, M. (2025, March 6). Syntax and sensibility: How interpretable artificial text detection using sparse autoencoders offers a scalable, transparent solution for identifying AI-generated writing. AI-First Product Management by Michael Mallari. https://michaelmallari.bitbucket.io/research-paper/syntax-and-sensibility/
- Pham, C. M., Russell, J., Pham, D., & Iyyer, M. (2025, May 23). Frankentexts: Stitching random text fragments into long-form narratives. arXiv. https://arxiv.org/abs/2505.18128