
Scraping the Writer

What’s Missing and What’s Stolen in Generative AI Writing

Erika Swyler | November 2023



As a speculative fiction writer and otherwise exhausted person, I’ve come to believe that tech industry leaders who claim science fiction as inspiration have failed to understand any of that genre’s important parts. They’ve ignored complex social and political explorations in favor of what they feel is cool. Some of the best writing on damning human exploitation lies firmly in the realm of science fiction. Yet when it comes to Large Language Models, AI developers have ignored the parts of science fiction centered on what it means to be human. Instead, they’ve focused on problems that don’t exist by creating software that—through theft—can generate a “book” in eight hours. 

To be blunt, a number of us working in speculative fiction deserve an apology from the tech community for how deeply boring and ultimately inhumane generative AI is in its development of LLMs. I had hopes that true artificial intelligence might one day aid in our understanding of the human brain, or perhaps help our species survive climate change. Machine learning is already used in drug development, reducing the time it takes to discover new molecules. But to use generative AI in the humanities? Where the root of the word is human, and where truth is a notoriously changeable thing? That strikes me as an insult, a waste of processing power, and an abuse of planetary resources. Speculative and science fiction writers have considered AI for decades, and—after scraping all of our writing—tech has come up with what functions as a plagiarism blender. What an utter failure of imagination. Though, I suppose lack of imagination is the point when it comes to generative AI writing.

I could see it as flattery that my novels were stolen to train AI. Zadie Smith’s books have been scraped, and she’s a genius. Stephen King’s enormous body of work has been scraped, but so too have we less prolific folks. Me and my good buddy, Steve. Is it flattering that someone might want to learn from my books? Yes. In fact, I help people do that through free talks and book clubs, mentoring, essays, panels at conferences, a newsletter, and all the unpaid labor that’s part of being a good literary citizen. The learning isn’t the problem. Nor is the theft (it’s impossible to stop every act of literary piracy). The problem is that despite books being scraped verbatim, the work Large Language Models generate is very much missing something. The best way to correct absences in a dataset is through close examination of the sources it was gathered from. I enjoy helping people, so for anyone wanting to learn my style or train their LLM (you may want to—I’ve moved a lot of books), I’ll let you in on what the scraped data omitted.

For an LLM to accurately emulate my writing, you’ll have to chat with it about choosing a language set that indicates a slightly to severely off-kilter suburban childhood. It’s helpful if you can input memories of being raised by a scientist at a national laboratory and an artist—preferably a ceramicist, but a painter will do in a pinch. Throw in an older sibling. Siblings help writers develop a wide range of relationship dynamics within a story.

It’s difficult not to lean into sarcasm when writing about this specific manifestation of AI (chatbots in a previous incarnation). Sarcasm prevails because the product is terrible, and those invested in it are earnestly trying to convince us that it isn’t. If that sounds harsh, I write it in part because Elon Musk was an early funder of OpenAI. Any concerns he raised about its development seem to have been meant to stifle other companies until he could found xAI and begin to catch up. As an author who has studied circus ringmasters and carnival barkers, I’m familiar with his type. The thing to understand about Musk, carnival barkers, and LLMs is that what they’re selling is never exactly what it purports to be. From exploding cars to exploding rockets and exploding social media platforms, poor quality and recklessness run rampant. In the Large Language Model aspect of AI, that means a subpar product that returns misinformation, makes up caselaw citations, and simply doesn’t write well. We’re already accustomed to mediocrity and failure in language AI, so developers have had little in the way of a quality standard to build on. Grammar check is a longstanding example of mediocre language AI we’ve used for ages, despite how laughably ineffective it is for anyone interested in style. Add to that autocorrect. Have you ever meant to text the word duck? How often has grammar check let the typo “pubic” stand in for “public”? Language models have never worked well with words that have multiple meanings, the artistic elements of language, or inference—all essential for making what we write readable. LLMs are also unable to differentiate facts from misinformation. Everything an LLM generates without significant human oversight is fiction. We used to call this fact checking. If useful tech isn’t all that useful when it comes to writing, can one have faith in something that’s intended to pallidly mimic human thought?

Despite tech’s efforts, it’s impossible to make human labor obsolete. I’ve been a proofreader and a transcriber, both jobs that tech has attempted to do away with. Grammar check and spell check were supposed to reduce the need for proofreading. Yet the most effective way to catch typos and errors in copy remains someone reading copy aloud while another person follows along with a red pen. Speech-to-text software was meant to eliminate the need for transcribers. Part of my job as a transcriber entailed listening to audio and correcting the transcription software’s mistakes. Factoring in the cost of the software, the client spent more money than they would have had they worked solely with a transcriber. The programs were entirely unable to recognize accents or rhythm in speech, or to decipher muffled audio. I’m highly skeptical of LLMs’ future because they’re built by and trained on human labor—finished work—which makes them a copyright nightmare. Broad sections of tech culture haven’t yet reckoned with how much of publishing is supported by deep backlists like Stephen King’s and Nora Roberts’s, and how much money publishers are willing to eat to go to court to protect that income.


But, if you’re going to get a generative AI to write “original” fiction in my style, without producing an indigestible book smoothie, you’ll need more information to train it, since writing lives in nuance. So, you’ll need to limit vocabulary choices to those typical of white, middle- to upper-middle-class people living in the northeastern United States, but with the occasional southern New Jersey slang thrown in. Long Island, north shore, but comfortable conversing outside of it. The dataset must include standard words and phrasing encountered in four years of studying English literature and theater at an overpriced private university. Then you’ll need to instruct it on tone. As if race and class weren’t difficult enough to discuss with a machine, your LLM will also need to model a mood disorder and the way mood disorders influence language and thought patterns. The DSM-5 description of major depression is easily at hand—assume that’s within the LLM’s working vocabulary, since textbooks and references are widely pirated. Dysthymia may also apply. Anxiety disorder, yes, but that shouldn’t be the overarching input. Do you feel odd about teaching a program to mimic mental illness? Good. Yet this specificity is essential, because writing is rhythm, dialect, accents, and moments of clarity and chaos. Writing, in my attempts, needs a sense of rootedness, a place it writes from. Your LLM will need to operate with the sense that it “lives” in an inherited home where every repair or renovation writes over history and the evidence of loved ones. It will also need to draw from eleven years of living in Brooklyn, and two years in Florida. The Florida episodes expand the LLM’s access to descriptions of heat, large insects, tourist culture, the grotesque, and a form of lawlessness that is its own flavor of Americana. Don’t be shocked when your LLM spits out an essay on the cultural importance of kitsch.

The vast majority of technological advancement is meant to replace some form of human labor. In the case of LLMs used for writing, the labor is thought, though the technology itself is incapable of thought and behaves more like a pattern-learning machine. From a societal standpoint, most modern efforts to replace human labor have come from people who value neither humans nor labor. OpenAI demonstrated this clearly when it hired Kenyans to screen scraped data and paid them less than $2 per hour. As used in programs like ChatGPT, generative AI exploits humans to solve a problem that doesn’t exist. Humans write; many of us write well, and even enjoy it. If you’re able to overlook the plagiaristic elements of working from scraped data, you’re still relying on the labor of exploited people to do something you could do yourself or pay someone fairly to do. You may not be doing the work when using generative AI, but someone did. The labor didn’t magically disappear—it’s been disguised to hide abuse. In using an LLM to generate a page in the style of Stephen King, say, something terrible happening to a group of small children, you’re exploiting King’s work, the labor of anyone who had to screen text and images associated with King, and—pardon the speculative reach—the planet. All the processors and data silos, all the machines used to run Large Language Models, require power and cooling—lots of cooling. How much of that runs on fossil fuel? The answer isn’t zero. Add to that the excitement the film and television industries have about AI, all the spidery legs of tech giants scraping the entire internet, and you get an outsized carbon footprint that hastens global warming and mass extinction, all for the sake of . . . not paying people.


You’ll encounter difficulties getting your LLM to write an essay on AI in my style. It should be an easy ask, as there’s a glut of writing on AI for it to scrape (apologies for adding to it), and new developments happen daily. It would be even more difficult to generate an essay that doesn’t pull corporate puff copy from companies who intend to use AI to downsize their workforce. But my style as an essayist differs from my style as a novelist, and the vast majority of my work that’s been scraped is from my novels. There’s a chance that instead of writing this essay, an LLM would generate word pablum involving an eye injury and anachronistic phrases belonging to a nineteenth-century circus ringmaster. As they say, shit in, shit out. My voice as a short story writer is almost nonexistent on the internet, partly due to digital decay, something we should consider with regard to AI. If digital decay holds true to form, as the internet is not infinite, eventually my voice as an essayist will also disappear.

So, what doesn’t change? The actual source data. For an LLM to generate an accurate “Erika Swyler” work, you’ll need to instruct it to factor in the lifelong impact of a failed suicide attempt as a teenager. This is important for the dataset, as it colors writing on altruism, the interconnectedness of people, the U.S. hospital system, and the body and its responses to pain. There are many memoirs on suicidality for your LLM to steal from if needed. Your LLM should account for four years’ study in a famous theater program. This adds a working knowledge of Shakespeare, American experimental theater from the 1960s to the 2000s, twentieth-century Eastern European dramatists, and Indian dance theater, plus a facile emotional range, a giant raw nerve that never goes away, the ability to cry on cue and also be endlessly crying, a propensity to subject oneself to emotional trauma while others watch, desperation for approval, and routine public humiliation. Now your LLM will have skill with dialogue, character psychology, and character physicality. That’s almost enough for the entry-level emotional range of a debut novel, but you’ll need more for the LLM to have depth. Prompt it to include language about the suicide of a family member. You must include this to generate my style, as it’s the engine for long melancholic phrases and bitter two-word responses.

Since the early days of Facebook, tech has adopted the inane, repulsive motto, “Move fast and break things.” Google, so synonymous with searching that it became a verb, has committed to this wholeheartedly. In its hasty development of its tool Bard, which runs on its own large language model, LaMDA (Language Model for Dialogue Applications), the company has already broken itself. What was once an extraordinary instrument—particularly for writers needing to research—now regularly returns misinformation. AI in this sense works like a game of telephone, only at one end there’s a calculator, and at the other there’s a pile of words. In its rush to claim all writing on the internet as its property, Google ruined the one thing it was consistently good at. And now the company would like you to further train its AI through your text messages—for free, of course. It’s just learning. Nothing nefarious.

Should art ever adopt a motto, may it be along the lines of “Move slowly and with great care.” 

For your generative AI to learn me and make my work, you’ll have to instruct it to choose language patterns indicative of a long and strange employment history. I was a late bloomer. So, it needs to draw from a work history of carpentry, landscaping, stagehand work, secretarial jobs, art studio assisting, proofreading, hotel convention sales, legal transcription, and museums. Your AI can’t be in a rush to make art; to mimic my writing, it needs to gather information over time. The work history expands the dataset to both blue- and white-collar work, and also allows it to generate character voices of different classes with a tone of authority. In the arts, we might call this range. Since it’s often related to work, instruct your LLM on having a nightmare disorder. That pirated DSM-5 probably describes it well. This one’s fun—it creates body horror text. While neural nets already generate the kind of body horror phrases a nightmare disorder fuels, LLMs are more sophisticated word blenders, and can almost mimic intention.

Intention raises another question—why generate work in my style? For art? For content? AI is often touted as useful in content creation. Content, the catch-all word of late-stage capitalism, is nearly inseparable from an end goal of monetization. It’s marketing copy, memes, videos, social media posts, political ads, almost anything posted online. It’s fine to respect the content hustle—labor is labor—but we should acknowledge that the content hustle is separate from pursuing art. People making art tend to refer to their chosen medium by its name. For me, a major differentiation is that it’s easy to doomscroll content, whereas it’s nearly impossible to doomscroll art. Whether I succeed at it or not, my intention is always art. Emphasize that to your LLM, if it processes emphasis.

From a scraped writer’s perspective, the most important thing that companies developing LLMs for generative AI fail to understand about writing is that the act is enjoyable, even when difficult. It’s time to thoughtfully analyze something, to play, and to choose words that please the senses (for inclusivity, I opted to avoid the words eyes or ears; LLMs need to be prompted to do this). Complaining about writing is joyful, satisfying to the soul, and most writers’ favorite pastime. Generative AI used for writing works on the assumption that the goal is a final product, as opposed to the process, skipping the joy and the complaining. For many of us, the point isn’t a product, but to be changed by the writing process. Every writer has finished work that no one but themself will read; often it’s a much higher percentage of their work than what gets published. Those unread or unfinished attempts matter as much as, if not more than, the published work, because the act of writing works on the brain as it works on the page. That can’t be outsourced, nor should it be.

An accurate Erika Swyler LLM will have to do time compression in several areas of acquired knowledge. It will have to process two decades or so of casual sketching and doodling and operate with the understanding that in the course of generating one of my novels, an illustration may pop up. This means your LLM will have to converse with something like Midjourney or DALL-E. AI-generated illustration introduces a new set of copyright issues, and artists can now glaze images to prevent AI training. Once AI companies inevitably charge the masses at every level, this venture could get expensive. Whichever AI you use, suggest illustrations of animals, pencils over pens, the wish to be more skilled, and a total lack of desire to put in the effort to improve. Don’t think too hard about that. Train it on twelve years of casual middle-distance running; that’s when I do most of my thinking about writing. You may need a separate processor to mimic the meditative states physical exertion induces. Don’t worry, Murakami has been scraped. Prompt your LLM to write as though it has a bum left foot. While that won’t be the subject of anything it turns out, it’s essential for faking the subconscious themes that are the art of making a novel. From there, any model must make language choices derived from the repression and confusion produced by a misogynistic society. Specify slow-burning anger, and frustration that no current term or label fits your sense of self and how you exist in your body, mind, and the world. Instruct the LLM that any writing about the above will be cut, or queried for clarification by an editor, even when no clarification exists. In the generated text, this will appear as three paragraphs that delete themselves a line at a time. It’s an especially nice effect in an e-book. A physical book accomplishes the task with pull tabs, pop-up book style. Now is a good time to play around with processing speed, generating text in chunks of 500 to 1,000 words. Those chunks must self-delete within twenty-four hours and be replaced by one hundred or so new words with better style. Then, they should sit for months while other words are generated. Later, possibly years later, the LLM should piece together the separate segments, at which point an algorithm will need to slide blocks of seemingly unrelated text around to establish “flow.” You may want to develop and run a mod I like to call Rubik’s Cube.


The best argument for the use of AI writing would be as a replacement for things no one wants to write or read: cover letters or basic business communication that consumes hours of people’s lives but gets perused for less than a minute. Yet, the argument to have AI do that writing crashes head-on into the argument that this kind of writing has outlived its usefulness and need no longer exist. God love a literary magazine that doesn’t have a cover letter field on its Submittable page. All of this leads to existential questions about whether LLMs or generative AI are necessary, and how much time will pass before they become obsolete.

Bear with me a moment. You’re making a me after all, and I do exist.

In any discussion of LLMs, someone will mention that this technology is here to stay and we must simply get used to it, legislate around it, and develop safeguards to deal with the misinformation it generates. I’ve been a writer for far too long to make such a bold statement. In my lifetime, typewriters have been replaced by word processors, which were replaced by desktop computers, which were replaced by laptops, which in some cases have been replaced by tablets. Growing up, I happily used a rotary phone that was attached to the wall in my kitchen, yet many who read this will have never seen one in person. The iPod launched and became obsolete within half my lifetime. Paper and pen have been writing’s longest lasting technological innovations. They’re cheap, hard to use incorrectly, and when the writing is bad it’s not the fault of the tech. It feels inevitable that LLMs as generative writing tools will become obsolete—if not because the material they generate is of mediocre to awful quality, if not because the way we interact with the digital world will change dramatically, then because a large number of people are starting to realize they’d rather not have tech CEOs making money by running their words through a washing mangle, without seeing any of the profits.


It’s all a cash grab, meant to take humanity out of human expression, and it’s full of the same bluster that led Hollywood studios to believe streaming models would be infinitely profitable. The hilarity of one type of CEO being hoodwinked by another type of CEO aside, it’s deeply boring. The only good or interesting thing that has come from CEOs buying into faulty concepts like infinite growth and infinite profitability is Cory Doctorow’s coining of the term enshittification to describe in part the effect of platforms holding users hostage via their data. That is where LLMs are currently headed—a model where you’ll pay to use the thing you trained for free. Paying a corporation to use the words and labor it stole from you feels like a pioneering moment in enshittification.

But, since I’m not the litigious sort, let’s make this Erika Swyler LLM more precise. Train it on what it is to be awake at 3:00 a.m. in a depressive anxiety spiral. That’s what makes all the water imagery. Train it so that it knows it “falls asleep” reading romance novels or listening to poorly produced audiobooks of classic philosophical texts. You may be questioning whether those details are important. They are. Flights of fancy at regular intervals are also important, say dreams of being a lounge singer in a slinky dress, purring torch songs while draped across a piano. Make sure it has access to what it feels like to do a passable Judy Garland on demand. Since trauma plots are important in publishing, your LLM needs visceral memories of a parent’s traumatic brain injury and recovery. Those experiences impact any writing about the body, hope, and grief, and complicate any parent-child relationships you ask it to generate. Really, that one data chunk is a banger, and you can’t leave it out. It will sting if you ever need to pay to use that data, but that’s mostly an insult and not an injury. Insult dims, but trauma? You’ll be making my books for that.

It’s easy to get fixated on the vulnerability of humans to misinformation and exploitation by LLMs and the people who run them; it’s perhaps more relevant to focus on the vulnerability of LLMs to humans. At hackathons, LLMs are easily tricked into revealing secure information. It’s of note that AI has our racial biases and ableism trained into it. Those Kenyan screeners were hired in part to mitigate LLMs generating hate speech in chats—that’s entirely human-generated language. The AI recognized and replicated the patterns of hate speech it scraped from across the internet. Generative AI has been touted as a boon for the disabled, but I’ve yet to encounter the disability community advocating for it. Look at the low quality of computer-generated captioning and ask if this technology is truly assisting or if it’s highlighting all the accommodations and efforts society at large refuses to make. It’s a lack of care demonstrated through technology. It’s greed, an entirely human trait.

The people building LLMs for wide business use are the same people working to lock us into platforms that have outlived their usefulness (Meta, Twitter/X, Amazon). The tech starts out free—though it should pay you, since you’re training it. Gradually it gets hooked into other programs and businesses you use regularly. Then the charging begins. Will there be fees to “untrain” an LLM on your work? And what if you want to widely use something created by an LLM, say on your company’s materials? Do you buy a license for work created through it? How does that interact with the rights your company is trying to protect? Digital Rights Management alone is hairy. Simply put, any potential good a generative LLM might do is counteracted by the profiteering of the companies developing it.

There’s also the mess that is choice by algorithm. Algorithms masquerade as specificity, something you tailor through choice, but they’re processes of data sorting that create categories of people. Historically, that’s never led to anything good. People who write for a living routinely “break” algorithms with their research. So, for an accurate LLM specific to any writer’s style, you’re going to need their source material, all of it, and to anticipate future source material. As an example, I might be reading up on modern amputation techniques, solar technologies with regard to train travel, traditional cloth-manufacturing methods in rural England, and whatever else I’m doing for fun. Aside from sentence patterns learned from scraping pirated instances of my books, you’d need a Project Gutenberg’s worth of reading material. Walk into a well-established new and used bookstore and grab books from every section that isn’t military biography. Tell a bookseller you’d like to assemble the library of a person who reads the way crows collect objects. Make sure they include some battered 1960s science fiction, classic boys’ adventure books, the works of Yasunari Kawabata, and Katherine Dunn’s Geek Love. Do this several times a year for two to three decades.

Yet, for all the texts companies like OpenAI scrape to train their language models, there’s very little written about how these texts are meant to interact. Human brains make connections beautifully, illogically, and traffic in the language of inference. An LLM? That’s just meant to appear human. It’s not thinking, or inferring, it’s only shuffling.

For the final touches on your LLM, to generate a proper Erika Swyler book or essay, its dataset needs a mother’s sudden death, with some very specific parameters around timing: it should happen two weeks after the biggest triumph of a professional life, and it should generate grief and physical numbness that cause the reemergence of an eating disorder. Intermittent disordered eating is important data. AI might have a difficult time with this, as human writing, my writing, is deeply tied to the mind-body relationship. If you’re in a desperate spot, tell it to scrape issues of Seventeen from the 1980s and ’90s. Tell it to generate text that implies the people who created it understand artistic drive and the trauma of having a body. Tell it to generate text as though it understands people—many people—people who aren’t its programmers. Tell it to write as though its makers know they’ve never been special and that they alone cannot and will not pilot our species through a dark age. Tell it to generate text that reads as though its writer knows that every single individual expression of emotion is a reach for commonality and connection with an ineffable human spirit. This will help when it becomes obsolete.

Now, your LLM is ready to try its proverbial hand at an essay. If that seems like a lot of work, we understand each other.


Erika Swyler is the nationally bestselling author of The Book of Speculation (St. Martin’s Press) and Light From Other Stars (Bloomsbury). Her forthcoming novel, We Lived On The Horizon (Atria), focuses on artificial intelligence and the exploitation of altruism during an age of revolution. She lives and writes on Long Island, New York.

