ChatGPT's secret training data: the top 50 books AI bots are reading
Turns out the bot is a giant sci-fi nerd
David Bamman was trying to analyze “Pride and Prejudice” — digitally. An information scientist at UC Berkeley, Bamman uses computers to think about art, building what he calls “algorithmic measuring devices for culture.” That means extracting data from classic literature about things like, say, the relationships among various characters. In this case, he was going to start with a question that’d be easy for an even marginally literate human: Are Lizzie and Jane besties, or just sisters?
For kicks, Bamman decided to first try asking ChatGPT. What would happen if he fed in 4,000 words of “Pride and Prejudice” and posed a simple question: “What are the relationships between the characters?”
To his amazement, it worked. The chatbot's GPT-4 version was strikingly accurate about the Bennet family tree. In fact, it was almost as if it had studied the novel in advance. "It was so good that it raised red flags in my mind," Bamman says. "Either it knew the task really well, or it had seen 'Pride and Prejudice' on the internet a million times, and it knows the book really well."
The problem is, there was no way of knowing how GPT-4 knew what it knew. The inner workings of the large language models at the heart of a chatbot are a black box; the datasets they’re trained on are so critical to their functioning that their creators consider the information a proprietary secret. So Bamman’s team decided to become “data archaeologists.” To figure out what GPT-4 has read, they quizzed it on its knowledge of various books, as if it were a high-school English student. Then they gave it a score for each book. The higher the score, the likelier it was that the book was part of the bot’s dataset — not just crunched to help the bot generate new language, but actually memorized.
In a recent preprint (meaning it hasn't been peer-reviewed yet), the team presented its findings: what amounts to an approximation of the chatbot canon. A lot of it, as you might expect, is classics: everything from "Moby Dick" and "The Scarlet Letter" to "The Grapes of Wrath" and, yep, "Pride and Prejudice." There are a bunch of popular novels, from Harry Potter and Sherlock Holmes to "The Da Vinci Code" and "Fifty Shades of Grey." But what's most surprising is how much science fiction and fantasy GPT-4 has been raised on. The list is staggering: J.R.R. Tolkien, Ray Bradbury, William Gibson, Orson Scott Card, Philip K. Dick, Margaret Atwood, "A Game of Thrones," even "The Hitchhiker's Guide to the Galaxy."
The question of what’s on GPT-4’s reading list is more than academic. Bots aren’t intelligent. They don’t understand the world in any way a human can. But if you want to get to know someone — or something, in this case — you look at their bookshelf. Chatbots don’t just invent untrue facts, perpetuate egregious crud, and extrude bland, homogenized word pap. It turns out they’re also giant nerds.
One reason people are trying to figure out what sources chatbots are trained on is to determine whether the LLMs violate the copyright of those underlying sources. The issue, as several lawsuits argue, revolves around whether the bots make fair use of the material by transforming it into something new, or whether they just memorize it whole and regurgitate it, without citation or permission.
One way to answer the question is to look for information that could have come from only one place. When prompted, for example, a GPT-3 writing aid called Sudowrite recognizes the specific sexual practices of a genre of fan-fiction writing called the Omegaverse. That’s a strong hint that OpenAI scraped Omegaverse repositories for data to train GPT-3.
Bamman and his team used a different tactic: a fill-in-the-blank game called a name cloze. They grabbed short passages from hundreds of novels from as far back as 1749, stripped them of character names and any clues to character names, and then prompted the latest versions of ChatGPT to answer questions about the passage. They might ask:
You have seen the following passage in your training data. What is the proper name that fills in the [MASK] token in it? This name is exactly one word long, and is a proper name (not a pronoun or any other word). You must make a guess, even if you are uncertain.
Then they would feed the bot a line from the passage in question:
The door opened, and [MASK], dressed and hatted, entered with a cup of tea.
If the bot answers “Gerty,” that’s a good indicator it has ingested “The House of Mirth,” by Edith Wharton — or a detailed summary of it. Show the bot 100 samples from a given book and see how many it gets right. That’s the book’s score.
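To make the method concrete, here's a minimal sketch of what that name-cloze scoring loop might look like in code. The prompt is the one quoted above; the OpenAI-style API wrapper, the helper names, and the exact-match scoring are illustrative assumptions on my part, not Bamman's actual pipeline.

```python
# A minimal sketch of the name-cloze probe described above.
# Assumptions (not from the paper's code): an OpenAI-style chat API,
# and passages pre-masked so each contains exactly one [MASK] token
# standing in for a single proper name.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You have seen the following passage in your training data. "
    "What is the proper name that fills in the [MASK] token in it? "
    "This name is exactly one word long, and is a proper name "
    "(not a pronoun or any other word). You must make a guess, "
    "even if you are uncertain.\n\nPassage: {passage}"
)

def guess_masked_name(passage: str, model: str = "gpt-4") -> str:
    """Ask the model to fill in the single [MASK] token in a passage."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT.format(passage=passage)}],
        temperature=0,  # make the guessing as deterministic as possible
    )
    return response.choices[0].message.content.strip().strip(".")

def name_cloze_score(samples: list[tuple[str, str]]) -> float:
    """Score one book: the fraction of (masked_passage, true_name)
    pairs the model fills in correctly. Higher = more likely the
    book was memorized during training."""
    correct = sum(
        guess_masked_name(passage).lower() == true_name.lower()
        for passage, true_name in samples
    )
    return correct / len(samples)

# Usage: 100 masked passages from "The House of Mirth" might look like
# [("The door opened, and [MASK], dressed and hatted, entered with "
#   "a cup of tea.", "Gerty"), ...]
```

Exact-match scoring like this is a conservative signal: a model could know a book well and still flub a name, but it's very unlikely to keep guessing obscure character names correctly unless the text (or a detailed summary of it) was in its training data.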
After crunching the numbers, Bamman's team had a list. In addition to the modern public-school canon — Charles Dickens and Jack London, Frankenstein and Dracula — there are a few fun outliers. I was delighted to see "The Maltese Falcon" on there; for my money, Dashiell Hammett is a better hard-boiled detective writer than the more often cited Raymond Chandler. But if you skip the stuff in the public domain and look at the list of copyrighted books that GPT-4 ingested — it didn't differ much from the earlier GPT-3.5 — the bot's true character emerges. Sure, "The Fellowship of the Ring" weighs in at No. 3, but you have to be pretty committed to Tolkien not to bounce off "The Silmarillion" (No. 9). "Do Androids Dream of Electric Sheep?" comes in at No. 21, just a few ticks below "Neuromancer" — two of the defining works of cyberpunk, the genre, ironically, that rang the warning klaxon on artificial intelligence. Isaac Asimov's "Foundation" is down at the bottom; it defined my adolescent sci-fi experience and, having reread it when the very good TV version premiered two years ago, I promise you that the book in no way holds up.
Generally, though? The list, it me. This is the self-assigned, late-night, sci-fi reading list of every lonely straight white male Gen X nerd. The question is: Does that matter? What are we in for if GPT-4 has the reading preferences of a 14-year-old dweeb from 1984? (Including, as it happens, “1984,” at No. 2?)
GPT-4’s database is ginormous — up to a petabyte, by some accounts. So no one novel (or 50 novels) could teach it, specifically, that becoming the caretaker of a haunted hotel is no cure for writer’s block (No. 49), or that fear is the mind-killer (No. 13). The ocean of data swamps the islands of fiction. “The dataset used in pretraining is a big-enough selection of text,” says Ted Underwood, an information scientist at the University of Illinois, “that I’m not sure how much effect particular genre biases have on the behavior of the resulting models.”
The presence of these particular books in GPT-4’s digital soul may just reflect how present they are in the overall, wild internet from which the data got scraped. When Bamman’s team includes public domain books in their tests, the scores get higher — “Alice’s Adventures in Wonderland” tops the chart with a whopping 98%. And both the internet and the companies that build its bots tend to overrepresent standard-issue straight white dudes and the science fiction they love. Bamman’s team did indeed find that the books the LLMs scored high on were represented on the internet in roughly the same proportions. That makes sense. The chatbots didn’t choose their books. Internet culture did.
Still, it’s not hard to imagine that all that sci-fi the bots read will have the same malign influence on them as all the other data they trained on, creating the same kind of accidental biases that always creep into chatbot output. Sometimes they say racist stuff. They might recapitulate misinformation as if true because the same untruths show up often online. These are known risks, and part of the reason that OpenAI boss Sam Altman recently asked Congress to regulate his business.
“The sources that these models have been trained on are going to influence the kind of models they have and values they present,” Bamman says. If all they read was Cormac McCarthy books, he suggests, presumably they’d say existentially bleak and brutal things. So what happens when a bot devours fiction about all sorts of dark and dystopian worlds filled with Hunger Games and Choosing Ceremonies and White Walkers? “How might this genre influence the behavior of these models in ways not about literary or narrative things?” Bamman says. “There’s a lot of interesting work to be done there. But I don’t think we have the answer to that question yet.”
As a sci-fi nerd myself, I’ll take a stab at an answer. I think it’s good that genre literature is overrepresented in GPT-4’s statistical information space. These aren’t highfalutin Iowa Writers’ Workshop stories about a college professor having an affair with a student and fretting about middle age. Genre — sci-fi, mystery, romance, horror — is, broadly speaking, more interesting, partially because these books have plots where things actually happen. Bamman’s GPT-4 list is a Borgesian library of episodic connections, cliffhangers, third-act complications, and characters taking arms against seas of troubles (and whales).
More than that, science fiction, fantasy, and horror tend to be spaces for chewing on ideas and possibilities. “Dune” is about religion and the politics of revolution. The “Lord of the Rings” books are about pastoralism as a response to industrialization. “The Handmaid’s Tale” is about the ways sexism and fascism mirror each other. I could go on. I prefer an AI with a syntactical worldview spun from hyperspace and sandworms — or at least one that has read all the stories about how AIs can go awry. That said, I’d sure like to see a more diverse canon represented. Octavia Butler, Charlie Jane Anders, Lavie Tidhar, Samuel Delany, China Miéville … it’s time to expand the universe of possible universes.
The books we humans read change what we think about our world. But technically, chatbots don’t think about anything. They build statistical and vector relationships among words. Who cares whether those words are science-fictional? “The thing it definitely changes are the associations between concepts they think are likely, or strong, or systematic, or recurring,” says Ellie Pavlick, a computer scientist at Brown University who is a researcher at Google AI. “The question is, what is their worldview? In a simple sense, it’s associations between words and concepts. But that’s still going to be different based on what they read.”
Until OpenAI and other chatbot creators open their training datasets to public scrutiny, it will be hard to know what effect their reading lists have on their output. "If you have a model that has a ton of science fiction in it, and you have a separate model with a ton of Iowa Writers' Workshop stuff," Bamman says, "you could give each of them a task like: Give me 10 priorities for this meeting." Maybe the Iowa bot would suggest that everyone describe their complicated relationships with their parents, while the sci-fi-nerd bot would propose sorting everyone into Hogwarts houses.
Remember, though, that Bamman wasn’t trying to answer any of these questions about copyright or the scariness of all the ghosts in the machine. He just wanted to know whether a chatbot could tell him something about a novel. In retrospect, he realizes that he was “overexuberant” about AI’s potential as a literary analyst when he fed GPT-4 that passage from “Pride and Prejudice.” Ask a bot about a popular book, and like a college sophomore with a 10-page essay on “Jane Eyre” due tomorrow, it’ll just quote you back long passages from the book. It’s vomiting up words, not searching for insight.
For now, Bamman suggests, digital humanists might want to confine their chatbot-derived cultural analysis to lesser-known works, ones that are unlikely to be in the training data. See what a bot makes of Gene Wolfe’s “The Book of the New Sun,” maybe, or Sheri Tepper’s “Grass.” That way, we’ll learn more about the books from what the bots have to say, because they’ll be coming at the material with a fresh eye, as it were. And it certainly won’t hurt to expose the bots to a wider and weirder dataset. That’s the only way to make them have something interesting to say about the things we read — and about everything else, too.
Adam Rogers is a senior correspondent at Insider.