Zuckerberg Appeared to Know Meta Trained AI on Pirated Library
Newly unsealed documents show how Meta used LibGen, a pirated library of ebooks, to train its Llama 3 chatbot.
The AI rush has brought with it thorny questions of copyright and ownership of data as tech companies train bots like ChatGPT on existing texts, but it seems Meta largely brushed these aside as it worked to integrate such tools into Facebook and Instagram.
As first revealed in a motion filed by attorneys for novelists Christopher Golden and Richard Kadrey and comedian Sarah Silverman, who are pursuing a class-action suit against Meta for allegedly using their copyrighted work without permission, employees at the tech giant had candid conversations about the potential for scandal that would arise from leveraging a risky resource: Library Genesis, or LibGen, a massive so-called “shadow library” of free downloadable ebooks and PDFs that includes otherwise paywalled research and academic articles. In these exchanges, Meta’s engineers identified LibGen as “a dataset we know to be pirated,” but indicated that CEO Mark Zuckerberg had approved its use for training the next iteration of its large language model, Llama.
Now, under a court order from Judge Vince Chhabria of the U.S. District Court for the Northern District of California, the records of those previously confidential internal dialogues have been unsealed, and appear to confirm Zuckerberg’s decision to greenlight the transfer of pirated, copyrighted LibGen data to improve Llama — despite concerns about a backlash. In an email to Joelle Pineau, vice president of AI research at Meta, Sony Theakanath, director of product management, wrote, “After a prior escalation to MZ [Mark Zuckerberg], GenAI has been approved to use LibGen for Llama 3 […] with a number of agreed upon mitigations.” The note observed that including the LibGen material would help them reach certain performance benchmarks, and alluded to industry rumors that other AI companies, including OpenAI and Mistral AI, are “using the library for their models.” In the same email, Theakanath wrote that under no circumstances would Meta publicly disclose its use of LibGen.
The same email lays out the legal exposures and potential negative media attention that could follow if “external parties” deduce that the LibGen trove formed part of Llama’s training data: “Copyright and IP is top of mind for legislators around the world, including in the US and EU,” the document states. “US legislators expressed concern in a recent hearing about AI developers using pirated websites for training. It’s unclear what their legislative actions would be if the concern spreads, but it reflects some of the negative lobbying right holders have been doing, related to our litigation on this topic (along the lines that this is ‘stolen’ content that then taints the output of this model).”
Meta did not immediately return a request for comment on these internal communications.
Elsewhere in the unsealed documents, Meta employees describe methods for processing and filtering text from LibGen in order to remove “boilerplate” indications of copyright, such as “ISBN,” “Copyright,” “©,” and “All rights reserved.” The author of a memo titled “Observations on LibGen-SciMag” (“SciMag” is the library’s catalogue of science journals) reports that the material’s “quality is high and the documents are long so this should be great data to learn from, in particular, for highly specialized knowledge!” The same memo recommends trying to “remove more copyright headers and document identifiers” — seemingly more evidence that Meta was looking to cover its tracks as it exploited this cache of technical text that it did not have permission to use.
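The memos describe that goal but not the code behind it; purely for illustration, a filter of the kind described, one that drops lines containing markers such as “ISBN,” “Copyright,” “©,” or “All rights reserved,” might look roughly like the Python sketch below. The marker list and function name here are assumptions for the example, not anything drawn from Meta’s documents.

```python
# Illustrative sketch only: the unsealed memos do not include Meta's actual code.
# This shows the general shape of a filter that strips lines containing
# boilerplate copyright markers from text before it is used as training data.

BOILERPLATE_MARKERS = ("isbn", "copyright", "©", "all rights reserved")  # assumed list

def strip_copyright_boilerplate(text: str) -> str:
    """Drop any line containing a known copyright marker; keep everything else."""
    kept = []
    for line in text.splitlines():
        lowered = line.lower()
        if any(marker in lowered for marker in BOILERPLATE_MARKERS):
            continue  # e.g. "© 2021 Example Press. All rights reserved."
        kept.append(line)
    return "\n".join(kept)

sample = (
    "Chapter 1\n"
    "ISBN 978-0-000-00000-0\n"
    "© 2021 Example Press. All rights reserved.\n"
    "The text of the article begins here."
)
print(strip_copyright_boilerplate(sample))
# Prints:
# Chapter 1
# The text of the article begins here.
```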
Other revealing messages show Meta’s AI research team and executives discussing how to obtain the LibGen data set without torrenting it (downloading via peer-to-peer file sharing) directly from the company’s IP addresses. At some points, employees wondered if this was even allowed. “I think torrenting from a corporate laptop doesn’t feel right,” wrote one engineer in April 2023, adding a smiley face emoji. (A later email acknowledged that the “SciMag” data had indeed been torrented.) And in October 2023 messages to a researcher working on Llama, Ahmad Al-Dahle, vice president of GenAI at Meta, said he had “cleared the path to use” LibGen and was “pushing from the top” to incorporate other data sets to improve Llama and win the AI race.
It’s no wonder Meta fought the unsealing and unredacting of these discussions as the discovery period in the copyright lawsuit came to an end: they seem to damage the company’s argument that “using text to statistically model language and generate original expression” falls under the legal rubric of fair use, or the permissible limited use of copyrighted material without permission, as its lawyers put it in a motion to dismiss the suit. The plaintiffs’ attorneys, moreover, noted in their latest filing that Zuckerberg himself said in a recent deposition that the kind of piracy described in their latest amended complaint would raise “lots of red flags” and “seems like a bad thing.”
Of course, Meta, which announced Tuesday that it will be cutting the 5 percent of its workforce deemed its “lowest performers,” or some 3,600 workers, is hardly alone as a Silicon Valley behemoth accused of flouting (or circumventing) copyright law. This class action could prove a bellwether for the many other suits in progress against AI companies regarding the ownership of photographs, art, music, journalism, books, and more. But as long as tech firms are hungrily searching for more stuff for their bots to replicate and remix, they will always be reliant on the original content creators: human beings.