Inside Meta’s race to beat OpenAI: “We need to learn how to build the frontier and win this race”

14 Jan 25

A major copyright lawsuit against Meta has revealed a trove of internal communications about the company’s plans to develop its open-source artificial intelligence models, Llama, which include discussions about avoiding “media coverage which suggests that we have used a data set that we know has been hacked.”

The messages, which were part of a series of exhibits unsealed by a California court, suggest that Meta used copyrighted data when training its AI systems and worked to hide it while racing to beat rivals like OpenAI and Mistral. Some of the messages were first revealed last week.

In an October 2023 email to Meta AI researcher Hugo Touvron, Ahmad Al-Dahle, Meta’s vice president of generative AI, wrote that the company’s goal “must be GPT4,” referring to the OpenAI large language model announced in March 2023. Meta had to “learn how to build the frontier and win this race,” Al-Dahle added. Those plans apparently involved using the book piracy site Library Genesis (LibGen) to train its AI systems.

An undated email from Meta product director Sony Theakanath to VP of AI research Joelle Pineau weighed whether to use LibGen internally only, for benchmarks included in a blog post, or to create a model trained on the site’s data. In the email, Theakanath writes that “GenAI has been approved to use LibGen for Llama3… with a number of agreed upon mitigations” after escalating it to “MZ” – presumably Meta CEO Mark Zuckerberg. Theakanath believed that “Libgen is essential to meet SOTA [state-of-the-art] numbers,” adding that “OpenAI and Mistral are known to be using the library for their models (by word of mouth).” Mistral and OpenAI have not stated whether they use LibGen. (The Verge has contacted both for more information.)

Meta’s Theakanath writes that LibGen is “essential” to achieving “SOTA numbers in all categories.”
Screenshot: The Verge

The court documents stem from a class-action lawsuit that author Richard Kadrey, comedian Sarah Silverman, and others filed against Meta, accusing it of using illegally obtained copyrighted content to train its AI models in violation of intellectual property laws. Meta, like other AI companies, has argued that using copyrighted material in training data should constitute legal fair use. The Verge reached out to Meta for comment but did not immediately hear back.

Some of the “mitigations” for using LibGen included stipulations that Meta must “remove data clearly marked as pirated/stolen” and avoid citing “any training data” from the site externally. Theakanath’s email also said Meta would have to “red team” its models “for bioweapons and CBRNE [Chemical, Biological, Radiological, Nuclear, and Explosives]” risks.

The email also addressed some of the “policy risks” posed by using LibGen, including how regulators might respond to media coverage suggesting Meta had used pirated content. “This could harm our negotiating position with regulators on these matters,” the email said. An April 2023 conversation between Meta researcher Nikolay Bashlykov and AI team member David Esiobu showed Bashlykov admitting that he was “not sure we can use meta IPs to upload through torrents [of] pirated content.”

Other internal documents show the measures Meta took to hide copyright information in LibGen’s training data. A document titled “observations in LibGen-SciMag” shows comments left by employees on how to improve the dataset. One suggestion is to “remove more copyright titles and document identifiers,” which includes any line that contains “ISBN,” “Copyright,” “All rights reserved,” or the copyright symbol. Other notes mention removing more metadata “to avoid potential legal complications,” as well as considering whether to remove a paper’s author list “to reduce liability.”

The document discusses the removal of “copyright titles and document identifiers”.
Screenshot: The Verge

Last June, The New York Times reported on the frenzied race within Meta following ChatGPT’s debut, revealing that the company had hit a wall: it had used nearly every available English-language book, article, and poem it could find online. Desperate for more data, executives reportedly discussed buying Simon & Schuster outright and considered hiring contractors in Africa to compile books without authorization.

In the report, some executives justified their approach by pointing to OpenAI’s “market precedent” of using copyrighted works, while others argued that Google’s 2015 court victory establishing its right to scan books could provide legal cover. “The only thing holding us back from being as good as ChatGPT is literally just the volume of data,” one executive said in a meeting, per The New York Times.

It has been reported that frontier labs like OpenAI and Anthropic have hit a data wall, meaning they don’t have enough new data to train their large language models. Many leaders have denied this; OpenAI CEO Sam Altman stated plainly that “there is no wall.” OpenAI co-founder Ilya Sutskever, who left the company last May to start a new frontier lab, has been more direct about the potential of a data wall. At a major AI conference last month, Sutskever said: “We’ve reached peak data and there won’t be any more. We have to deal with the data we have. There is only one internet.”

This lack of data has led to some strange, novel efforts to obtain unique training material. Bloomberg reported that frontier labs such as OpenAI and Google have paid digital content creators between $1 and $4 per minute for their unused video footage through a third party in order to train LLMs (both companies have competing AI video generation products).

With companies like Meta and OpenAI hoping to ramp up their AI systems as quickly as possible, things are bound to get a little messy. Although a judge partially dismissed Kadrey and Silverman’s lawsuit last year, the evidence described here could strengthen parts of their case as it moves forward in court.
