Technology Mag


    Inside Meta’s race to beat OpenAI: “We need to learn how to build frontier and win this race”

    By News Room | January 15, 2025 | 5 Mins Read

    A major copyright lawsuit against Meta has revealed a trove of internal communications about the company’s plans to develop its open-source Llama AI models, including discussions about avoiding “media coverage suggesting we have used a dataset we know to be pirated.”

    The messages, which were part of a series of exhibits unsealed by a California court, suggest Meta used copyrighted data when training its AI systems and worked to conceal it — as it raced to beat rivals like OpenAI and Mistral. Portions of the messages were first revealed last week.

    In an October 2023 email to Meta AI researcher Hugo Touvron, Ahmad Al-Dahle, Meta’s vice president of generative AI, wrote that the company’s goal “needs to be GPT4,” referring to the large language model OpenAI announced in March of 2023. Meta had “to learn how to build frontier and win this race,” Al-Dahle added. Those plans apparently involved using the book piracy site Library Genesis (LibGen) to train Meta’s AI systems.

    An undated email from Meta director of product Sony Theakanath, sent to VP of AI research Joelle Pineau, weighed whether to use LibGen internally only, for benchmarks included in a blog post, or to create a model trained on the site. In the email, Theakanath writes that “GenAI has been approved to use LibGen for Llama3… with a number of agreed upon mitigations” after escalating it to “MZ” (presumably Meta CEO Mark Zuckerberg). As noted in the email, Theakanath believed “Libgen is essential to meet SOTA [state-of-the-art] numbers,” adding “it is known that OpenAI and Mistral are using the library for their models (through word of mouth).” Mistral and OpenAI haven’t stated whether or not they use LibGen. (The Verge reached out to both for more information.)

    Meta’s Theakanath writes that LibGen is “essential” to reaching “SOTA numbers across all categories.”
    Screenshot: The Verge

    The court documents stem from a class action lawsuit that author Richard Kadrey, comedian Sarah Silverman, and others filed against Meta, accusing it of using illegally obtained copyrighted content to train its AI models in violation of intellectual property laws. Meta, like other AI companies, has argued that using copyrighted material in training data should constitute legal fair use. The Verge reached out to Meta with a request for comment but didn’t immediately hear back.

    Some of the “mitigations” for using LibGen included stipulations that Meta must “remove data clearly marked as pirated/stolen,” while avoiding externally citing “the use of any training data” from the site. Theakanath’s email also said the company would need to “red team” the company’s models “for bioweapons and CBRNE [Chemical, Biological, Radiological, Nuclear, and Explosives]” risks.

    The email also went over some of the “policy risks” posed by the use of LibGen, including how regulators might respond to media coverage suggesting that Meta used pirated content. “This may undermine our negotiating position with regulators on these issues,” the email said. An April 2023 conversation between Meta researcher Nikolay Bashlykov and AI team member David Esiobu also showed Bashlykov admitting he’s “not sure we can use meta’s IPs to load through torrents [of] pirate content.”

    Other internal documents show the measures Meta took to obscure the copyright information in LibGen’s training data. A document titled “observations on LibGen-SciMag” shows comments left by employees about how to improve the dataset. One suggestion is to “remove more copyright headers and document identifiers,” which includes any lines containing “ISBN,” “Copyright,” “All rights reserved,” or the copyright symbol. Other notes mention taking out more metadata “to avoid potential legal complications,” as well as considering whether to remove a paper’s list of authors “to reduce liability.”

    The document discusses removing “copyright headers and document identifiers.”
    Screenshot: The Verge

    Last June, The New York Times reported on the frantic race inside Meta after ChatGPT’s debut, revealing the company had hit a wall: it had used up almost every available English book, article, and poem it could find online. Desperate for more data, executives reportedly discussed buying Simon & Schuster outright and considered hiring contractors in Africa to summarize books without permission.

    In the report, some executives justified their approach by pointing to OpenAI’s “market precedent” of using copyrighted works, while others argued Google’s 2015 court victory establishing its right to scan books could provide legal cover. “The only thing holding us back from being as good as ChatGPT is literally just data volume,” one executive said in a meeting, per The New York Times.

    It’s been reported that frontier labs like OpenAI and Anthropic have hit a data wall, which means they don’t have sufficient new data to train their large language models. Many leaders have denied this; OpenAI CEO Sam Altman said plainly: “There is no wall.” OpenAI cofounder Ilya Sutskever, who left the company last May to start a new frontier lab, has been more straightforward about the potential of a data wall. At a premier AI conference last month, Sutskever said: “We’ve achieved peak data and there’ll be no more. We have to deal with the data that we have. There’s only one internet.”

    This data scarcity has led to a whole lot of weird, new ways to get unique data. Bloomberg reported that frontier labs like OpenAI and Google have been paying digital content creators between $1 and $4 per minute for their unused video footage through a third party in order to train LLMs (both of those companies have competing AI video-generation products).

    With companies like Meta and OpenAI hoping to grow their AI systems as fast as possible, things are bound to get a bit messy. Though a judge partially dismissed Kadrey and Silverman’s class action lawsuit last year, the evidence outlined here could strengthen parts of their case as it moves forward in court.

    © 2025 Technology Mag. All Rights Reserved.
