Technology Mag
    Here’s Proof You Can Train an AI Model Without Slurping Copyrighted Content

    By News Room | March 21, 2024 | 4 Mins Read

    In 2023, OpenAI told the UK parliament that it was “impossible” to train leading AI models without using copyrighted materials. It’s a popular stance in the AI world, where OpenAI and other leading players have used materials slurped up online to train the models powering chatbots and image generators, triggering a wave of lawsuits alleging copyright infringement.

    Two announcements Wednesday offer evidence that large language models can in fact be trained without the permissionless use of copyrighted materials.

    A group of researchers backed by the French government has released what is thought to be the largest AI training dataset composed entirely of text that is in the public domain. And the nonprofit Fairly Trained announced that it has awarded its first certification for a large language model built without copyright infringement, showing that technology like that behind ChatGPT can be built differently from the AI industry’s contentious norm.

    “There’s no fundamental reason why someone couldn’t train an LLM fairly,” says Ed Newton-Rex, CEO of Fairly Trained. He founded the nonprofit in January 2024 after quitting his executive role at image-generation startup Stability AI because he disagreed with its policy of scraping content without permission.

    Fairly Trained offers a certification to companies willing to prove that they’ve trained their AI models on data that they own, have licensed, or that is in the public domain. When the nonprofit launched, some critics pointed out that it hadn’t yet identified a large language model that met those requirements.

    Today, Fairly Trained announced it has certified its first large language model. It’s called KL3M and was developed by Chicago-based legal tech consultancy startup 273 Ventures, using a curated training dataset of legal, financial, and regulatory documents.

    The company’s cofounder, Jillian Bommarito, says the decision to train KL3M in this way stemmed from the company’s “risk-averse” clients like law firms. “They’re concerned about the provenance, and they need to know that output is not based on tainted data,” she says. “We’re not relying on fair use.” The clients were interested in using generative AI for tasks like summarizing legal documents and drafting contracts but didn’t want to get dragged into lawsuits about intellectual property as OpenAI, Stability AI, and others have been.

    Bommarito says that 273 Ventures hadn’t worked on a large language model before but decided to train one as an experiment. “Our test to see if it was even possible,” she says. The company has created its own training dataset, the Kelvin Legal DataPack, which includes thousands of legal documents reviewed to comply with copyright law.

    Although the dataset is tiny (around 350 billion tokens, or units of data) compared to those compiled by OpenAI and others that have scraped the internet en masse, Bommarito says the KL3M model performed far better than expected, something she attributes to how carefully the data had been vetted beforehand. “Having clean, high-quality data may mean that you don’t have to make the model so big,” she says. Curating a dataset can help make a finished AI model specialized to the task it’s designed for. 273 Ventures is now offering spots on a wait list to clients who want to purchase access to this data.

    Clean Sheet

    Companies looking to emulate KL3M may have more help in the future in the form of freely available infringement-free datasets. On Wednesday, researchers released what they claim is the largest available AI dataset for language models composed purely of public domain content. Common Corpus, as it is called, is a collection of text roughly the same size as the data used to train OpenAI’s GPT-3 text generation model and has been posted to the open source AI platform Hugging Face.

    The dataset was built from sources like public domain newspapers digitized by the US Library of Congress and the National Library of France. Pierre-Carl Langlais, project coordinator for Common Corpus, calls it a “big enough corpus to train a state-of-the-art LLM.” In the lingo of big AI, the dataset contains 500 billion tokens. OpenAI’s most capable model is widely believed to have been trained on several trillion tokens.
