Technology Mag

    Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft

    By News Room | December 11, 2024 | 3 Mins Read

    Harvard University announced Thursday it’s releasing a high-quality dataset of nearly one million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright.

    Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta’s Llama, the Institutional Data Initiative’s database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly refined and curated content repositories that normally only established tech giants have the resources to assemble. “It’s gone through rigorous review,” he says.

    Leppert believes the new public domain database could be used in conjunction with other licensed materials to build artificial intelligence models. “I think about it a bit like the way that Linux has become a foundational operating system for so much of the world,” he says, noting that companies would still need to use additional training data to differentiate their models from those of their competitors.

    Burton Davis, Microsoft’s vice president and deputy general counsel for intellectual property, emphasized that the company’s support for the project was in line with its broader beliefs about the value of creating “pools of accessible data” for AI startups to use that are “managed in the public’s interest.” In other words, Microsoft isn’t necessarily planning to swap out all of the AI training data it has used in its own models with public domain alternatives like the books in the new Harvard database. “We use publicly available data for the purposes of training our models,” Davis says.

    As dozens of lawsuits filed over the use of copyrighted data for training AI wind their way through the courts, the future of how artificial intelligence tools are built hangs in the balance. If AI companies win their cases, they’ll be able to keep scraping the internet without needing to enter into licensing agreements with copyright holders. But if they lose, AI companies could be forced to overhaul how their models get made. A wave of projects like the Harvard database is plowing forward under the assumption that, no matter what happens, there will be an appetite for public domain datasets.

    In addition to the trove of books, the Institutional Data Initiative is also working with the Boston Public Library to scan millions of articles from different newspapers now in the public domain, and it says it’s open to forming similar collaborations down the line. The exact way the books dataset will be released is not settled. The Institutional Data Initiative has asked Google to work together on public distribution, but the search giant hasn’t publicly agreed to host it yet, though Harvard says it’s optimistic it will. (Google did not respond to WIRED’s requests for comment.)
