Technology Mag

    Publishers Target Common Crawl In Fight Over AI Training Data

    By News Room | June 14, 2024 | 3 Mins Read

    Danish media outlets have demanded that the nonprofit web archive Common Crawl remove copies of their articles from past data sets and stop crawling their websites immediately. This request was issued amid growing outrage over how artificial intelligence companies like OpenAI are using copyrighted materials.

    Common Crawl plans to comply with the request, first issued on Monday. Executive director Rich Skrenta says the organization is “not equipped” to fight media companies and publishers in court.

    The Danish Rights Alliance (DRA), an association representing copyright holders in Denmark, spearheaded the campaign. It made the request on behalf of four media outlets, including Berlingske Media and the daily newspaper Jyllands-Posten. The New York Times made a similar request of Common Crawl last year, prior to filing a lawsuit against OpenAI for using its work without permission. In its complaint, the New York Times highlighted how Common Crawl’s data was the most “highly weighted data set” in GPT-3.

    Thomas Heldrup, the DRA’s head of content protection and enforcement, says that this new effort was inspired by the Times. “Common Crawl is unique in the sense that we’re seeing so many big AI companies using their data,” Heldrup says. He sees its corpus as a threat to media companies attempting to negotiate with AI titans.

    Although Common Crawl has been essential to the development of many text-based generative AI tools, it was not designed with AI in mind. Founded in 2007, the San Francisco–based organization was best known prior to the AI boom for its value as a research tool. “Common Crawl is caught up in this conflict about copyright and generative AI,” says Stefan Baack, a data analyst at the Mozilla Foundation who recently published a report on Common Crawl’s role in AI training. “For many years it was a small niche project that almost nobody knew about.”

    Prior to 2023, Common Crawl did not receive a single request to redact data. Now, in addition to the requests from the New York Times and this group of Danish publishers, it’s also fielding an uptick in requests that have not been made public.

    In addition to this sharp rise in demands to redact data, Common Crawl’s web crawler, CCBot, is also increasingly thwarted from accumulating new data from publishers. According to the AI detection startup Originality AI, which often tracks the use of web crawlers, more than 44 percent of the top global news and media sites block CCBot. Apart from BuzzFeed, which began blocking it in 2018, most of the prominent outlets it analyzed—including Reuters, the Washington Post, and the CBC—spurned the crawler in only the last year. “They’re being blocked more and more,” Baack says.

    Common Crawl’s quick compliance with this kind of request is driven by the realities of keeping a small nonprofit afloat. Compliance does not equate to ideological agreement, though. Skrenta sees this push to remove archival materials from data repositories like Common Crawl as nothing short of an affront to the internet as we know it. “It’s an existential threat,” he says. “They’ll kill the open web.”
