Nearly 400 Newspapers Sue OpenAI and Microsoft Over AI Training Data

Written by Alexa Hill on June 25, 2026 in AI Industry & Policy

The copyright war against generative AI just got significantly larger. Nearly 400 local newspapers have banded together to file a major lawsuit against OpenAI and Microsoft, accusing the tech giants of systematically scraping their content without permission to train artificial intelligence models—and profiting from their work while publishers struggle financially. This coordinated legal action represents the most substantial challenge yet from the news industry, amplifying existing lawsuits from heavyweight publishers like The New York Times and adding considerable weight to arguments that AI companies are built on a foundation of unauthorized content extraction.

The lawsuit, filed by a coalition of local newspapers representing outlets across the United States, names both OpenAI and Microsoft as defendants. The core allegation is straightforward: both companies harvested copyrighted news articles and reporting to train their large language models without compensation or consent. For publishers already grappling with digital disruption and declining ad revenue, the notion that their journalism—the product of expensive reporting, fact-checking, and editorial resources—is being freely consumed by billion-dollar AI companies to build commercially viable products represents an existential threat.

This coordinated action by hundreds of smaller publishers fills a critical gap in the legal landscape. While major outlets like The New York Times have pursued their own high-profile lawsuits, they represent only the top tier of media organizations. Local newspapers—many already operating with skeleton crews and shrinking budgets—lack the legal resources to sue independently. By joining forces, these publishers create a compelling narrative about systemic exploitation across the entire news industry, from major metropolitan papers down to regional and community outlets. The sheer number of plaintiffs makes it harder for AI companies to dismiss the claims as isolated grievances from a few large competitors.

A Growing Legal Minefield for AI Development

The newspaper lawsuit arrives as OpenAI and Microsoft face an expanding array of legal challenges from various industries. The New York Times filed its own landmark suit in late 2023, claiming billions of dollars in damages for copyright infringement. Other major publishers have followed, including Ziff Davis, Merriam-Webster, and Encyclopaedia Britannica. Authors have launched their own class-action suits, arguing that their books were used to train AI without permission or payment. Meanwhile, visual artists and photographers are waging similar battles in the image generation space, challenging tools like Stable Diffusion and Midjourney.

This convergence of lawsuits exposes a fundamental flaw in how AI companies approached data collection during the rapid scaling phase of generative AI development. Many models were trained on indiscriminately scraped internet content, with companies operating under the assumption that training use constituted "fair use" under copyright law. That legal theory now faces serious challenges across multiple fronts. Courts haven't yet definitively ruled on whether using copyrighted material to train commercial AI systems qualifies as fair use, but the mounting legal pressure suggests the existing framework may not be adequate for the AI era.

The stakes extend beyond individual lawsuits. If courts rule against AI companies on a large scale, it could force a fundamental restructuring of how generative AI systems are built. Retraining models with only licensed or properly cleared content would be expensive and time-consuming. It could also create competitive advantages for AI companies that establish licensing agreements early, pricing out smaller competitors that can't afford to pay for training data.

The Complex Dance of Licensing and Power Asymmetries

Interestingly, even as lawsuits escalate, some publishers are simultaneously negotiating licensing deals with AI companies. OpenAI, Microsoft, and Meta have all begun entering into agreements with news organizations to use their content for training, typically with some form of compensation. These deals suggest that companies recognize the legal vulnerability of their scraping practices and are willing to pay for legitimacy. However, the licensing landscape reveals uncomfortable power dynamics within the media industry.

Large publishers with significant legal teams and negotiating leverage can demand favorable terms. The New York Times, for instance, negotiated its own deal with OpenAI while simultaneously suing—a strategy that provides both a financial safeguard and leverage in litigation. Smaller publishers lack that negotiating power. Many would accept licensing deals if offered, but AI companies have little incentive to negotiate with hundreds of regional papers individually when they can focus on high-profile outlets.

Even more problematic, publishers cannot effectively block every crawler that feeds AI systems, and Google's dominance in search creates a catch-22. News organizations need Google Search traffic to survive, so blocking Google's crawler is not a realistic option for most outlets: doing so could hurt their search visibility. That leaves publishers with limited ability to prevent Google from using their content in Google News or its AI-powered products. A robots.txt file can, in theory, turn away specific crawlers, but it offers little protection against a crawler the business depends on.
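For readers unfamiliar with how robots.txt blocking works in practice, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The rules and the `example-news.test` URL are hypothetical and not taken from any publisher in the lawsuit; `GPTBot` is the user-agent token OpenAI documents for its crawler. The sketch only shows how a compliant crawler would read the file. robots.txt is advisory, so nothing here forces a crawler to obey it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative news site (an assumption,
# not any real publisher's file): turn away GPTBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def crawler_access(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Report whether each user agent may fetch `url` under the given rules."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


# Hypothetical article URL on a reserved test domain.
access = crawler_access(
    ROBOTS_TXT,
    ["GPTBot", "Googlebot"],
    "https://example-news.test/local/city-council-vote",
)

for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Under these sample rules the AI crawler is turned away while the search crawler is still allowed, which is the trade-off described above: a publisher can single out a named bot, but cannot extend the same treatment to a search crawler without giving up search traffic.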

This asymmetry reveals how the copyright landscape struggles to keep pace with technology. Traditional copyright law assumed direct distribution and copying—downloading a book, photocopying a newspaper. It didn't anticipate companies training machine learning models on billions of documents, extracting patterns and knowledge without directly "copying" readable content in the traditional sense. The legal system is now racing to catch up, with judges and legislators trying to determine whether using copyrighted material as algorithmic training data deserves the same protections as reproducing the work itself.

The lawsuit from the coalition of nearly 400 newspapers pushes courts to grapple with these questions at scale. Unlike individual cases that might be dismissed as narrow disputes, this coordinated action forces a confrontation with an undeniable reality: AI companies built valuable systems using journalism created by thousands of organizations without compensation. Whether courts find this practice illegal or leave it as merely unethical remains an open question, but either way, the AI industry's days of free-range data scraping appear to be ending.




