Publishers vs. AI: The Training Data Copyright Fight

The New York Times' lawsuit against OpenAI and Microsoft, filed in December 2023, was just the opening salvo in what has become a defining legal battle. For a comprehensive view, see our running tracker of AI copyright lawsuits for digital media. Publishers argue that AI companies have built valuable products by consuming copyrighted journalism without permission or payment. AI companies argue that training on published content constitutes fair use. The outcome will shape the future of both industries.

The Publishers' Case

Publishers' legal arguments rest on straightforward copyright principles. News articles are creative works protected by copyright. AI companies copied millions of these articles to train their models. This copying was done without permission and without compensation. Therefore, it constitutes copyright infringement.

The practical stakes go beyond legal theory. Publishers have watched their business models erode over two decades as digital platforms captured advertising revenue and reader attention. AI represents a new and potentially more severe threat - systems that can answer questions and generate content without directing users to sources at all.

If AI companies must license training data, publishers gain a new revenue stream and some leverage over how their content is used. If AI companies can train on published content freely, publishers become suppliers of raw material for products that may ultimately replace them.

"They took our content to build products that compete with us for our audience's attention. That's not fair use. That's appropriation of our investment in journalism for their commercial benefit." - Publisher legal filing, 2024

The AI Defense

AI companies argue that training on published content is transformative use that falls squarely within fair use doctrine. In their account, the models don't store or reproduce the training data; they learn patterns from it, and the output is new content, not copies of the input. This is analogous, they argue, to how humans learn from reading without violating the copyright of what they read.

They also raise practical concerns. If training required a license for every piece of content in the dataset, the administrative burden would be unmanageable: the internet contains billions of pages, and negotiating individual licenses with every copyright holder is not feasible. A ruling against AI companies, they warn, could effectively halt development of large language models.

The technical reality is complex. Models can sometimes reproduce portions of training data verbatim, which undermines pure transformation arguments. But they more often generate novel text that reflects patterns learned from many sources, which supports transformation claims. Courts will have to determine which characterization better captures what AI systems actually do.

The Precedents

Both sides cite precedents supporting their position. Publishers point to cases establishing that commercial copying of copyrighted material requires permission, regardless of the copier's ultimate product. AI companies point to cases allowing copying for purposes like search indexing, where the copy enables new functionality rather than substituting for the original.

The Google Books case (Authors Guild v. Google, decided by the Second Circuit in 2015) is particularly relevant. Google was permitted to scan millions of books and display snippets in search results because the use was transformative - it helped users find books rather than replacing them. AI companies argue their training is analogously transformative. Publishers argue that AI outputs, unlike search snippets, actually compete with original content for reader attention.

No existing precedent directly addresses large language models. Courts will have to reason by analogy from cases involving different technologies and different uses. The outcome is genuinely uncertain.

The Negotiated Future

While litigation proceeds, some publishers have pursued licensing deals. The Associated Press, Axel Springer, and others have agreed to allow AI training in exchange for payment. These deals establish that licensing is possible, which may undermine AI companies' arguments that it is impractical.

But the deals also create tensions within the publishing industry. Publishers who license their content may gain revenue while helping to build systems that ultimately harm the industry. Those who refuse may maintain leverage in litigation while forgoing near-term income. Collective action is difficult when individual incentives diverge.

The most likely outcome may be neither complete victory nor complete defeat for either side, but a negotiated framework that establishes some compensation for publishers while allowing AI development to continue. What that framework looks like - mandatory licensing, collective negotiation, statutory rates - remains to be determined.

Implications for Newsrooms

Whatever the legal outcome, the practical implications for newsrooms are profound. If publishers prevail, they gain some control over how their content is used by AI systems - but they also face pressure to license or be left out of AI training entirely. If AI companies prevail, publishers must adapt to a world where their content can be freely consumed by systems that may reduce demand for the original journalism.

The question of whether to block AI crawlers becomes more urgent in either scenario. And the underlying challenge - how to sustain journalism economically when AI can satisfy many reader needs without directing them to sources - remains regardless of copyright law.
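For newsrooms weighing that choice, the usual mechanism is a robots.txt rule aimed at a specific crawler's user-agent token. The sketch below is illustrative only: it uses Python's standard urllib.robotparser to check what a sample robots.txt would tell a given crawler. The sample rules, the example.com URL, and the crawler_allowed helper are assumptions made for the example, and GPTBot appears only as an example token; each AI vendor documents its own. Compliance with robots.txt is voluntary on the crawler's side, so this checks what a site has requested, not what a crawler actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    story = "https://example.com/news/story"  # placeholder URL
    for agent in ("GPTBot", "SomeOtherBot"):
        verdict = "allowed" if crawler_allowed(ROBOTS_TXT, agent, story) else "blocked"
        print(f"{agent}: {verdict}")
```

A publisher could run the same check against its live robots.txt text to confirm that a rule behaves as intended before relying on it.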

The legal landscape around AI training and copyright remains highly uncertain. Courts in multiple jurisdictions are considering cases that could establish precedents for how copyright law applies to machine learning. The outcomes of these cases will determine not just the obligations of AI companies to publishers but the broader legal framework governing how AI systems interact with copyrighted works across all creative industries. For publishers, the stakes extend beyond immediate licensing revenue to fundamental questions about the value of original journalism in an AI-mediated information ecosystem.

Related Analysis

Every Major AI Copyright Lawsuit Involving Publishers in 2026: A Running Tracker

NYT vs. OpenAI: What the Landmark Copyright Case Means for All Publishers

AI Licensing Deals: Which Publishers Are Selling Content to LLM Companies