Every time a journalist asks a large language model to summarise a story or suggest a headline, that model is drawing on a vast reservoir of text assembled before it ever launched. Understanding how that reservoir was filled, and who owns what is in it, is now one of the most consequential questions in the news industry.

How generative models are trained

Generative AI models learn by processing enormous quantities of text. During a training run, a model reads billions of documents, adjusting its internal parameters until it can reliably predict what text comes next. In doing so it absorbs the patterns of language: which word tends to follow another, how arguments are structured, what a credible news sentence sounds like. The richer and more varied the text, the more capable the resulting model tends to be. That is why the open web, including the archives of professional publishers, became such an attractive source of training material.
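The idea of learning to predict the next word can be illustrated with a toy sketch. The example below is not how a real model works (production systems use neural networks with billions of parameters, not word counts), and the three-sentence corpus and function names are invented for illustration. It only shows the principle: the more text the program has seen, the better it can guess what comes next.

```python
from collections import Counter, defaultdict


def train_bigram_model(documents):
    """Count, for every word, which words follow it across all documents."""
    follow_counts = defaultdict(Counter)
    for text in documents:
        words = text.lower().split()
        # Pair each word with the word that comes directly after it.
        for current_word, next_word in zip(words, words[1:]):
            follow_counts[current_word][next_word] += 1
    return follow_counts


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical miniature "training set" standing in for billions of documents.
corpus = [
    "the minister said the budget would rise",
    "the minister said the plan was final",
    "the council said the budget would fall",
]

model = train_bigram_model(corpus)
print(predict_next(model, "minister"))  # said
print(predict_next(model, "budget"))    # would
print(predict_next(model, "newsroom"))  # None: the word never appeared in the corpus
```

The last line makes the point that matters for publishers: a model can only reproduce patterns that were present in the text it was given.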

The main datasets used to train frontier models include Common Crawl, a non-profit web archive that has been crawling publicly accessible pages since 2008, as well as curated sets such as WebText, Books1, Books2, and the C4 dataset compiled by Google researchers. Research by the Allen Institute for AI and others has shown that journalism, magazine writing, and long-form non-fiction are heavily represented in these corpora, where they supply grammatically rich, factually grounded text.

Why publishers say this is a problem

For most news organisations, the argument is straightforward: their journalists produced the content, their editors shaped it, and their companies paid for it. Training a commercial AI product on that content, without permission or payment, looks to them like unlicensed reproduction at scale. The legal question turns on how copyright doctrine treats the ingestion of protected text for machine learning purposes, and courts in the United States and Europe are still working through that question.

The most prominent case in the United States was brought by The New York Times, which filed suit against OpenAI and Microsoft in December 2023, alleging that millions of its articles were used to train models without authorisation. The Times argues that the models can reproduce substantial portions of its journalism verbatim in some circumstances, which it says goes beyond any fair use defence. For a fuller picture of how similar disputes have developed across the industry, our running tracker of AI copyright lawsuits covers actions from multiple publishers in different jurisdictions.

The commercial response: licensing deals

Not every publisher has chosen litigation. A parallel track has emerged in which news organisations negotiate licensing agreements that permit AI companies to use their content in exchange for fees. Associated Press, Axel Springer, and News Corp are among the publishers that have signed deals with OpenAI. These arrangements vary in structure: some cover future content for retrieval-augmented generation, others address retroactive use of archives, and many of the specific terms remain confidential.

The existence of these deals does two things at once. It validates the publishers' core argument, that their content has genuine commercial value in the AI pipeline, and it creates a split in the industry between those who see licensing as pragmatic and those who argue it legitimises a practice that should first be adjudicated in court. For a closer look at how these agreements are structured, our analysis of AI licensing deals and what publishers are actually signing breaks down the known terms.

What is actually in the training data

One complication for publishers pursuing legal or commercial remedies is that AI companies have historically disclosed very little about the precise composition of their training sets. OpenAI's GPT-4 technical report, published in March 2023, gave almost no specifics about training data. Meta was more transparent at first: the original LLaMA paper, published in February 2023, listed its sources as Common Crawl, C4, GitHub, Wikipedia, books, ArXiv, and Stack Exchange. The Llama 2 paper, published in July 2023, was far less specific, describing only a mix of data from publicly available sources. Researchers studying these datasets have found that news content, particularly from well-edited outlets, appears frequently, in part because crawlers tend to prioritise widely linked, high-authority domains.
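Where a dataset's list of source URLs is available, one simple way to see how much of it comes from a given publisher is to count hostnames. The sketch below assumes you already have such a list. The sample URLs use reserved example domains and are invented, and the function name is hypothetical. It illustrates the general counting approach, not the method of any particular study.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_shares(urls):
    """Return (hostname, share of sample) pairs, largest share first."""
    hosts = Counter()
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip entries that are not well-formed URLs
        # Treat "www.site" and "site" as the same publisher (Python 3.9+).
        hosts[host.removeprefix("www.")] += 1
    total = sum(hosts.values())
    return [(host, count / total) for host, count in hosts.most_common()]


# Hypothetical sample standing in for a dataset's URL index.
sample_urls = [
    "https://www.news.example/politics/budget-vote",
    "https://news.example/world/summit",
    "https://blog.example/my-post",
    "https://wiki.example/wiki/Budget",
]

for host, share in domain_shares(sample_urls):
    print(f"{host}: {share:.0%}")
```

A tally like this depends on the URL list being disclosed in the first place, which is exactly what publishers often lack.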

The practical implication for editors is this: your publication's archives may well be inside a model you are currently using or evaluating, regardless of any permission your organisation did or did not grant. That is not a theoretical concern but a structural feature of how the first generation of large language models was built.

The downstream consequences for publishers

The training data dispute connects directly to a second concern: if AI systems trained on journalism can now produce journalism-like text, and if readers increasingly encounter that text in AI-generated summaries rather than on publisher websites, the economic model that funds original reporting comes under pressure. Early data on how AI-generated search features are affecting referral traffic is still accumulating, and our piece on Google AI Overviews and publisher traffic covers what is known so far.

The broader stakes are laid out in our pillar piece on the publishers versus AI training data copyright fight, which situates these legal and commercial disputes within the longer history of the web's relationship with news organisations.

What editors should take away

  • Generative models are trained on large web crawls in which professional journalism is well represented, often without explicit consent.
  • The legal framework for machine learning and copyright is unsettled, with major cases still proceeding through courts.
  • Licensing deals offer one path forward, but terms vary widely and many remain undisclosed.
  • Transparency about training data composition remains limited, making it difficult for publishers to quantify the use of their specific content.
  • The training data question and the referral traffic question are linked: both affect the economic sustainability of original reporting.

For newsrooms navigating these decisions, the starting point is clarity about what position your organisation is taking and why, before those decisions are made by default.

Sources

  • The New York Times v. Microsoft Corp. and OpenAI, complaint filed December 2023, U.S. District Court for the Southern District of New York.
  • OpenAI, "GPT-4 Technical Report," March 2023.
  • Touvron et al., "LLaMA: Open and Efficient Foundation Language Models," Meta AI, February 2023.
  • Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models," Meta AI, July 2023.
  • Dodge et al., "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus," Allen Institute for AI, EMNLP 2021.
  • Common Crawl Foundation, commoncrawl.org.
  • Associated Press, Axel Springer, and News Corp licensing announcements with OpenAI (various dates, 2023-2024).