Generative AI Training Data: Where It Comes From and Why Publishers Are Fighting

Editors Weblog Staff | June 16, 2026 | 5 min read

An explainer for editors on how large language models are trained, what content feeds them, and why publishers are taking legal and commercial action over the use of their journalism.

Every time a journalist asks a large language model to summarise a story or suggest a headline, that model is drawing on a vast reservoir of text assembled before it ever launched. Understanding how that reservoir was filled, and who owns what is in it, is now one of the most consequential questions in the news industry.

How generative models are trained

Generative AI models learn by processing enormous quantities of text. During a training run, a model reads billions of documents, adjusting its internal parameters until it can reliably predict the next word in a passage. In doing so it absorbs how language works: what word follows another, how arguments are structured, what a credible news sentence sounds like. The richer and more varied the text, the more capable the resulting model tends to be. That is why the open web, including the archives of professional publishers, became such an attractive source of training material.
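To make the idea of predicting the next word concrete, here is a deliberately tiny sketch in Python. It is not how production models work: real systems adjust billions of neural-network parameters, whereas this toy simply counts which word follows which. The three-sentence corpus and the function names are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follower_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            follower_counts[current_word][next_word] += 1
    return follower_counts


def predict_next(model, word):
    """Return the most frequently observed follower of `word`, or None."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical three-sentence "corpus"; real training sets hold billions of documents.
corpus = [
    "the minister said the budget would rise",
    "the minister said the vote was delayed",
    "the editor said the story was ready",
]

model = train_bigram_model(corpus)
print(predict_next(model, "minister"))  # said
print(predict_next(model, "said"))      # the
```

Even at this scale the point carries over: the predictions are only as good as the text the counts were drawn from, which is why well-edited writing is valuable training material.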

The main datasets used to train frontier models include Common Crawl, a non-profit web archive that has been scraping publicly accessible pages since 2008, as well as curated sets such as WebText, Books1, Books2, and the C4 dataset compiled by Google researchers. Investigative work by researchers at the Allen Institute for AI and others has shown that journalism, magazine writing, and long-form non-fiction are disproportionately represented in these corpora because they provide grammatically rich, factually grounded text.

Why publishers say this is a problem

For most news organisations, the argument is straightforward: their journalists produced the content, their editors shaped it, and their companies paid for it. Training a commercial AI product on that content, without permission or payment, looks like unlicensed reproduction at scale. The legal question turns on how copyright doctrine treats the ingestion of protected text for machine learning purposes, and courts in the United States and Europe are still working through that question.

The most prominent case in the United States was brought by The New York Times, which filed suit against OpenAI and Microsoft in December 2023, alleging that millions of its articles were used to train models without authorisation. The Times argued that the models can reproduce substantial portions of its journalism verbatim in some circumstances, which it says goes beyond any fair use defence. For a fuller picture of how similar disputes have developed across the industry, our running tracker of AI copyright lawsuits covers actions from multiple publishers in different jurisdictions.
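For editors who want a feel for what "verbatim" reproduction means in practice, the sketch below finds the longest run of consecutive words shared by an article and a piece of model output. It is an illustration only, not the method used in any court filing, and it uses Python's standard difflib module. The two sample texts and the function name are invented.

```python
from difflib import SequenceMatcher


def longest_shared_passage(article, output):
    """Return the longest run of consecutive words found in both texts."""
    # Lowercase so that capitalisation differences do not break a match.
    article_words = article.lower().split()
    output_words = output.lower().split()
    matcher = SequenceMatcher(None, article_words, output_words, autojunk=False)
    match = matcher.find_longest_match(0, len(article_words), 0, len(output_words))
    return " ".join(article_words[match.a:match.a + match.size])


# Hypothetical texts, written for this example.
article = (
    "The council voted on Tuesday to approve the new transport budget "
    "after months of debate"
)
output = (
    "According to reports, the council voted on Tuesday to approve the "
    "new transport budget following a long dispute"
)

passage = longest_shared_passage(article, output)
print(passage)
print(len(passage.split()), "consecutive words in common")
```

A long shared run suggests copying rather than paraphrase, though how much overlap matters legally is exactly what the courts are being asked to decide.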

The commercial response: licensing deals

Not every publisher has chosen litigation. A parallel track has emerged in which news organisations negotiate licensing agreements that permit AI companies to use their content in exchange for fees. Associated Press, Axel Springer, and News Corp are among the publishers that have signed deals with OpenAI. These arrangements vary in structure: some cover future content for retrieval-augmented generation, others address retroactive use of archives, and many of the specific terms remain confidential.

The existence of these deals does two things at once. It validates the publishers' core argument, that their content has genuine commercial value in the AI pipeline, and it creates a split in the industry between those who see licensing as pragmatic and those who argue it legitimises a practice that should first be adjudicated in court. For a closer look at how these agreements are structured, our analysis of AI licensing deals and what publishers are actually signing breaks down the known terms.

What is actually in the training data

One complication for publishers pursuing legal or commercial remedies is that AI companies have historically disclosed very little about the precise composition of their training sets. OpenAI's GPT-4 technical report, published in March 2023, gave almost no specifics about training data. Meta has at times been more transparent: the original LLaMA paper, published in February 2023, listed its sources as CommonCrawl, C4, GitHub, Wikipedia, books, ArXiv, and StackExchange, although the LLaMA 2 paper that followed in July 2023 described its data only as a new mix from publicly available sources. Researchers studying these datasets have found that news content, particularly from well-edited outlets, appears frequently, in part because crawlers tend to prioritise well-linked, high-authority domains.
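Where a dataset is publicly documented, the basic audit researchers run is simple in principle: take a sample of the URLs in the corpus and count how often each site appears. The following is a minimal sketch of that tally, assuming you already have a list of URLs to hand. The sample URLs and the function name are hypothetical, and a real audit would work on millions of records.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_shares(urls):
    """Return each hostname's share of the sampled URLs, largest first."""
    hosts = []
    for url in urls:
        host = urlparse(url).hostname
        if host is None:
            continue  # skip entries that are not well-formed URLs
        # Treat www.example.com and example.com as the same site.
        if host.startswith("www."):
            host = host[4:]
        hosts.append(host)
    counts = Counter(hosts)
    total = sum(counts.values())
    return [(host, count / total) for host, count in counts.most_common()]


# Hypothetical sample; the .test domains are placeholders, not real sites.
sample = [
    "https://www.example-news.test/politics/budget",
    "https://example-news.test/world/election",
    "https://example-forum.test/thread/42",
    "https://www.example-news.test/business/markets",
    "https://example-blog.test/post/1",
]

for host, share in domain_shares(sample):
    print(f"{host}: {share:.0%}")
```

The same count cannot be run on a model whose training data has not been published, which is the heart of the transparency problem.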

The practical implication for editors is this: your publication's archives may well be inside a model you are currently using or evaluating, regardless of any permission your organisation did or did not grant. That is not a theoretical concern but a structural feature of how the first generation of large language models was built.

The downstream consequences for publishers

The training data dispute connects directly to a second concern: if AI systems trained on journalism can now produce journalism-like text, and if readers increasingly encounter that text in AI-generated summaries rather than on publisher websites, the economic model that funds original reporting comes under pressure. Early data on how AI-generated search features are affecting referral traffic is still accumulating, and our piece on Google AI Overviews and publisher traffic covers what is known so far.

The broader stakes are laid out in our pillar piece on the publishers versus AI training data copyright fight, which situates these legal and commercial disputes within the longer history of the web's relationship with news organisations.

What editors should take away

Generative models are trained on large web crawls in which professional journalism is well represented, often without explicit consent.
The legal framework for machine learning and copyright is unsettled, with major cases still proceeding through courts.
Licensing deals offer one path forward, but terms vary widely and many remain undisclosed.
Transparency about training data composition remains limited, making it difficult for publishers to quantify the use of their specific content.
The training data question and the referral traffic question are linked: both affect the economic sustainability of original reporting.

For newsrooms navigating these decisions, the starting point is clarity about what position your organisation is taking and why, before those decisions are made by default.

Sources

The New York Times v. Microsoft Corp. and OpenAI, complaint filed December 2023, U.S. District Court for the Southern District of New York.
OpenAI, "GPT-4 Technical Report," March 2023.
Touvron et al., "LLaMA: Open and Efficient Foundation Language Models," Meta AI, February 2023.
Touvron et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models," Meta AI, July 2023.
Dodge et al., "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus," Allen Institute for AI, EMNLP 2021.
Common Crawl Foundation, commoncrawl.org.
Associated Press, Axel Springer, and News Corp licensing announcements with OpenAI (various dates, 2023-2024).

Frequently asked questions

What data is used to train generative AI models?

Most frontier models are trained on large web crawls such as Common Crawl, supplemented by curated datasets covering books, Wikipedia, code repositories, and other text. Journalism from established outlets is well represented because crawlers tend to prioritise high-authority domains.

Did AI companies get permission to use publisher content for training?

In most cases, the first generation of large language models was trained on web-crawled data without direct licensing agreements with publishers. That practice is now at the centre of multiple copyright lawsuits, including the case filed by The New York Times against OpenAI and Microsoft in December 2023.

Why are some publishers signing deals instead of suing?

Publishers such as Associated Press, Axel Springer, and News Corp have signed licensing agreements with OpenAI that provide revenue in exchange for access to their content. Some publishers view this as pragmatic, while others argue the practice of ingesting content without consent should be resolved legally before any licensing framework is normalised.

How does the training data dispute connect to referral traffic concerns?

If AI models trained on journalism can generate journalism-like summaries, readers may consume that content inside AI interfaces rather than visiting publisher websites. That reduces referral traffic, which in turn threatens the advertising and subscription revenue that funds original reporting.

Can publishers find out if their content was used in a specific model's training data?

Not easily. AI companies have disclosed very little about the precise composition of their training sets. Researchers can study publicly documented datasets and infer which sources were likely included, but publishers generally cannot obtain definitive confirmation without litigation or contractual transparency requirements.
