The vocabulary of AI is arriving in newsrooms faster than most style guides can keep up. Editors are expected to make editorial, legal, and ethical decisions about content that involves concepts drawn from machine learning, digital forensics, intellectual property law, and international regulation, often in the same afternoon. This glossary is our attempt to give every editor a shared working language. We have grouped the terms thematically rather than alphabetically, because the concepts build on one another.

Provenance and Authentication

Content provenance refers to the verifiable record of where a piece of content came from, who created it, and what changes it has undergone. In a newsroom context, provenance is the chain of custody for an image, video, or text. Without it, we have no reliable basis for asserting that a photograph is what we claim it is. For a fuller treatment, see our complete guide to content provenance and C2PA.

C2PA (Coalition for Content Provenance and Authenticity) is a cross-industry technical standards body whose specification allows creators and publishers to attach cryptographically signed metadata to media files. A C2PA manifest travels with the file and records the tool used, the identity of the creator, and any edits applied. Signatories to the specification include major camera manufacturers, social platforms, and news organisations.

Content credentials are the human-readable label attached to a file that has been signed under the C2PA specification. They are the visible expression of what the underlying cryptographic manifest says about the content's origin.

Watermarking in the AI context refers to embedding an imperceptible signal into generated text, images, audio, or video that identifies the content as machine-produced. Watermarking differs from a visible byline or label: it is designed to survive screenshots, reformatting, and light editing. No watermarking system currently guarantees permanence, which is why it is treated as one layer of a broader provenance strategy rather than a complete solution.

Metadata is structured data embedded in a file that describes its properties. For images, metadata (often stored as EXIF or IPTC data) can record camera model, GPS coordinates, timestamp, and copyright notice. Stripping metadata, intentionally or through a platform's automatic compression, destroys a key layer of provenance.

Synthetic and Manipulated Media

Deepfake is a piece of video, audio, or imagery in which a person's likeness or voice has been synthesised or replaced using deep learning, typically without their consent. The term is now widely used in legislation and newsroom policy documents. Detection remains an active area of research; our buyer's guide to deepfake detection tools surveys what is currently available to editorial teams.

Synthetic media is the broader category: any audio, image, video, or text generated or substantially altered by an AI system. A deepfake is a specific type of synthetic media. The term matters editorially because not all synthetic media is malicious, but all of it requires labelling decisions. We explore those decisions in detail in our piece on synthetic media ethics and editorial guidelines.

Generative AI refers to AI models that produce new content (text, images, audio, video, code) rather than simply classifying or analysing existing content. Large language models and image diffusion models are the most commonly encountered examples in a newsroom setting.

Hallucination is the tendency of a large language model to produce plausible-sounding but factually incorrect statements. For editors, hallucination is not a minor quirk but a direct threat to accuracy: an AI-assisted draft may cite a study that does not exist or attribute a quote to someone who never said it.

Detection and Verification

AI content detection refers to software that attempts to identify text or media as machine-generated rather than human-produced. Detection tools analyse statistical patterns, linguistic features, or visual artefacts. Their accuracy varies significantly and no tool should be treated as definitive. For an explanation of the underlying technology, see our article on how AI content detection works.

Liveness detection is a specific technique used to determine, in real time, that a video stream shows an actual person rather than a recording or a synthetic face. It is increasingly relevant for newsrooms conducting remote interviews or source verification.

Digital forensics in an editorial context means the technical examination of a file to assess its authenticity: looking at compression artefacts, lighting inconsistencies, metadata anomalies, or model-specific signatures left by image generators.

Regulation and Rights

GPAI (General Purpose AI) is the term used in the EU AI Act to describe AI models trained on broad data that can perform a wide range of tasks. Publishers need to understand GPAI because the Act places transparency obligations on providers of GPAI models, including requirements to disclose training data summaries. The category includes the large language models used in newsroom tools.

Training data is the corpus of text, images, or other content used to train an AI model. The question of if publishers' archives can be used as training data without licence or compensation is now a live legal dispute in multiple jurisdictions. Our analysis of the publishers versus AI training data copyright fight covers the key cases.

Opt-out refers to the mechanisms, such as the Robots Exclusion Protocol or rights-reservation language, that publishers can use to signal that their content should not be scraped for AI training. The legal enforceability of opt-outs remains contested.

RAG (Retrieval-Augmented Generation) is a technique in which a language model is connected to an external database or live web index so that it draws on current, sourced documents when generating a response. RAG reduces but does not eliminate hallucination, and it introduces questions about which sources the model retrieves and how it weights them.

Editorial Responsibility Terms

Disclosure in an AI context means informing readers when AI tools have played a material role in producing content. Disclosure norms vary between outlets and no single industry standard has been adopted, though bodies including the Reuters Institute have called for clear labelling.

Human oversight is the principle that a qualified editorial decision-maker must review AI-assisted content before publication. It is both an ethical commitment and, under frameworks like the EU AI Act, a regulatory expectation for certain high-risk applications.

Adversarial input (also called prompt injection) refers to malicious instructions embedded in content that is fed to an AI system, designed to manipulate the model's output. Editors whose workflows route external content through AI summarisation tools should be aware that adversarial inputs could alter what the tool reports.

Fluency with these terms will not resolve the hard editorial questions, but it does allow us to ask them precisely, debate them with colleagues, and engage with the regulators and technologists who are shaping the environment our newsrooms operate in.

Sources

  • Coalition for Content Provenance and Authenticity (C2PA), Technical Specification, c2pa.org
  • European Parliament, EU AI Act (Regulation 2024/1689), Official Journal of the European Union
  • Reuters Institute for the Study of Journalism, annual Digital News Reports, reutersinstitute.politics.ox.ac.uk
  • Internet Engineering Task Force, Robots Exclusion Protocol (RFC 9309)
  • IPTC Photo Metadata Standard, iptc.org