We Ran the Same Essay Through 7 AI Detectors - Here's What Happened

Editors Weblog Staff | April 16, 2026 | 2 min read

ai and authenticity

One essay, seven detectors, seven different answers. Our hands-on test reveals the state of AI detection in 2026.

The promise of AI detection tools is simple: paste in text, and the tool tells you whether a human or machine wrote it. The reality is far messier. We submitted the same writing to seven commercial AI detectors and compared results.

We used a 1,200-word essay about photojournalism history, written entirely by a human journalist. We also tested versions generated by GPT-4o and Claude 3.5 Sonnet, plus a hybrid in which AI polished two paragraphs. Each was submitted to all seven tools within 24 hours in April 2026.

The Tools

We tested GPTZero, Originality.ai, Turnitin AI Detection, Copyleaks, Sapling, Writer.com, and Crossplag.

Results: Human-Written Essay

GPTZero: 12% AI (correct). Originality.ai: 34% (uncertain). Turnitin: 8% (correct). Copyleaks: 67% (incorrectly flagged as AI-generated). Sapling: 42% (uncertain). Writer.com: 15% (correct). Crossplag: 29% (mostly human). Only three of the seven tools gave a clearly correct result, and one, Copyleaks, was confidently wrong. This is precisely the scenario NIST warned about.
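To make the size of the disagreement concrete, here is a minimal Python sketch that summarizes the seven scores reported above for the human-written essay. The `summarize` helper and the dictionary name are our own illustrative choices, not part of any detector's API.

```python
import statistics

# Percent-AI scores reported above for the human-written essay.
human_essay_scores = {
    "GPTZero": 12,
    "Originality.ai": 34,
    "Turnitin": 8,
    "Copyleaks": 67,
    "Sapling": 42,
    "Writer.com": 15,
    "Crossplag": 29,
}


def summarize(scores):
    """Return the lowest, highest, spread, and median of the scores."""
    values = list(scores.values())
    return {
        "lowest": min(values),
        "highest": max(values),
        "spread": max(values) - min(values),
        "median": statistics.median(values),
    }


summary = summarize(human_essay_scores)
print(summary)
```

On these numbers the lowest and highest scores are 59 percentage points apart, with a median of 29, which is why a single tool's reading says little on its own.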

Results: AI-Generated Essays

The GPT-4o version was detected consistently (86-99%). The Claude version scored lower across the board (71-91%). This suggests that tools trained on OpenAI outputs are more confident detecting OpenAI text. As new models emerge, this model-specificity becomes a reliability problem.

Results: Hybrid Essay

The hybrid produced the most inconsistent results (22-71%). No tool correctly identified it as mixed-authorship - arguably the most important scenario, since hybrid authorship is the most common real-world use case.

What Explains the Disagreements?

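The tools rest on different underlying approaches: some rely on perplexity and burstiness metrics (roughly, how predictable the text is to a language model and how much that predictability varies), some on neural classifiers, and some on combinations of the two. They are also trained on different data, which gives each a different sensitivity profile. That is why the same text gets different verdicts.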
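As a rough illustration of the burstiness idea, the sketch below scores a text by how much its sentence lengths vary. This is a toy proxy of our own, assuming sentence length stands in for the per-sentence predictability that real tools measure with a language model. It is not how any of the seven detectors works.

```python
import re
import statistics


def sentence_lengths(text):
    """Split on ., ! or ? and return the word count of each sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text):
    """Toy proxy: standard deviation of sentence length divided by the mean.

    Returns 0.0 when there are fewer than two sentences, since variation
    cannot be measured on a single sentence.
    """
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)


uniform = "One two three. Four five six. Seven eight nine."
varied = "Stop. The long sentence here has many more words in it."

print(burstiness(uniform))  # every sentence has three words, so no variation
print(burstiness(varied))   # one short and one long sentence
```

Two tools that both claim to measure burstiness can still disagree, because each picks its own unit of text, its own language model, and its own cutoff.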

Our Recommendation

Treat AI detection as one input, never a verdict. Use at least two tools. Invest in provenance-based verification. Never accuse a writer based solely on detector output - the false positive rates make this ethically indefensible. We will repeat this test quarterly.
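One way to put this recommendation into practice is a triage step that never returns a verdict, only a prompt for human review. The sketch below is a minimal example under stated assumptions: the `triage` function, the 50% flag threshold, and the wording of the outcomes are all hypothetical choices of ours, not settings from any detector.

```python
def triage(scores, flag_at=50.0, min_tools=2):
    """Turn several detector scores into a review prompt, never a verdict.

    scores    -- mapping of tool name to percent-AI score
    flag_at   -- hypothetical threshold at which a score counts as a flag
    min_tools -- minimum number of tools required before triaging at all
    """
    if len(scores) < min_tools:
        return "insufficient: run at least two tools"
    flagged = [name for name, score in scores.items() if score >= flag_at]
    if not flagged:
        return "no flag"
    if len(flagged) == len(scores):
        return "review: all tools flagged; check provenance"
    return "review: tools disagree; check provenance"


# Scores from the human-written essay in our test.
human_essay_scores = {
    "GPTZero": 12, "Originality.ai": 34, "Turnitin": 8, "Copyleaks": 67,
    "Sapling": 42, "Writer.com": 15, "Crossplag": 29,
}

print(triage(human_essay_scores))
```

Applied to our human-written essay, only Copyleaks crosses the assumed threshold, so the outcome is a disagreement to investigate through provenance rather than an accusation.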

