The promise of AI detection tools is simple: paste in text, and the tool tells you whether a human or machine wrote it. The reality is far messier. We submitted the same writing to seven commercial AI detectors and compared results.
Our baseline was a 1,200-word essay about photojournalism history, written entirely by a human journalist. We also tested versions generated by GPT-4o and Claude 3.5 Sonnet, plus a hybrid in which AI polished two paragraphs. Each version was submitted to all seven tools within 24 hours in April 2026.
The Tools
GPTZero, Originality.ai, Turnitin AI Detection, Copyleaks, Sapling, Writer.com, and Crossplag.
Results: Human-Written Essay
GPTZero: 12% AI (correct)
Originality.ai: 34% (uncertain)
Turnitin: 8% (correct)
Copyleaks: 67% (incorrectly flagged as AI-generated)
Sapling: 42% (uncertain)
Writer.com: 15% (correct)
Crossplag: 29% (mostly human)
One tool, Copyleaks, was confidently wrong about a human-written essay. This is precisely the scenario NIST warned about.
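The spread in these scores is easy to summarize in a few lines of Python. This is only a sketch: the scores are the ones reported above, but the 30% and 50% cut-offs and the function names `verdict` and `summarize` are hypothetical, chosen for illustration. They are not thresholds published by any vendor.

```python
from statistics import mean, median

# AI-likelihood scores (percent) reported above for the human-written essay.
human_essay_scores = {
    "GPTZero": 12,
    "Originality.ai": 34,
    "Turnitin": 8,
    "Copyleaks": 67,
    "Sapling": 42,
    "Writer.com": 15,
    "Crossplag": 29,
}

# Hypothetical cut-offs for illustration only; each vendor defines its own.
UNCERTAIN_FROM = 30
AI_FROM = 50


def verdict(score: float) -> str:
    """Map a percentage score to a coarse label using the assumed cut-offs."""
    if score >= AI_FROM:
        return "flagged as AI"
    if score >= UNCERTAIN_FROM:
        return "uncertain"
    return "human"


def summarize(scores: dict[str, float]) -> dict[str, float]:
    """Basic descriptive statistics across all detectors for one text."""
    values = list(scores.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": round(mean(values), 1),
        "median": median(values),
        "spread": max(values) - min(values),
    }


if __name__ == "__main__":
    for tool, score in human_essay_scores.items():
        print(f"{tool}: {score}% -> {verdict(score)}")
    print(summarize(human_essay_scores))
```

The number to watch is the spread. When detectors disagree this much about one unchanged text, no single score can be read as a verdict.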
Results: AI-Generated Essays
The GPT-4o version was detected consistently, with scores of 86-99%. The Claude version scored lower across the board, at 71-91%. This pattern suggests that tools trained mainly on OpenAI outputs are more confident when detecting OpenAI text. As new models emerge, that model-specificity becomes a reliability problem.
Results: Hybrid Essay
The hybrid produced the most inconsistent results, with scores ranging from 22% to 71%. No tool correctly identified it as mixed-authorship. This is arguably the most important scenario, since hybrid authorship is the most common real-world use case.
What Explains the Disagreements?
The tools rely on different underlying approaches: some use perplexity and burstiness metrics, some use neural classifiers, and some combine the two. They are also trained on different data, which gives each tool a different sensitivity profile. That is why the same text gets different verdicts.
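To make the perplexity and burstiness idea concrete, here is a toy sketch. It assumes a deliberately simple stand-in for a language model (word frequencies with add-one smoothing) and defines burstiness as the standard deviation of per-sentence perplexity. Commercial detectors use far larger models and do not publish their exact formulas, so the class `UnigramModel` and the function `burstiness` are illustrative names, not any vendor's method.

```python
import math
import re
from collections import Counter
from statistics import pstdev


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic words (and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


class UnigramModel:
    """Toy stand-in for a language model: add-one smoothed word frequencies."""

    def __init__(self, reference_text: str) -> None:
        tokens = tokenize(reference_text)
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen words

    def prob(self, word: str) -> float:
        return (self.counts[word] + 1) / (self.total + self.vocab_size)

    def perplexity(self, text: str) -> float:
        """exp of the average negative log-probability per token."""
        tokens = tokenize(text)
        if not tokens:
            raise ValueError("text contains no tokens")
        log_prob = sum(math.log(self.prob(t)) for t in tokens)
        return math.exp(-log_prob / len(tokens))


def burstiness(model: UnigramModel, text: str) -> float:
    """Assumed definition: spread of perplexity from sentence to sentence."""
    per_sentence = [
        model.perplexity(s) for s in split_sentences(text) if tokenize(s)
    ]
    return pstdev(per_sentence) if len(per_sentence) > 1 else 0.0


if __name__ == "__main__":
    reference = "The cat sat on the mat. The dog sat on the rug."
    model = UnigramModel(reference)
    sample = "The cat sat on the rug. Quantum zebras improvise."
    print("perplexity:", model.perplexity(sample))
    print("burstiness:", burstiness(model, sample))
```

Metric-based detectors of this kind treat low perplexity and low burstiness as signs of machine text. Two tools that use different underlying models will compute different values for the same essay, which is one route to conflicting verdicts.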
Our Recommendation
Treat AI detection as one input, never a verdict. Use at least two tools. Invest in provenance-based verification. Never accuse a writer based solely on detector output, because the false positive rates make this ethically indefensible. We will repeat this test quarterly.
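For teams that want to write this policy down, the recommendation can be sketched as a small triage function. Everything here is an assumption made for illustration: the name `triage`, the 50% flag threshold, and the 25-point disagreement threshold are hypothetical and should be tuned to your own tolerance. By design, no outcome is an accusation.

```python
# Hypothetical thresholds for illustration; not taken from any vendor.
FLAG_FROM = 50.0   # score (percent) at which one tool counts as "flagging"
MAX_SPREAD = 25.0  # gap between tools above which they count as disagreeing


def triage(scores: dict[str, float]) -> str:
    """Suggest an editor's next step from two or more detector scores."""
    if len(scores) < 2:
        raise ValueError("need scores from at least two detectors")

    values = list(scores.values())
    flagged = [tool for tool, score in scores.items() if score >= FLAG_FROM]
    spread = max(values) - min(values)

    if not flagged and spread <= MAX_SPREAD:
        return "no_action"
    if len(flagged) == len(values):
        # Even unanimous detectors only justify looking for provenance evidence.
        return "review_provenance"
    return "detectors_disagree_review_provenance"


if __name__ == "__main__":
    # Scores reported above for the human-written essay.
    human_essay = {
        "GPTZero": 12, "Originality.ai": 34, "Turnitin": 8, "Copyleaks": 67,
        "Sapling": 42, "Writer.com": 15, "Crossplag": 29,
    }
    print(triage(human_essay))
```

Run on our human-written essay, a rule like this sends the editor to provenance checks instead of a conclusion, which is the right outcome for a text that one tool flagged in error.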