
AI Detector False Positive Rates: 2026 Data Compared

Independent 2026 data on AI detector false-positive rates. See which one flags the fewest real essays - and why one false flag can sink an application.

Nirmal Thacker, Founder, GradPilot · CS, Georgia Tech · May 19, 2026 · 11 min read

Quick answer: In independent 2025 University of Chicago Booth research, Pangram had the lowest false-positive rate: essentially zero on medium and long passages, never above about 1% on short ones, and the only detector to meet a strict 0.5% policy cap without losing detection power. GPTZero and Originality.ai stay at or below 1% on medium-to-long passages; Turnitin's self-reported rate of roughly 1% would still flag hundreds of real essays at scale; the open-source RoBERTa baseline fails badly.

If you are an applicant, the false-positive rate is the AI-detector number that should keep you up at night. A false negative means one AI-written essay slips through — a policy nuisance. A false positive means a real student's genuine work gets flagged as machine-written, and the cost lands on an innocent person who often has no way to appeal. This article uses independent data wherever it exists and clearly labels every vendor's self-reported number.

Which AI detector has the lowest false-positive rate?

Here is the cross-detector comparison, split into two parts: false-positive rates on general academic writing, and false-positive rates on non-native (ESL) writing — the second is where international applicants get hurt most. Every cell is tagged [INDEPENDENT] (third-party research) or [FIRST-PARTY] (the vendor's own published claim).

Academic / general-writing false-positive rate

| Detector | False-positive rate | Notes | Source type |
|---|---|---|---|
| Pangram | Essentially zero on long/medium passages; never above ~1% on short passages | Only detector meeting a strict ≤0.5% policy cap without losing detection | [INDEPENDENT], Booth WP 2025-116 |
| Originality.ai | ≤1% on medium-to-long; up to ~2-3% on short | Close second; falls slightly short at the strictest cap | [INDEPENDENT], Booth WP 2025-116 |
| GPTZero | ≤1% on medium-to-long; up to ~2.4% on short | Booth notes "minimizing FPR favors GPTZero" | [INDEPENDENT], Booth WP 2025-116 |
| Turnitin | "Less than 1%" document-level (vendor); ~4% sentence-level per third-party reads | Vendor claim; the sentence-level and ESL gaps get buried | [FIRST-PARTY], turnitin.com |
| Copyleaks | Claims ~0.2% (1 in 500) | Vendor marketing claim, not independently verified | [FIRST-PARTY] |
| RoBERTa (open-source baseline) | ~30-69% (flags most human text) | "Loses functionality"; unsuitable for high-stakes use | [INDEPENDENT], Booth WP 2025-116 |

ESL / non-native-writer false-positive rate

| Source / detector | False-positive rate on non-native writing | Source type |
|---|---|---|
| The 2023 generation of detectors (Liang et al., Patterns) | 61.3% average on non-native TOEFL essays; 97.8% flagged by at least one of 7 detectors | [INDEPENDENT], Liang et al., Patterns, 2023 |
| Pangram (re-tested the same TOEFL set; post-dates the study) | 0.00% on the Liang TOEFL set; ~0.032% overall across four ESL corpora | [FIRST-PARTY], pangram.com, Apr 2025 |
| Turnitin (relayed) | ~1.4% on 300+ word second-language English | [FIRST-PARTY / relayed] |

How to read it: the independent numbers come from third-party research that tested these tools head to head; the first-party numbers are the vendors' own published figures — useful, but not the same as independent verification.

What a false-positive rate is, and why admissions cares most

A false-positive rate (FPR) is the share of genuinely human-written text a detector wrongly labels as AI. In a classroom, a wrongful flag means an uncomfortable conversation; in admissions, it can mean a rejection an applicant never gets to contest, which is why colleges that adopt detection are increasingly asking about FPR before accuracy. We unpack how that adoption actually works in our article "Do colleges actually use AI detectors?"

The harm is asymmetric, so do the Vanderbilt math. When Vanderbilt disabled Turnitin's AI detector in August 2023, it noted that even at Turnitin's claimed 1% false-positive rate, the roughly 75,000 papers it submits in a year would mean about 750 student papers potentially flagged by mistake. One percent sounds tiny until you multiply it by a real volume of submissions. That is why the Booth researchers frame detector choice around an FPR policy cap (for example, "no more than 0.5% of real work may be flagged") rather than a headline accuracy percentage.
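The arithmetic behind that figure is simple enough to rerun for any submission volume. The sketch below is only an illustration: the function name is ours, and it makes the simplifying assumption that every submission is genuinely human-written, so the false-positive rate applies to the whole pool.

```python
def expected_false_flags(human_submissions: int, false_positive_rate: float) -> float:
    """Expected number of genuine submissions wrongly flagged as AI."""
    if not 0.0 <= false_positive_rate <= 1.0:
        raise ValueError("false_positive_rate must be between 0 and 1")
    return human_submissions * false_positive_rate


# Vanderbilt's figures: about 75,000 submissions at a claimed 1% FPR.
at_one_percent = expected_false_flags(75_000, 0.01)
# The same volume under the stricter 0.5% policy cap discussed by Booth.
at_half_percent = expected_false_flags(75_000, 0.005)

print(round(at_one_percent), round(at_half_percent))  # 750 375
```

Halving the rate halves the number of wrongly flagged papers, which is why a cap on FPR is a more direct policy lever than an overall accuracy figure.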

The 2026 data: what independent research found

The strongest independent evidence is the University of Chicago Booth working paper "Artificial Writing and Automated Detection" (Jabarian & Imas, BFI WP 2025-116, August 2025; lay summary in Chicago Booth Review). The team built a corpus of 1,992 pre-2020 human texts plus 1,992 AI texts across multiple genres and lengths, generated by four frontier language models, then stress-tested detectors on short "stub" passages and on text run through a "humanizer" tool.

On false positives, the working paper reports that Pangram achieves a zero false-positive rate on longer passages and essentially zero on medium ones, ticking up only slightly on short passages but never above 1%. GPTZero and Originality.ai both keep their FPR at or below 1% on medium-to-long passages and below 3% on shorter ones. The open-source RoBERTa baseline diverges sharply, flagging roughly 30-69% of human text — a reminder that the free tools students paste essays into are the least trustworthy of all.

Booth's headline conclusion for high-stakes use: Pangram is the only detector that meets a stringent FPR policy cap of 0.5% without compromising its ability to catch AI text, with Originality.ai a close second. This is an independent finding and the strongest defensible claim in the comparison — consistent with Pangram's standing as the default detector now used in college admissions. One caveat we will not bury: this is a working paper, not a settled verdict, and one vendor disputes its ranking — handled in full below.
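To make the idea of a policy cap concrete, here is a minimal sketch of how a cutoff could be chosen so that the false-positive rate on known human text stays at or below the cap, and what that does to recall. This is not the Booth authors' procedure. The scores are made-up integers on a hypothetical 0-1000 scale, and the helper names are ours.

```python
import math
from typing import Sequence


def threshold_for_cap(human_scores: Sequence[int], cap: float) -> int:
    """Return a cutoff such that flagging scores strictly above it keeps the
    false-positive rate on `human_scores` at or below `cap`."""
    if not human_scores:
        raise ValueError("need at least one human score")
    if not 0.0 <= cap < 1.0:
        raise ValueError("cap must be in [0, 1)")
    ordered = sorted(human_scores, reverse=True)
    # Number of human texts the policy allows us to flag (epsilon guards float error).
    allowed = min(math.floor(cap * len(ordered) + 1e-9), len(ordered) - 1)
    # Cutting at the (allowed + 1)-th highest human score flags at most `allowed`.
    return ordered[allowed]


def share_flagged(scores: Sequence[int], cutoff: int) -> float:
    """Fraction of scores strictly above the cutoff."""
    return sum(s > cutoff for s in scores) / len(scores)


# Synthetic data: 200 human texts (one scores unusually high) and 200 AI texts.
human_scores = [2 * i for i in range(199)] + [950]
ai_scores = [300 + 3 * i for i in range(200)]

for cap in (0.05, 0.01, 0.005, 0.0):
    cutoff = threshold_for_cap(human_scores, cap)
    fpr = share_flagged(human_scores, cutoff)
    recall = share_flagged(ai_scores, cutoff)
    print(f"cap={cap:.3f} cutoff={cutoff} FPR={fpr:.3f} recall={recall:.3f}")
```

On this toy data, tightening the cap pushes the cutoff up and recall down, and a cap of zero leaves the detector catching nothing. A detector that "meets the cap without losing detection power" is one whose human and AI scores are separated well enough that this trade-off barely bites.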

Why international applicants get flagged most

The most important false-positive story for applicants is not about averages — it is about who gets flagged. The landmark independent study is Liang et al., "GPT detectors are biased against non-native English writers," in Patterns (Cell), 2023, summarized by Stanford HAI. Running 91 TOEFL essays and 88 native-speaker essays through seven detectors, the researchers found those tools falsely flagged 61.3% of the non-native essays as AI while classifying native-speaker essays nearly perfectly; 97.8% of the TOEFL essays were flagged by at least one detector.
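The "flagged by at least one detector" figure reflects a general effect: the more detectors an essay passes through, the higher the chance that some tool flags it. The sketch below shows that compounding under the assumption that detectors err independently. That assumption is ours and is unlikely to hold exactly in practice, so this illustrates the effect rather than reconstructing Liang et al.'s 97.8%.

```python
from typing import Iterable


def prob_flagged_by_any(false_positive_rates: Iterable[float]) -> float:
    """Chance that at least one detector flags a human text,
    assuming the detectors make errors independently."""
    prob_all_clear = 1.0
    for rate in false_positive_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("each rate must be between 0 and 1")
        prob_all_clear *= 1.0 - rate
    return 1.0 - prob_all_clear


# Seven hypothetical detectors, each with a 1% false-positive rate.
seven_at_one_percent = prob_flagged_by_any([0.01] * 7)
print(f"{seven_at_one_percent:.1%}")  # 6.8%
```

Even tools that each look safe at 1% produce a combined false-flag chance of nearly 7% under this assumption, which is one reason checking an essay against many detectors tends to raise alarm rather than reduce it.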

The mechanism is revealing: most of those 2023 detectors keyed on text "perplexity," and second-language writers tend to use more predictable vocabulary and simpler structure, which reads as machine-like. The false-positive rate even fell from 61.3% to 11.6% when essays were rewritten with more elaborate, AI-style word choice — the tools effectively penalized authentic non-native voice. That is why we cover AI-detection bias against international students as its own topic.
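Perplexity measures how surprised a language model is by a text: the exponential of the average negative log-probability per token. The toy below uses a tiny unigram model with add-one smoothing to show why predictable vocabulary scores low. It is a deliberately simplified stand-in; the detectors in the study relied on neural language models, and the reference corpus and sample sentences here are invented.

```python
import math
from collections import Counter
from typing import Callable, List


def train_unigram(corpus_tokens: List[str]) -> Callable[[str], float]:
    """Build an add-one-smoothed unigram model; unseen words share one slot."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    vocab_size = len(counts) + 1  # +1 for the shared "unseen word" slot

    def prob(token: str) -> float:
        return (counts[token] + 1) / (total + vocab_size)

    return prob


def perplexity(tokens: List[str], prob: Callable[[str], float]) -> float:
    """exp of the mean negative log-probability per token."""
    log_sum = sum(math.log(prob(t)) for t in tokens)
    return math.exp(-log_sum / len(tokens))


reference = (
    "the student wrote the essay and the teacher read the essay "
    "and the student was happy"
).split()
prob = train_unigram(reference)

plain = "the student wrote the essay".split()
ornate = "the pupil composed an intricate treatise".split()

print(f"plain:  {perplexity(plain, prob):.1f}")
print(f"ornate: {perplexity(ornate, prob):.1f}")
```

The plain sentence reuses words the model has seen often, so its perplexity is lower than the ornate one. A detector that treats low perplexity as a sign of machine text will therefore lean toward flagging simpler, more predictable writing.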

Two honest framing notes. The 61.3% is a fact about the category of detectors that existed in 2023; Liang's team did not test Pangram, which launched later, so no one was "vindicated" inside that study. Separately, Pangram's own re-test of that same TOEFL data reports a 0.00% false-positive rate, and roughly 0.032% across four ESL corpora. That is a first-party figure from Pangram (April 2025), not an independent result. The defensible takeaway: independent research established that 2023-era detectors were badly biased against non-native writers, and the one detector claiming to have closed that gap bases the claim on its own test of the exact dataset that exposed it.

Vendors' own claims vs. independent tests

Detector marketing pages are full of impressive numbers, but almost all are self-reported, and a vendor benchmarking its own product is not independent evidence. The first-party claims:

  • Pangram publishes a roughly 0.01% overall false-positive rate (about 1 in 10,000), with academic essays around 0.004% (pangram.com, March 2025) — first-party, but directionally consistent with the independent Booth "essentially zero" finding.
  • GPTZero generally claims around 1%, and on its own re-run of the Booth data claims 0.05% (the dispute is next).
  • Turnitin says "less than 1%" at the document level; third-party reads put its sentence-level rate near 4% and its second-language rate near 1.4%. We cover why admissions teams moved off Turnitin separately.
  • Copyleaks claims about 0.2%.

What you should not trust is the "99.98% accurate" numbers from review-blog roundups for tools like Winston, ZeroGPT, or Sapling — those are low-rigor or vendor-sourced, and accuracy is not the same as false-positive rate. A tool can claim high accuracy while still wrongly flagging a meaningful share of real essays.
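A small worked example shows how the two metrics can diverge. The counts below are hypothetical and chosen only to make the point: when a test set is dominated by AI text, a detector can post a very high accuracy while still wrongly flagging one human essay in ten.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all texts classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of human-written texts wrongly flagged as AI."""
    return fp / (fp + tn)


# Hypothetical test set: 9,000 AI texts (all caught) and 1,000 human texts
# (100 wrongly flagged, 900 correctly cleared).
tp, fn = 9_000, 0
fp, tn = 100, 900

print(f"accuracy = {accuracy(tp, fp, tn, fn):.1%}")      # accuracy = 99.0%
print(f"FPR      = {false_positive_rate(fp, tn):.1%}")   # FPR      = 10.0%
```

Accuracy averages over every text in the test set, so it depends on the mix of human and AI samples. The false-positive rate looks only at human writing, which is the part an applicant cares about.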

The GPTZero–Booth dispute, explained fairly

In January 2026, GPTZero publicly contested the Booth paper. Its objection is technical: GPTZero argues the researchers queried the wrong API field (average_generated_prob instead of class_probabilities), and that on a corrected re-run it shows a 0.05% false-positive rate with 99.3% recall — which it frames as fewer total errors than Pangram.
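To see why the choice of field can matter, consider a response that carries both numbers. The sketch below is hypothetical throughout: only the two field names come from the dispute as described here, while the nested keys, the values, and the 0.5 cutoff are assumptions made for illustration and should not be read as GPTZero's documented response format or as the Booth authors' code.

```python
FLAG_THRESHOLD = 0.5  # hypothetical cutoff, not a documented default

# Hypothetical response fragment. Only the two top-level field names are taken
# from the dispute; the nested keys and every value are invented.
response = {
    "average_generated_prob": 0.62,
    "class_probabilities": {"human": 0.91, "ai": 0.04, "mixed": 0.05},
}


def flagged_by_average(doc: dict) -> bool:
    """Flag using the averaged probability field."""
    return doc["average_generated_prob"] > FLAG_THRESHOLD


def flagged_by_class(doc: dict) -> bool:
    """Flag using the document-level class probability for 'ai'."""
    return doc["class_probabilities"]["ai"] > FLAG_THRESHOLD


print(flagged_by_average(response), flagged_by_class(response))  # True False
```

If two fields can disagree like this on the same document, a benchmark's measured false-positive rate depends on which one the evaluator reads, which is the substance of GPTZero's objection.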

The honest reading: GPTZero's dispute is about its own ranking and about recall (catching AI text), not about Pangram's false-positive rate. Even its contested re-run does not claim a lower FPR than Pangram's near-zero. So on the metric admissions cares about most — false positives — Pangram leads in the independent study, and the dispute does not undercut that; it is a fight over recall. Either way, weigh Booth as a working paper whose ranking one vendor contests, not a closed case.

What this means if you are applying

If you are an applicant — especially an international or non-native English one — three things follow from the data:

  1. Do not paste your essay into free online detectors. The open-source baseline in the Booth study flagged roughly 30-69% of human text, and free tools are the least independently verified; running your own essay through them spikes your anxiety without telling you anything an admissions office would act on.
  2. A flag is evidence, not a verdict. Most institutions that use detection treat a flag as a prompt to look closer, not proof. To gauge when a flag actually warrants worry, read when to actually worry about being flagged and the real stories of students falsely accused.
  3. Know the rule that applies to you. Health-professions applicants face their own version — see false positives in CASPA and health-professions admissions — and you can look up the stance of programs on your list in our AI policy tracker.

How to read these numbers

Two cautions. Dates matter: the Booth corpus used AI text generated by four frontier models as of mid-2025, and rankings shift as new language models and "humanizer" tools appear, so any "best detector" claim should be scoped to the models and date measured. And separate the source types: the cleanest independent signals here are Booth (2025) on academic writing and Liang et al. (2023) on non-native bias; everything a vendor publishes about its own product is a first-party claim that deserves the label.

FAQ

What is the false-positive rate of AI detectors? It varies enormously by tool. In the 2025 Booth study, the best detectors held false-positive rates at or below 1% on medium-to-long academic passages, while an open-source baseline flagged 30-69% of human text. On non-native English writing, the 2023 Liang study found that detectors of that era falsely flagged 61.3% of TOEFL essays on average. There is no single FPR number; it depends on the detector, the passage length, and the writer.

Which AI detector has the lowest false-positive rate? Based on the independent 2025 University of Chicago Booth working paper, Pangram had the lowest: essentially zero on medium and long passages, never above about 1% on short ones, and the only detector meeting a strict 0.5% policy cap without losing detection power. Originality.ai ranked a close second. GPTZero disputes the ranking, but its objection concerns recall, not Pangram's near-zero false-positive rate.

Are AI detectors biased against non-native English speakers? The 2023 detectors were, sharply. Liang et al. found those tools falsely flagged 61.3% of non-native TOEFL essays while scoring native-speaker essays nearly perfectly, because they keyed on predictable, simpler language. Newer detectors report having closed much of that gap, but the bias was real and well documented for the earlier generation.

What should I do if I am falsely flagged as AI? Stay calm — a flag is a prompt to look closer, not a conviction, and most institutions do not act on detector output alone. Keep your drafting evidence (version history, notes, outlines), ask what specific evidence the flag is based on, and read our guidance on when a flag actually warrants worry.


At GradPilot, we read your essay the way an admissions committee does — for specificity, voice, and narrative coherence — instead of running a detector to guess whether it "looks AI." The most reliable defense against a false flag is writing only you could have written, and that is what our AI policy tracker and essay tools help you produce. Detectors can be wrong; the goal is to make your real work unmistakable.
