
How Readers Evaluate AI-Assisted Essays — 22% Penalty

Same essay, different label. Foundry10 readers rated identical text 22% less authentic and 29% less ethical when told ChatGPT helped.

Nirmal Thacker, CS, Georgia Tech · Cerebras Systems AI · May 9, 2026 · 13 min read


Foundry10 ran a clean experiment. They handed evaluators the same college-essay paragraph and randomized one detail: a short vignette telling readers either that ChatGPT had helped write it, that an admissions coach had helped, or that the student had written it alone. The text never changed. The label did. The same paragraph was rated 22% less authentic, 29% less ethical, and 16% less competent when readers were told ChatGPT was involved (Foundry10, July 2024).

That is a remarkable finding because the essay's quality was held constant by design. Whatever penalty the AI-labeled paragraph absorbed, it absorbed for the label, not for the writing.

This is the second entry in our How Admissions Readers Evaluate series, after How Readers Evaluate the CASPA AI and Technology Essay. Where that piece was about what strong responses do, this one is about what readers project onto a piece of writing the moment they're told an AI tool touched it. It is a study in attribution, not in prose.

Updated: May 10, 2026. We revise this post as new research emerges.

What the experiment was

The randomized vignette study sits inside Foundry10's broader 2024 white paper on AI in college applications, which surveyed 425 US high-school teachers and 523 US teens (ages 16–18 who applied to college that cycle) via the Sago panel in February–March 2024 (Foundry10, July 2024).

The experiment itself is simple and elegant (a schematic of the assignment logic, in code, follows the list):

  • Evaluators read the same essay paragraph — a personal-statement excerpt held constant across every condition.
  • Each evaluator was randomly assigned one of three vignettes describing how the essay was produced: (1) the applicant used ChatGPT to help, (2) the applicant worked with an admissions coach, or (3) the applicant wrote it alone with no help.
  • Applicant names were also randomized across 24 combinations meant to vary apparent race and gender, so the team could test whether the AI penalty differed for different applicants.
  • Evaluators rated the essay on four dimensions: authenticity, competence, ethics, and likability, using a 5-point scale.
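
For readers who think in code, here is a minimal sketch of the assignment logic, assuming simple uniform randomization. The condition labels, the name grid, and the per-evaluator seeding are illustrative stand-ins, not Foundry10's actual materials:

```python
import random

# Illustrative reconstruction of the design (not Foundry10's materials):
# one fixed essay paragraph, a randomly assigned vignette, a randomly
# assigned applicant name, and four 1-5 ratings per evaluator.
VIGNETTES = ["chatgpt_helped", "coach_helped", "no_help"]  # 3 conditions
NAMES = [f"name_{i}" for i in range(24)]  # 24 apparent race/gender combinations
DIMENSIONS = ["authenticity", "competence", "ethics", "likability"]

def assign_condition(evaluator_id: int) -> dict:
    """Randomly assign one vignette and one applicant name to an evaluator."""
    rng = random.Random(evaluator_id)  # deterministic per evaluator, for the sketch
    return {
        "evaluator": evaluator_id,
        "vignette": rng.choice(VIGNETTES),
        "applicant_name": rng.choice(NAMES),
        # The essay text itself never varies, so any rating difference
        # between conditions is attributable to the labels alone.
    }

# Each evaluator then rates the same paragraph on a 5-point scale
# for each dimension in DIMENSIONS.
```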

Because the text is identical and the only manipulation is the disclosure, any difference in ratings is a measurement of how the attribution moves the rating — not of how the essay itself reads.

A caveat to register up front, because it matters for everything that follows: Foundry10's evaluator pool is a survey panel, not a panel of seasoned admissions officers reading their three-thousandth essay of the cycle. We'll come back to this. The numbers are still worth taking seriously — the experimental design is among the cleanest in the field — but they are an upper-bound proxy for reader perception, not a direct measurement of what a Yale or UMich officer would do.

The numbers

Across all four dimensions, the ChatGPT vignette carried a measurable, statistically significant penalty against an otherwise-identical essay (Foundry10, July 2024). All ChatGPT-vs-other-conditions differences are significant at p < 0.001.

Mean rating (1–5 scale) by vignette:

Dimension      ChatGPT vignette   Coach vignette   No-help vignette
Authenticity   3.09               3.68             3.98
Competence     3.42               3.86             4.04
Ethics         2.86               3.57             4.04
Likability     3.27               3.55             3.65
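
The percentage drops quoted below and throughout this post are simple relative penalties against the no-help mean. A minimal sketch of that arithmetic (small gaps against the quoted figures, like 15.3% here versus the 16% quoted on competence, presumably reflect rounding in the source):

```python
# Relative penalty for each condition: drop versus the no-help mean.
means = {
    "authenticity": {"chatgpt": 3.09, "coach": 3.68, "no_help": 3.98},
    "competence":   {"chatgpt": 3.42, "coach": 3.86, "no_help": 4.04},
    "ethics":       {"chatgpt": 2.86, "coach": 3.57, "no_help": 4.04},
    "likability":   {"chatgpt": 3.27, "coach": 3.55, "no_help": 3.65},
}

for dim, m in means.items():
    for cond in ("chatgpt", "coach"):
        drop = (m["no_help"] - m[cond]) / m["no_help"] * 100
        print(f"{dim:12s} {cond:7s} penalty: {drop:4.1f}%")

# ethics / chatgpt       -> 29.2%  (the "29% less ethical" headline)
# authenticity / chatgpt -> 22.4%  (the "22% less authentic" headline)
```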

A few patterns are worth pulling out:

  • Ethics takes the biggest hit. A 2.86 rating for an essay labeled as ChatGPT-assisted, against 4.04 for the same essay labeled as unaided, is a 29% drop. Readers are not just downgrading the essay — they are forming a moral judgment about the applicant.
  • Authenticity is right behind it. 3.09 vs. 3.98 is a 22% drop on the dimension that personal essays exist to demonstrate. The personal statement is, by genre, supposed to sound like a person. Telling a reader that a chatbot helped breaks that contract before the first sentence is read.
  • Competence drops too — but by less. A 16% drop (3.42 vs. 4.04) is meaningful but smaller, suggesting readers can still see craft in a paragraph even when they're suspicious of how it got there. They just don't credit the applicant with as much of it.
  • Likability barely moves. A 10% drop (3.27 vs. 3.65) is the smallest delta. Readers don't necessarily dislike the AI-assisted applicant; they just don't trust the artifact in front of them.
  • Coaching is also penalized — but less than AI. The coach vignette sits in the middle on every dimension. Authenticity 3.68 vs. the unaided 3.98 is a real penalty (about 8%), but it's a fraction of what the ChatGPT label costs. Help from a human is treated as a different category of assistance than help from a model.

The takeaway is not that the essay was bad. The essay was the same essay every time. The takeaway is that the label changes how readers see the same words.

The bias against AI users does NOT depend on the applicant's race or gender

This is the result that often gets lost. The 24-name randomization across apparent race and gender combinations was designed to test whether the AI penalty fell harder on some applicants than others. It didn't. Applicant name, race, and gender did not moderate the AI-attribution penalty (Foundry10, July 2024).

Read this carefully, because it cuts in two directions:

  • The bias is about the AI label, not about who is using AI. Evaluators didn't selectively forgive AI use for some demographic groups and punish it for others. The ChatGPT vignette tanked ratings uniformly.
  • That is not the same as saying AI use in admissions is demographically neutral. Other research finds that exposure to AI is itself uneven: see the income cliff in college-application AI use, which documents that lower-income students adopt AI for application essays at very different rates than higher-income peers, and that admissions outcomes diverged sharply by SES after ChatGPT (Cornell, Feb 2026).

The Foundry10 result is narrower than the headline often makes it sound. It says: when readers are told an essay used AI, the penalty they apply does not depend on who wrote it. The penalty itself is real — and it is uniform.

Why readers penalize the label

Foundry10 ran the experiment but didn't claim a definitive mechanism. The most plausible explanations are framings, not findings. We're flagging them as such:

1. The personal essay is, by genre, an authenticity contract. The entire point of the personal statement is to give a reader something a transcript can't: a voice, a perspective, a way of sitting with experience. When a reader is told a model helped produce the artifact, the contract feels broken on its face — even if the prose is identical to what the applicant could have written alone. Authenticity takes the second-largest hit in the data (22%), behind only ethics, for a reason.

2. AI assistance reads as substitution, not collaboration. Readers seem to draw a category line between "a coach helped me think through this" (still my essay) and "a model generated this" (not my essay), even when the actual scope of help is similar. The 8% authenticity penalty for the coach vignette is real but small. The 22% penalty for the ChatGPT vignette is qualitatively different. Readers appear to be inferring scale of assistance from the type of helper.

3. The ethics drop carries an implicit fairness frame. Some students pay $79–$349 per hour for application coaching (Foundry10, July 2024). Others can pay $20 a month for ChatGPT Plus. The ethics rating may be picking up not just "you used a tool you shouldn't have" but "you used a tool that not every applicant can or will use, and you didn't tell us." Whether that's a fair inference is a separate question — but it tracks the largest score delta in the data (29%).

4. Disclosure itself triggers suspicion. The vignette did not say the essay was generated entirely by ChatGPT. It said ChatGPT helped. Readers seem to round up — to assume that "helped" implies more than it states. That's important for the practical implications below.

None of these are tested in the Foundry10 paper. They're hypotheses consistent with the data. The empirically grounded claim is the smaller one: the same paragraph was rated significantly worse on every dimension when readers were told ChatGPT was involved.

What this means for the disclosure decision

If you're deciding whether to tell a college you used AI to write your essay, the Foundry10 numbers are evidence that disclosure carries real cost — at least with this proxy population. The penalty isn't just for heavy AI use; it's for the label itself. A vignette that said "ChatGPT helped" was enough to drop authenticity 22% and ethics 29%, regardless of what the underlying essay actually looked like.

This makes the disclosure decision genuinely hard. It is not the case that "honesty always wins" in this experiment — telling readers about AI assistance, even modest assistance, costs you on every dimension they rated.

It is also not a license to hide AI use. Many schools now ask directly, and the AI attestation many applications now include carries its own consequences if you check the wrong box. We've covered the practical decision in Should You Tell Colleges You Used AI?, and the wording question in How to Write an AI Disclosure on a College Application. The Foundry10 data sharpens what's at stake on the perception side: if you do disclose, the penalty isn't theoretical.

The honest synthesis: disclosure is not free, and it should not be cosmetic. If your AI use was minor (light brainstorming, a grammar pass), forced disclosure on a school that didn't ask will likely cost you ratings without any offsetting integrity premium. If your AI use was substantive, disclosure is the right call ethically — and the Foundry10 numbers tell you to be honest about why and how, because readers will fill in the worst case if you're vague.

What this means for actual admissions readers

This is the most important caveat in the post, and it's worth its own section.

Foundry10's evaluators are not real admissions officers. They were a survey panel of US adults recruited via Sago. They almost certainly lacked the experience of reading 3,000 essays per cycle, calibrating against institutional benchmarks, and making admit/deny decisions. What they were doing was rating an essay on Likert scales after reading a vignette. That is not the same task an admissions officer performs.

Real admissions officers might react to AI disclosure in any of several ways that this experiment can't measure:

  • More suspicious than the panel. Officers see thousands of essays. They have stronger priors about what authentic writing looks like and may apply harsher penalties when AI is disclosed.
  • Less suspicious than the panel. Officers are professionals who explicitly weigh context, may have institution-level guidance on how to treat disclosed AI use, and may be more willing to read the actual essay on its merits.
  • Differently suspicious. Officers may be more sensitive to specific markers — generic vocabulary, formulaic structure, the lexical fingerprint covered in the words that make a college essay sound AI-written — and less reactive to the bare disclosure.

No published study has tested AI-disclosure perception with a real admissions-officer panel under randomized conditions. Until one does, the Foundry10 numbers are the best evidence we have, and still a proxy. Treat them as directional, not definitive.

The broader landscape: the three major studies on AI in college admissions we synthesize in our pillar piece all share variants of this caveat — sample-of-one institution, panel proxy, or self-report. The reader-perception side of admissions is genuinely under-studied. Foundry10 is the cleanest signal available, which is also a low bar.

Practical implications for applicants

Three things, in order of confidence:

1. Don't volunteer AI assistance to schools that don't ask. If your application doesn't include an AI-attestation question and you used AI lightly (brainstorming, outline help, grammar pass), the Foundry10 data says volunteering that information will cost you on every rated dimension. Spend the cognitive energy on writing a better essay instead. (For schools that do ask, see the disclosure-decision post and the attestation overview.)

2. If you do disclose, be specific about scope. The vignette in the experiment was vague — "ChatGPT helped." Readers rounded up. If you have to disclose, don't be vague. "I used ChatGPT to brainstorm five potential opening anecdotes; I drafted, revised, and finalized the essay myself" is harder to round into "the model wrote it." Specificity is the only counter to readers projecting the worst case.

3. The strongest essay is one a reader could not project AI onto in the first place. The Foundry10 penalty is a perception penalty. The Cornell lexical-fingerprint work documents what AI essays statistically over-use — abstract prompt-keywords like "challenge," "growth," "journey," "resilience" — and what human essays favor instead (Cornell, Jan 2026). If your essay is full of specific names, places, sensory detail, and dated moments, no reader is going to project ChatGPT onto it, and disclosure (if required) is much less costly.
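
As a rough self-check, not a detector, here is a minimal sketch that measures how much a draft leans on those abstract prompt-keywords. The word list is just the four terms quoted above, and the example sentence is invented for illustration; treat both as assumptions:

```python
import re

# Rough self-check, not the Cornell methodology: how often does a draft
# lean on abstract prompt-keywords instead of concrete specifics?
GENERIC_KEYWORDS = {"challenge", "growth", "journey", "resilience"}  # from the post; extend as needed

def generic_keyword_rate(essay: str) -> float:
    """Return the fraction of words that are abstract prompt-keywords."""
    words = re.findall(r"[a-z']+", essay.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in GENERIC_KEYWORDS)
    return hits / len(words)

# Hypothetical example sentence (invented for illustration):
draft = "My journey of growth taught me resilience in the face of every challenge."
print(f"{generic_keyword_rate(draft):.1%} of words are generic prompt-keywords")
# A high rate is a cue to swap abstractions for names, places, and dated moments.
```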

What we'd avoid recommending: writing worse on purpose to seem "more human." That's self-sabotage, and it's the wrong response to this finding. The Foundry10 penalty is for the label, not for polished writing. Strong, specific, vivid writing is the right answer; deliberately weakened writing is not.

What we'll update next

We're tracking two specific research questions and will revise this post when either resolves:

  1. Has anyone published a randomized AI-disclosure perception study with a real admissions-officer panel? Foundry10's panel is the cleanest design we have right now, but the proxy gap is the single biggest weakness in the evidence base. A real-officer replication — even a small one — would change what we recommend.
  2. Does the AI penalty differ by school tier or essay type? The Foundry10 study used one essay paragraph and one undifferentiated reader pool. We'd expect different magnitudes for highly selective vs. open-enrollment schools, for personal statements vs. supplemental "why us" essays, and for undergraduate vs. graduate applications. None of those interactions have been tested.

If you're a researcher working on either of these, we'd like to read the work — and we'll update this post when we do.


Series: How Admissions Readers Evaluate the CASPA AI and Technology Essay · Pillar: What the Research Says About AI in College Admissions · Companion: The Income Cliff in College-Application AI Use
