AI in College Admissions Research: 3 Major Studies
3 studies on AI in college admissions: detection hits F1 = 0.998, the SES admit gap widened 31% post-ChatGPT, and essays readers believe were ChatGPT-assisted lose 22% of perceived authenticity.
Three large studies from 2024 to 2026 — two from the same Cornell team, one from the nonprofit Foundry10 — have moved the conversation about AI in college admissions out of speculation and into evidence. They report that classifiers can separate human from LLM-written essays at F1 = 0.998 (Cornell, Jan 2026), that the admit-rate gap between higher- and lower-SES applicants widened by 31% after ChatGPT's release (Cornell, Feb 2026), and that when readers are told an applicant used ChatGPT, perceived authenticity drops from 3.98 to 3.09 on a 5-point scale (Foundry10, July 2024).
This post is the literature review. Not "do colleges check for AI" — for that, see the Turnitin truth and our 174-school policy classification. Here we ask the narrower, more useful question: what does the actual research say, where does it agree, where does it disagree, and what does it not yet know?
Updated: May 10, 2026. We revise this post as new research emerges.
TL;DR — three findings worth carrying around
- The detector ceiling is much higher than the field assumed. A T5 classifier trained on 30,000 real Common App essays paired with 87,696 synthetic LLM essays separates the two at F1 = 0.998. A simple TF-IDF baseline gets 0.999. (Cornell, Jan 2026)
- The SES admit gap widened, not narrowed, after ChatGPT. At one selective engineering school, the higher-SES vs. lower-SES admit gap went from 10.7 percentage points pre-ChatGPT to 14.0 pp post-ChatGPT — a 31% widening — even though lower-SES applicants used LLMs 28% more. (Cornell, Feb 2026)
- Telling a reader "ChatGPT helped" costs about 22% of perceived authenticity. In a controlled vignette experiment, the same essay paragraph was rated 3.98 (no help) vs. 3.68 (admissions coach) vs. 3.09 (ChatGPT) for authenticity, with similar drops for competence and ethics. (Foundry10, July 2024)
Each of these findings has caveats, and we'll get to them. None should be read as a final verdict on AI's role in admissions, but together they are the strongest empirical evidence we currently have.
What each paper actually studied
Paper 1: Cornell, January 2026 — "Poor Alignment and Steerability of LLMs"
Lee, Alvero, Joachims, and Kizilcec at Cornell paired 30,000 real Common App essays from one selective US engineering school (2019–2023 cycles) with 87,696 synthetic essays generated by 8 frontier LLMs — GPT-4o, GPT-4o-mini, Mistral Large, Mistral Nemo, Claude 3.5 Sonnet, Claude Haiku, Llama 3.1 70B, and Llama 3 8B.
Their headline finding: classifiers trained on the paired corpus separated human from LLM essays at F1 = 0.998 (T5) and F1 = 0.999 (TF-IDF). Even more striking, identity prompting failed to help LLMs sound more human — a classifier trained to separate identity-prompted vs. default LLM output dropped to F1 = 0.816, meaning the demographic prompts barely shifted the writing. For one subgroup, identity prompting actively reduced alignment with real student writing (the Black-applicant prompt drift was significant at t = 2.327, p = 0.020).
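For readers who want the shape of that baseline, here is a minimal sketch of a TF-IDF detector in Python with scikit-learn. The corpus loader, split, and hyperparameters are our illustrative assumptions, not the paper's pipeline:

```python
# Minimal sketch of a TF-IDF + logistic-regression detector in the
# style of Paper 1's baseline. The corpus loader, split, and
# hyperparameters are illustrative assumptions, not the paper's code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

essays, labels = load_essay_corpus()  # hypothetical loader; labels: 1 = LLM, 0 = human

X_train, X_test, y_train, y_test = train_test_split(
    essays, labels, test_size=0.2, stratify=labels, random_state=0
)

vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

preds = clf.predict(vectorizer.transform(X_test))
print(f"F1 = {f1_score(y_test, preds):.3f}")  # Paper 1 reports 0.999 for this family
```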
The mechanism: LLMs cluster tightly around abstract prompt-keyword vocabulary ("challenge," "growth," "journey," "resilience"), while human applicants reach for temporal and personal words ("year," "time," "friend," "would"). LLM-to-LLM cosine similarity sits at 0.952–0.957; human-to-human at 0.882–0.889. The machines are simply more uniform than the people. For a deeper read on which words matter most, see the lexical fingerprint study breakdown. For why "I'm a first-gen Latina" does not fix the problem, see why identity prompting fails.
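The uniformity claim is easy to operationalize: average pairwise cosine similarity within each group. A sketch, assuming TF-IDF vectors as the representation (the paper's exact embedding may differ) and hypothetical human_essays / llm_essays corpora:

```python
# Sketch of the uniformity comparison: mean pairwise cosine similarity
# within a group of essays. TF-IDF is our assumed representation;
# human_essays and llm_essays are hypothetical corpora.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer().fit(human_essays + llm_essays)  # fit once, on the pooled corpus

def mean_pairwise_similarity(texts):
    sims = cosine_similarity(vectorizer.transform(texts))
    upper = sims[np.triu_indices_from(sims, k=1)]  # upper triangle, no self-pairs
    return float(upper.mean())

# Paper 1's numbers: ~0.95 LLM-to-LLM vs. ~0.88 human-to-human.
print(mean_pairwise_similarity(llm_essays))
print(mean_pairwise_similarity(human_essays))
```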
Paper 2: Cornell, February 2026 — "The Digital Divide in Generative AI"
The same team plus Borchers (arXiv 2602.17791) extended the dataset by one cycle and reframed the question. They analyzed 81,663 de-identified Common App essays from the same engineering school (2019–2024), used fee-waiver status as a proxy for SES, and modeled LLM use as a continuous proportion (α̂) rather than a binary did-they-or-didn't-they.
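To build intuition for what a continuous α̂ means, here is a toy estimator: the maximum-likelihood mixture weight between a human word distribution and an LLM word distribution. This is our simplification for illustration, not the authors' method:

```python
# Toy illustration of a continuous LLM-use estimate (alpha-hat) as a
# maximum-likelihood mixture weight between two unigram distributions.
# This is a simplification for intuition, NOT the Cornell estimator.
import numpy as np
from scipy.optimize import minimize_scalar

def estimate_alpha(doc_tokens, p_human, p_llm, eps=1e-9):
    """MLE of alpha in  P(w) = alpha * p_llm(w) + (1 - alpha) * p_human(w)."""
    ph = np.array([p_human.get(w, eps) for w in doc_tokens])
    pl = np.array([p_llm.get(w, eps) for w in doc_tokens])

    def neg_log_lik(alpha):
        return -np.log(alpha * pl + (1.0 - alpha) * ph).sum()

    res = minimize_scalar(neg_log_lik, bounds=(0.0, 1.0), method="bounded")
    return res.x

# p_human / p_llm: unigram frequencies fit on reference corpora of
# known-human and known-LLM essays (hypothetical inputs here).
```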
The findings everyone cites:
- Pre-ChatGPT admit gap: higher-SES 23.6% vs. lower-SES 12.9% — a 10.7 percentage-point gap.
- Post-ChatGPT admit gap: higher-SES 26.3% vs. lower-SES 12.3% — a 14.0 pp gap. The gap widened by 31% (arithmetic reproduced in the sketch after this list).
- Lower-SES applicants used LLMs more, not less. Mean estimated α̂ was 0.102 (lower SES) vs. 0.080 (higher SES) — a 28% relative difference (p < 0.001). 22.7% of lower-SES applicants vs. 18.7% of higher-SES were in the "high use" tier (α̂ > 0.13).
- The penalty per unit of LLM use was 1.85× larger for lower-SES applicants. Higher-SES odds of admission dropped 60% per unit α̂ (OR = 0.40); lower-SES dropped 86% (OR = 0.14).
- Linguistic convergence: by 2024, the lexical features that previously distinguished admitted from rejected essays had statistically converged — 95% confidence intervals overlap.
- Only about 7% of the SES admission penalty is explained by surface stylometric features. Whatever readers (or downstream signals) are picking up on, it is not just word choice.
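The headline percentages above are plain arithmetic on the reported figures. Here is the check promised in the list, using reported numbers only:

```python
# Reproducing Paper 2's headline arithmetic from the reported figures.
pre_gap  = 23.6 - 12.9           # 10.7 percentage points
post_gap = 26.3 - 12.3           # 14.0 percentage points
print(f"{post_gap / pre_gap - 1:.0%}")  # ~31% widening

# Odds ratios per unit alpha-hat (OR < 1 = admission odds fall):
print(f"higher-SES penalty: {1 - 0.40:.0%} per unit")  # 60%
print(f"lower-SES penalty:  {1 - 0.14:.0%} per unit")  # 86%
```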
For what this means in practical terms — which income brackets adopted AI and why — see the income-cliff deep dive.
Paper 3: Foundry10, July 2024 — "Navigating College Applications with AI"
The Foundry10 team (Rubin, Lombard, Chen, Divanji) ran a survey of 425 US high-school teachers and 523 US teens aged 16–18 who had applied to college that cycle, fielded Feb–Mar 2024 through Sago's national panel.
The behavioral baseline:
- 67% of high-school students have tried text-generation AI tools; 35% never use AI for schoolwork at all.
- 30% of student applicants used generative AI on their personal application essays. Among those who did: 50% for brainstorming, 47% for outlines, 33% for phrasing, 32% for first drafts, 20% for final drafts, 10% for translation.
- Income matters a lot. Only 22% of students from households under $50K used AI on their essays vs. 40% of those from $75K–$100K, nearly double the rate. The "income cliff" sits in the lower-middle, not at the very bottom. For why this matters, see the income-cliff post.
- 31% of teachers used AI to write college recommendation letters. 53% of those teachers said reducing stress was a primary motivator. See our writeup on the teacher rec-letter pattern.
Then the experimental piece — the one most relevant to admissions readers. Foundry10 showed respondents the same essay paragraph attached to one of three vignettes about who helped the applicant write it: ChatGPT, an admissions coach, or no one. Ratings on a 5-point scale:
| Trait | ChatGPT | Coach | No help |
|---|---|---|---|
| Authenticity | 3.09 | 3.68 | 3.98 |
| Competence | 3.42 | 3.86 | 4.04 |
| Ethics | 2.86 | 3.57 | 4.04 |
| Likability | 3.27 | 3.55 | 3.65 |
Every ChatGPT-vs-others difference was significant at p < 0.001. Notably, applicant name, race, and gender did not moderate the AI penalty — the perception cost did not vary by demographic group. For the full reader-perception breakdown, see how admissions readers evaluate AI-assisted essays.
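The "22% authenticity cost" in the TL;DR is just the relative drop on the table's first row; the same computation on each trait's reported means:

```python
# Relative cost of the "ChatGPT helped" frame, from the table's means.
means = {  # trait: (ChatGPT, no help)
    "authenticity": (3.09, 3.98),
    "competence":   (3.42, 4.04),
    "ethics":       (2.86, 4.04),
    "likability":   (3.27, 3.65),
}
for trait, (gpt, no_help) in means.items():
    print(f"{trait}: {(no_help - gpt) / no_help:.0%} drop")
# authenticity 22%, competence 15%, ethics 29%, likability 10%
```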
Where the studies agree
Across very different methods, four findings converge.
1. AI use is widespread and growing. Foundry10's self-report puts adoption at 30% on the personal essay. Paper 2's continuous estimate suggests a median post-ChatGPT α̂ of 0.043 overall (and 0.092 among detected-use applicants), with the share of essays in the "high use" tier rising sharply. Both point to a population where assuming "almost no one uses AI" is no longer credible.
2. The technical gap between human and LLM writing is large enough to detect — at least under controlled conditions. Paper 1's F1 = 0.998 is the strongest detection signal we have on real essays. Paper 2's lexical convergence finding suggests the gap between admitted and rejected essays is shrinking at the same time. Both findings point to LLMs producing a recognizable, narrow band of prose.
3. Identity does not rescue authenticity. Paper 1 shows that prompting an LLM with demographic identity does not produce text that matches the writing of real applicants from that demographic — and in some cases makes it worse. Foundry10's vignette experiment shows the "ChatGPT helped" frame depresses ratings regardless of the applicant's name, race, or gender. Both studies push back against the intuition that the "right" identity prompt or the "right" applicant profile would erase the cost.
4. The intent-vs-effect gap is real. Foundry10 finds that the most common reasons students don't use AI are preferring human-generated work (57%), ethical concerns (45%), and accuracy concerns (39%) — yet a third of applicants use it anyway. Paper 2 finds that the applicants whose AI use most plausibly reflected scarcity of other support (lower-SES, higher α̂) were the ones who paid the largest admission penalty. The gap between why people reach for AI and what AI ends up costing them is wide.
Where they disagree — or look at it differently
Foundry10 leans optimistic about democratization. Paper 2 doesn't. Foundry10 frames AI as a possible counterweight to the inequities of paid admissions coaching, citing coaching costs of $79–$349 per hour that lower-income families can't access. The implicit hope: if a $20/month chatbot can do some of what a $300/hour coach does, the playing field flattens. Paper 2 is the empirical refutation. Lower-SES applicants used AI more — and saw the admit gap widen, not narrow. The democratization story is appealing; the data so far doesn't support it.
The "real-world" detection ceiling is contested. Paper 1's F1 = 0.998 is on one-shot prompted LLM essays — the LLM is asked once for an essay and the output is used directly. Real students iteratively co-write: they generate an outline, paste in their draft, ask for a smoother transition, hand-edit, ask again. That workflow likely produces text much closer to human writing than the one-shot baseline, which is why we don't claim "detectors really work at 99.8% accuracy in the wild." See our detector reality post for what classroom and admissions detectors actually deliver.
Adoption rates depend on how you ask. Foundry10's 30% is self-report from a national survey. Paper 2's continuous α̂ is a model-estimated proportion of LLM-likeness in the text. They're measuring different things. Foundry10's number is almost certainly an undercount (people under-report behaviors they perceive as ethically gray). Paper 2's α̂ is harder to read as a "share who used AI" — it's "how LLM-like is the writing on average," which can move because of style drift even without LLM use.
Where the studies materially overlap — a caveat
This matters and we want to flag it loudly: Papers 1 and 2 come from the same Cornell research team and use Common App essays from the same single highly selective engineering school. They are not independent corroboration. They share data, methods, and selection effects. Anything they report may not generalize to liberal-arts colleges, less selective schools, public universities with rolling admissions, or graduate programs.
Paper 2 also has a methodological caveat the authors call out themselves: the difference-in-differences analysis fails its parallel-trends assumption in 2022. The authors explicitly describe their findings as "descriptive rather than causal." We do not write "ChatGPT caused the SES gap to widen." Confounds include the pandemic test-optional shift and the 2023 SFFA v. Harvard ruling that ended race-conscious admissions — both of which independently changed admit dynamics in the same window.
Paper 3's adoption number is self-report, which on a sensitive behavior almost certainly under-counts. Foundry10 is also an advocacy-leaning organization, not a peer-reviewed venue. Some of their language reads as recommendation, not finding.
Finally: every paper is built on 2024-or-earlier data, mostly GPT-4o and Claude 3.5 Sonnet. Frontier models in 2026 produce noticeably different prose. Detection F1 numbers, lexical fingerprints, and reader-perception costs may all shift as the models do.
What we still don't know
The honest list of open questions is long.
- No peer-reviewed graduate-school data. Everything published is undergraduate Common App essays from one engineering school. AMCAS, CASPA, and law-school personal statements have not been studied at this scale. Our grad-school AI policy work surfaces what schools say; the effect of AI on grad admissions remains unmeasured.
- No liberal-arts college data. The single-school engineering sample selects for a STEM-leaning population whose essays may already lean technical and abstract. Liberal-arts essays may show different LLM-vs-human gaps.
- No real-world iterative-prompting study. Paper 1's one-shot prompts are not how students use ChatGPT in practice. We need essays where students drafted, pasted, edited, regenerated, and submitted — and need to know how detectable that workflow is.
- No admissions-officer behavioral study at scale. Foundry10's vignettes used a national online sample, not actual admissions readers reading actual files with full application context.
- No causal estimate of the AI effect. Until a study can rule out confounds — test-optional, SFFA, application-volume changes — every "ChatGPT did X" claim is correlation.
- No longitudinal applicant cohort. We have no data on what happens to admitted students who used AI heavily — performance, retention, or graduation outcomes.
What this means for applicants, in plain terms
None of these studies prescribes strategy for an individual applicant, but four observations are well supported enough to act on:
- AI-written essays are detectable enough to be risky. Even if F1 = 0.998 doesn't survive iterative prompting, the qualitative finding holds — LLM essays cluster around predictable abstractions. Submitting an essay that reads like every other AI essay is a competitive risk independent of any detector. See ChatGPT vs. real college essays.
- Identity prompts don't fix the problem. Telling ChatGPT you're first-gen, low-income, or from a specific background doesn't make its output match how real applicants from that background write. Specificity comes from memory, not a system prompt.
- Disclosure framing matters. When readers know AI helped, they rate authenticity, competence, and ethics meaningfully lower — so if AI use is going to be visible (through disclosure or stylistic tells), the essay needs to do enough work to overcome the perception cost.
- Specific, temporal, personal language is your edge. The strongest signal in Paper 1 was that humans use words like "year," "time," "friend," "would" — the vocabulary of memory and contingency. LLMs reach for "growth" and "resilience." Write the way you remember things.
What we'll update next
We're tracking a few specific research questions and will revise this post as new findings come in:
- A real-world iterative-prompting detection study. Specifically: an audit of essays produced through realistic student-LLM workflows (drafting, regenerating, hand-editing) versus one-shot generation. The detection ceiling almost certainly drops.
- A multi-institution replication of Paper 2. The SES-gap-widening finding needs to be tested on at least one liberal-arts college and one large public university before it can be claimed as a general pattern.
- Admissions-reader experiments with trained readers. The Foundry10 vignettes need a version run with current admissions officers reading complete files, not paragraph-level snippets in a survey panel.
- Graduate-program data. AMCAS, CASPA, and law-school personal statements, along with SOP corpora, have been studied informally, but no large peer-reviewed paper exists. When one does, it goes in this section.
- Frontier-model retesting. Paper 1's findings on GPT-4o and Claude 3.5 Sonnet should be re-run on the 2026 generation of models. Lexical fingerprints may have shifted.
If you're a researcher working on any of these questions, we'd love to read your draft. Email is in the footer.
The bottom line
Three papers, two from the same lab, all subject to real caveats — and yet a coherent picture emerges. AI use in college admissions is widespread, asymmetric in who pays the cost, detectable in current frontier models under controlled conditions, and meaningfully penalized by readers when it's visible. The "AI levels the playing field" hope is, on the data we have, unsupported. The "AI is undetectable so don't worry" hope is, on the data we have, also unsupported.
We'll keep this post current as the literature grows. The questions are getting more interesting, not less.
See also: The income cliff in application AI use · The words that make a college essay sound AI-written · Why identity-prompting ChatGPT doesn't make it sound like you · How admissions readers evaluate AI-assisted essays · Teachers are using ChatGPT for rec letters · Do colleges use AI detectors? The Turnitin truth · Most colleges have no AI policy · ChatGPT vs. real college essays
Quick AI Check
See if your essay will pass university AI detection in seconds.