AI in College Admissions Research: 4 Major Studies
4 studies on AI in college essays: detection F1=0.998, SES admit gap widened 31%, AI essays lose 22% authenticity, AI writes male and privileged.
Four large studies from 2024 to 2026 — three from the same Cornell team (one with Stanford), one from the nonprofit Foundry10 — have moved the conversation about AI in college admissions out of speculation and into evidence. They report that classifiers can separate human from LLM-written essays at F1 = 0.998 (Cornell, Jan 2026), that the admit-rate gap between higher- and lower-SES applicants widened by 31% after ChatGPT's release (Cornell, Feb 2026), that when readers are told an applicant used ChatGPT, perceived authenticity drops from 3.98 to 3.09 on a 5-point scale (Foundry10, July 2024), and that AI essays linguistically align with male, continuing-generation, and high-economic-connectedness applicants — with GPT-4 even more skewed than GPT-3.5 (Cornell + Stanford, Sep 2024).
This post is the literature review. Not "do colleges check for AI" — for that, see the Turnitin truth and our 174-school policy classification. Here we ask the narrower, more useful question: what does the actual research say, where does it agree, where does it disagree, and what does it not yet know?
Updated: May 10, 2026. We revise this post as new research emerges.
TL;DR — four findings worth carrying around
- The detector ceiling is much higher than the field assumed. A T5 classifier trained on 30,000 real Common App essays paired with 87,696 synthetic LLM essays separates the two at F1 = 0.998. A simple TF-IDF baseline gets 0.999. (Cornell, Jan 2026)
- The SES admit gap widened, not narrowed, after ChatGPT. At one selective engineering school, the higher-SES vs. lower-SES admit gap went from 10.7 percentage points pre-ChatGPT to 14.0 pp post-ChatGPT — a 31% widening — even though lower-SES applicants used LLMs 28% more. (Cornell, Feb 2026)
- Telling a reader "ChatGPT helped" costs about 22% of perceived authenticity. In a controlled vignette experiment, the same essay paragraph was rated 3.98 (no help) vs. 3.68 (admissions coach) vs. 3.09 (ChatGPT) for authenticity, with similar drops for competence and ethics. (Foundry10, July 2024)
- AI essays read like male, continuing-generation, high-economic-connectedness applicants — and the premium model is more skewed, not less. Across ~170,000 essays, GPT-3.5 and GPT-4 align with male-coded, continuing-generation, and high-EC-ZIP linguistic patterns 65–92% of the time on significantly different LIWC features. 20% of UC Latinx applicants included some Spanish in their essays; 0% of the AI essays did. (Cornell + Stanford, Sep 2024)
Each of these findings has caveats, and we'll get to them. None should be read as a final verdict on AI's role in admissions, but together they are the strongest empirical evidence we currently have.
What each paper actually studied
Paper 1: Cornell, January 2026 — "Poor Alignment and Steerability of LLMs"
Lee, Alvero, Joachims, and Kizilcec at Cornell paired 30,000 real Common App essays from one selective US engineering school (2019–2023 cycles) with 87,696 synthetic essays generated by 8 frontier LLMs — GPT-4o, GPT-4o-mini, Mistral Large, Mistral Nemo, Claude 3.5 Sonnet, Claude Haiku, Llama 3.1 70B, and Llama 3 8B.
Their headline finding: classifiers trained on the paired corpus separated human from LLM essays at F1 = 0.998 (T5) and F1 = 0.999 (TF-IDF). Even more striking, identity prompting failed to help LLMs sound more human. A classifier trained to separate identity-prompted from default LLM output reached only F1 = 0.816, well below the human-vs-LLM scores, meaning the demographic prompts moved the writing far less than the distance between LLM and human essays. For one subgroup, identity prompting actively reduced alignment with real student writing (the Black-applicant prompt drift was significant at t = 2.327, p = 0.020).
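For readers who want to see what an F1 score near 1.0 implies, here is a minimal sketch of the metric. The counts are hypothetical and chosen only to land near the reported value; they are not the paper's confusion matrix, and `f1_score` is our own helper name.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0  # no correct detections means no credit
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical test set (NOT from the paper): 1,000 LLM essays, of which
# 998 are flagged, plus 2 human essays flagged by mistake.
example = f1_score(true_pos=998, false_pos=2, false_neg=2)
print(round(example, 3))
```

Because F1 balances precision against recall, a score this high requires both very few missed LLM essays and very few human essays wrongly flagged.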
The mechanism: LLMs cluster tightly around abstract prompt-keyword vocabulary ("challenge," "growth," "journey," "resilience"), while human applicants reach for temporal and personal words ("year," "time," "friend," "would"). LLM-to-LLM cosine similarity sits at 0.952–0.957; human-to-human at 0.882–0.889. The machines are simply more uniform than the people. For a deeper read on which words matter most, see the lexical fingerprint study breakdown. For why "I'm a first-gen Latina" does not fix the problem, see why identity prompting fails.
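To make the cosine-similarity comparison concrete, here is a rough sketch using plain word counts. This post does not specify how the paper vectorized its essays, so the bag-of-words representation is an assumption, and the four toy sentences are invented for illustration rather than drawn from any dataset.

```python
import math
from collections import Counter


def term_counts(text: str) -> Counter:
    """Lowercased whitespace tokens mapped to their counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented toy sentences (assumption): two in the abstract register the
# post attributes to LLMs, two in a more personal, temporal register.
llm_1 = term_counts("this challenge shaped my growth and my journey toward resilience")
llm_2 = term_counts("my journey through challenge built resilience and growth")
human_1 = term_counts("that year my friend and i would walk home")
human_2 = term_counts("last summer i spent time fixing a bike with my dad")

llm_sim = cosine_similarity(llm_1, llm_2)
human_sim = cosine_similarity(human_1, human_2)
print(f"LLM pair: {llm_sim:.3f}, human pair: {human_sim:.3f}")
```

On these toy inputs the LLM-style pair scores higher because the two sentences share most of their vocabulary, which is the same intuition behind the paper's uniformity finding.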
"You want to sound as much like yourself — and only yourself — as possible. With AI tools, students might be shooting themselves in the foot inadvertently."
— AJ Alvero, Cornell
Paper 2: Cornell, February 2026 — "The Digital Divide in Generative AI"
The same team plus Borchers (arXiv 2602.17791) extended the dataset by one cycle and reframed the question. They analyzed 81,663 de-identified Common App essays from the same engineering school (2019–2024), used fee-waiver status as a proxy for SES, and modeled LLM use as a continuous proportion (α̂) rather than a binary did-they-or-didn't-they.
The findings everyone cites:
- Pre-ChatGPT admit gap: higher-SES 23.6% vs. lower-SES 12.9% — a 10.7 percentage-point gap.
- Post-ChatGPT admit gap: higher-SES 26.3% vs. lower-SES 12.3% — a 14.0 pp gap. The gap widened by 31%.
- Lower-SES applicants used LLMs more, not less. Mean estimated α̂ was 0.102 (lower SES) vs. 0.080 (higher SES) — a 28% relative difference (p < 0.001). 22.7% of lower-SES applicants vs. 18.7% of higher-SES were in the "high use" tier (α̂ > 0.13).
- The penalty per unit of LLM use was 1.85× larger for lower-SES applicants. Higher-SES odds of admission dropped 60% per unit α̂ (OR = 0.40); lower-SES dropped 86% (OR = 0.14).
- Linguistic convergence: by 2024, the lexical features that previously distinguished admitted from rejected essays had statistically converged — 95% confidence intervals overlap.
- Only about 7% of the SES admission penalty is explained by surface stylometric features. Whatever readers (or downstream signals) are picking up on, it is not just word choice.
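The arithmetic behind the gap, widening, and odds figures above can be checked in a few lines. This is a sketch using only the rounded numbers quoted in this post. With the rounded means, the α̂ ratio works out to 1.275, a 27.5% difference; the reported 28% presumably reflects unrounded values, which is an assumption on our part. The 1.85× figure depends on model details not given here, so it is not recomputed.

```python
def gap_pp(higher_rate: float, lower_rate: float) -> float:
    """Gap between two admit rates, in percentage points."""
    return round(higher_rate - lower_rate, 1)


def odds_drop_pct(odds_ratio: float) -> float:
    """Percent reduction in odds implied by an odds ratio below 1."""
    return (1 - odds_ratio) * 100


pre_gap = gap_pp(23.6, 12.9)    # pre-ChatGPT admit rates quoted above
post_gap = gap_pp(26.3, 12.3)   # post-ChatGPT admit rates quoted above
relative_widening = (post_gap - pre_gap) / pre_gap

# Ratio of mean estimated LLM use, lower-SES over higher-SES.
alpha_ratio = 0.102 / 0.080

print(f"pre gap: {pre_gap} pp, post gap: {post_gap} pp")
print(f"relative widening: {relative_widening:.1%}")
print(f"alpha ratio: {alpha_ratio:.3f}")
print(f"odds drop per unit: higher-SES {odds_drop_pct(0.40):.0f}%, "
      f"lower-SES {odds_drop_pct(0.14):.0f}%")
```

Note the two kinds of percentage in play: the gap itself is measured in percentage points, while the 31% widening is a relative change in that gap.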
For what this means in practical terms — which income brackets adopted AI and why — see the income-cliff deep dive.
Paper 3: Foundry10, July 2024 — "Navigating College Applications with AI"
The Foundry10 team (Rubin, Lombard, Chen, Divanji) ran a survey of 425 US high-school teachers and 523 US teens aged 16–18 who had applied to college that cycle, fielded Feb–Mar 2024 through Sago's national panel.
The behavioral baseline:
- 67% of high-school students have tried text-generation AI tools; 35% never use AI for schoolwork at all.
- 30% of student applicants used generative AI on their personal application essays. Among those who did: 50% for brainstorming, 47% for outlines, 33% for phrasing, 32% for first drafts, 20% for final drafts, 10% for translation.
- Income matters a lot. Only 22% of students from households under $50K used AI on their essays vs. 40% of those from $75K–$100K — a 150% increase in odds. The "income cliff" sits in the lower-middle, not at the very bottom. For why this matters, see the income-cliff post.
- 31% of teachers used AI to write college recommendation letters. 53% of those teachers said reducing stress was a primary motivator. See our writeup on the teacher rec-letter pattern.
Then the experimental piece — the one most relevant to admissions readers. Foundry10 showed respondents the same essay paragraph attached to one of three vignettes about who helped the applicant write it: ChatGPT, an admissions coach, or no one. Ratings on a 5-point scale:
| Trait | ChatGPT | Coach | No help |
|---|---|---|---|
| Authenticity | 3.09 | 3.68 | 3.98 |
| Competence | 3.42 | 3.86 | 4.04 |
| Ethics | 2.86 | 3.57 | 4.04 |
| Likability | 3.27 | 3.55 | 3.65 |
Every ChatGPT-vs-others difference was significant at p < 0.001. Notably, applicant name, race, and gender did not moderate the AI penalty — the perception cost did not vary by demographic group. For the full reader-perception breakdown, see how admissions readers evaluate AI-assisted essays.
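The "about 22%" authenticity figure is a relative drop against the no-help baseline, and the same calculation applies to the other traits. A small sketch using the table's values (the dictionary layout and the `relative_drop` helper are ours):

```python
# Mean ratings on the 5-point scale, copied from the table above.
ratings = {
    "Authenticity": {"chatgpt": 3.09, "coach": 3.68, "no_help": 3.98},
    "Competence":   {"chatgpt": 3.42, "coach": 3.86, "no_help": 4.04},
    "Ethics":       {"chatgpt": 2.86, "coach": 3.57, "no_help": 4.04},
    "Likability":   {"chatgpt": 3.27, "coach": 3.55, "no_help": 3.65},
}


def relative_drop(baseline: float, treated: float) -> float:
    """Fraction of the baseline rating lost under the treated condition."""
    return (baseline - treated) / baseline


drops = {
    trait: relative_drop(r["no_help"], r["chatgpt"])
    for trait, r in ratings.items()
}

for trait, r in ratings.items():
    coach_drop = relative_drop(r["no_help"], r["coach"])
    print(f"{trait:<12} ChatGPT {drops[trait]:.1%}   coach {coach_drop:.1%}")
```

Read this way, ethics takes the largest relative hit under the ChatGPT vignette and likability the smallest.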
Paper 4: Cornell + Stanford, September 2024 — "Large language models, social demography, and hegemony"
Alvero, Lee, Regla-Vargas, Kizilcec, Joachims, and Antonio, in Journal of Big Data (11:138), compared linguistic patterns of human and AI essays at scale. The corpus: 35,789 Latinx in-state UC applicants (2016–2017, 143,156 essays), 10,619 pre-ChatGPT essays from a selective Northeastern engineering school (2022–2023), and roughly 26,000 GPT-3.5 and GPT-4 essays on the same prompts — about 170,000 essays. Method: LIWC (76 features), KS tests, cosine-similarity "twin" analysis, and a regression linking essay style to Opportunity Insights' ZIP-level Economic Connectedness (EC) metric (R² = 0.57).
"We wanted to find out what these patterns that we see in human-written essays look like in a ChatGPT world. The ways that we speak can encode and contain information about our past and who we are."
— AJ Alvero, Cornell
The headline findings:
- AI uses long words (6+ letters) far more than humans, with much lower variance.
- AI aligns with male-coded writing 65.5%–79.3% of the time on significantly-different gendered LIWC features.
- AI aligns with continuing-generation writing 75.7%–81.1% of the time.
- AI aligns with high-EC-ZIP writing 80.0%–92.0% of the time. Predicted EC of AI essays = 1.33 vs. applicant-pool average of 0.84 — roughly the gap between the 50th and 90th-plus percentile of economic connectedness.
- GPT-4 is more skewed than GPT-3.5, not less. The premium model writes even more privileged.
- Among private-school applicants, ~95% of the human essays most similar to AI essays were written by students with at least one college-educated parent (vs. an 82% baseline).
- 20% of UC Latinx applicants included some Spanish in their essays. 0% of the AI-generated essays did. That zero is the most concrete cultural-erasure stat across all four papers.
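To illustrate the kind of comparison behind the long-word finding, here is a simplified sketch: compute each essay's share of words with six or more letters, then compare the two distributions with a two-sample Kolmogorov-Smirnov statistic. LIWC itself is a proprietary dictionary tool, so the word-length function below is only a stand-in for one LIWC-style feature, and the toy sentences are invented. In practice a library routine such as `scipy.stats.ks_2samp` would also supply a p-value.

```python
import bisect


def long_word_share(text: str, min_len: int = 6) -> float:
    """Fraction of words with at least `min_len` letters."""
    words = [w.strip(".,;:!?\"'()") for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(len(w) >= min_len for w in words) / len(words)


def ks_statistic(sample_a: list, sample_b: list) -> float:
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample: list, x: float) -> float:
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))


# Invented toy texts (assumption), standing in for real essay corpora.
ai_texts = [
    "This challenge fostered resilience and personal growth.",
    "My journey through adversity strengthened my determination.",
]
human_texts = [
    "That year my friend and I would walk home.",
    "I spent the time at my dad's shop.",
]

ai_shares = [long_word_share(t) for t in ai_texts]
human_shares = [long_word_share(t) for t in human_texts]
print("KS statistic:", ks_statistic(ai_shares, human_shares))
```

A statistic near 1 means the two distributions barely overlap; a statistic near 0 means they are nearly indistinguishable on that feature.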
Public-school LIWC features are on Harvard Dataverse; a SocArXiv preprint is also up.
Caveats: the UC sample is Latinx-only, in-state, 2016–2017; the private-school sample is engineering applicants at one school; gender is binary M/F; prompts were "out-of-the-box" with no demographic priming (Paper 1 tested that later); LIWC is dictionary-based, and comparing 76 features raises real multiple-comparison concerns; only GPT-3.5 and GPT-4 were tested; and this is the same Cornell team (with Stanford's Antonio), not independent corroboration of Papers 1 and 2.
Where the studies agree
Across very different methods, five findings converge.
1. AI use is widespread and growing. Foundry10's self-report puts adoption at 30% on the personal essay. Paper 2's continuous estimate suggests a median post-GPT α̂ of 0.043 overall (and 0.092 among detected-use applicants). Assuming "almost no one uses AI" is no longer credible.
2. The technical gap between human and LLM writing is large enough to detect — at least under controlled conditions. Paper 1's F1 = 0.998 is the strongest detection signal we have on real essays. Paper 2's lexical convergence finding suggests the gap between admitted and rejected essays is shrinking at the same time. Paper 4's LIWC analysis finds AI essays clustering tightly on long-word usage with much lower variance than humans. All three point to LLMs producing a recognizable, narrow band of prose.
3. Identity does not rescue authenticity. Paper 1 shows that prompting an LLM with demographic identity does not produce text matching real applicants from that demographic — and sometimes makes it worse. Foundry10's vignette shows the "ChatGPT helped" frame depresses ratings regardless of applicant name, race, or gender.
4. The intent-vs-effect gap is real. Foundry10 finds the most common reasons students give for not using AI are preferring human work (57%), ethics (45%), and accuracy (39%), yet 30% of applicants used it on their personal essays anyway. Paper 2 finds the applicants whose AI use most plausibly reflected scarcity of other support (lower-SES, higher α̂) paid the largest admission penalty.
5. AI's default voice tracks privileged English. Paper 4 shows out-of-the-box LLM essays align with male, continuing-generation, and high-EC applicants — GPT-4 more so than GPT-3.5. Paper 1's "identity prompts barely help" result is the natural follow-up: the homogenization is in the model, not just the prompt. Alvero's "linguistic hegemony" framing describes what Paper 1 then measured.
"It's likely that students are going to be using AI to help them craft these essays — probably not asking it to just write the whole thing. But it's probably going to sound less like you and more like something quite generic."
— Rene Kizilcec, Cornell
Where they disagree — or look at it differently
Foundry10 leans optimistic about democratization. Papers 2 and 4 don't. Foundry10 frames AI as a possible counterweight to the inequities of paid admissions coaching, citing coaching costs of $79–$349 per hour that lower-income families can't access. The implicit hope: if a $20/month chatbot can do some of what a $300/hour coach does, the playing field flattens. Paper 2 is the empirical counterevidence on outcomes: lower-SES applicants used AI more and saw the admit gap widen, not narrow. Paper 4 is the linguistic counterevidence: the AI's default voice already encodes the writing patterns of the privileged applicants that Foundry10 hopes everyone else can catch up to. The democratization story is appealing; the data so far doesn't support it.
The "real-world" detection ceiling is contested. Paper 1's F1 = 0.998 is on one-shot prompted LLM essays. Real students iteratively co-write — generate an outline, paste in a draft, ask for a smoother transition, hand-edit, ask again — and that workflow likely produces text closer to human writing. We don't claim "detectors really work at 99.8% accuracy in the wild." See our detector reality post for what classroom and admissions detectors actually deliver.
Adoption rates depend on how you ask. Foundry10's 30% is self-report (almost certainly an undercount on a behavior perceived as ethically gray). Paper 2's continuous α̂ is a model-estimated proportion of LLM-likeness in the text, which can move because of style drift even without LLM use. The two numbers are measuring different things.
Where the studies materially overlap — a caveat
We want to flag this loudly: three of the four papers (1, 2, and 4) come from the same Cornell research team — Alvero, Lee, Kizilcec, and Joachims, with Borchers on Paper 2 and Stanford's Antonio plus Regla-Vargas on Paper 4. Papers 1 and 2 share the same single highly-selective engineering school dataset; Paper 4 uses a UC Latinx 2016–2017 corpus and a 2022–2023 sample from that same Northeastern engineering school. They share data, methods, lab tradition, and selection effects, and may not generalize to liberal-arts colleges, public universities with rolling admissions, or graduate programs. Foundry10 (Paper 3) is the only non-Cornell empirical study in this set.
Paper 2 also has a methodological caveat the authors call out themselves: the difference-in-differences analysis fails its parallel-trends assumption in 2022. The authors explicitly describe their findings as "descriptive rather than causal." We do not write "ChatGPT caused the SES gap to widen." Confounds include the pandemic test-optional shift and the 2023 SFFA v. Harvard ruling that ended race-conscious admissions — both of which independently changed admit dynamics in the same window.
Paper 3's adoption number is self-report, which on a sensitive behavior almost certainly under-counts. Foundry10 is also an advocacy-leaning organization, not a peer-reviewed venue.
Finally, every paper is built on 2024-or-earlier data and on models of that era: GPT-3.5 and GPT-4 in Paper 4, and GPT-4o, Claude 3.5 Sonnet, and their contemporaries in Paper 1. Frontier models in 2026 produce noticeably different prose, and the numbers above may shift as the models do.
What we still don't know
The honest list of open questions is long.
- No peer-reviewed graduate-school data. Everything published covers undergraduate admissions: Common App essays from one engineering school, UC essays from 2016–2017, and a national survey of teens and teachers. Our grad-school AI policy work surfaces what schools say; the effect of AI on grad admissions remains unmeasured.
- No liberal-arts college data. The single-school engineering sample selects for a STEM-leaning population whose essays may already lean technical and abstract.
- No real-world iterative-prompting study. Paper 1's one-shot prompts are not how students use ChatGPT in practice. We need essays where students drafted, pasted, edited, regenerated, and submitted.
- No admissions-officer behavioral study at scale. Foundry10's vignettes used a national online sample, not admissions readers reading complete files in context.
- No causal estimate of the AI effect. Until a study can rule out confounds — test-optional, SFFA, application-volume changes — every "ChatGPT did X" claim is correlation.
- No longitudinal applicant cohort. We have no data on what happens to admitted students who used AI heavily.
What this means for applicants
None of these studies are individual-applicant prescriptions, but four observations are well-supported enough to act on:
- AI-written essays are detectable enough to be risky. Even if F1 = 0.998 doesn't survive iterative prompting, the qualitative finding holds — LLM essays cluster around predictable abstractions. Submitting an essay that reads like every other AI essay is a competitive risk independent of any detector. See ChatGPT vs. real college essays.
- Identity prompts don't fix the problem. Telling ChatGPT you're first-gen, low-income, or from a specific background doesn't make its output match how real applicants from that background write. Specificity comes from memory, not a system prompt.
- Disclosure framing matters. When readers know AI helped, they rate authenticity, competence, and ethics meaningfully lower — so if AI use is going to be visible (through disclosure or stylistic tells), the essay needs to do enough work to overcome the perception cost.
- Specific, temporal, personal language is your edge. The strongest signal in Paper 1 was that humans use words like "year," "time," "friend," "would" — the vocabulary of memory and contingency. LLMs reach for "growth" and "resilience." Write the way you remember things.
The Cornell team's own framing of the practical takeaway lines up with what their data shows:
"Tools like ChatGPT can give solid feedback on writing and are likely a good idea for weak writers. But asking for a full draft will yield a generic essay that just does not sound like any real applicant."
— Rene Kizilcec, Cornell
"Use their own ideas to brainstorm and write the first draft, and, if allowed, use AI only to refine and proofread."
— Jinsook Lee, Cornell
What we'll update next
We're tracking a few specific research questions and will revise this post as new findings come in:
- A real-world iterative-prompting detection study. Essays produced through realistic student-LLM workflows (drafting, regenerating, hand-editing) versus one-shot generation. The detection ceiling almost certainly drops.
- A multi-institution replication of Paper 2. The SES-gap-widening finding needs to be tested on at least one liberal-arts college and one large public university before it can be claimed as a general pattern.
- Graduate-program data. AMCAS, CASPA, and SOP corpora have not been studied at this scale.
- Frontier-model retesting. Paper 1's findings on GPT-4o and Claude 3.5 Sonnet should be re-run on the 2026 generation of models. Lexical fingerprints may have shifted.
If you're a researcher working on any of these questions, we'd love to read your draft. Email is in the footer.
The bottom line
Four papers, three from the same Cornell lab, all subject to real caveats — and yet a coherent picture emerges. AI use in college admissions is widespread, asymmetric in who pays the cost, detectable in current frontier models under controlled conditions, meaningfully penalized by readers when it's visible, and — by default — voiced like the most privileged applicants in the pool rather than the ones most in need of help. The "AI levels the playing field" hope is, on the data we have, unsupported. The "AI is undetectable so don't worry" hope is, on the data we have, also unsupported.
We'll keep this post current as the literature grows. The questions are getting more interesting, not less.
See also: The income cliff in application AI use · The words that make a college essay sound AI-written · Why identity-prompting ChatGPT doesn't make it sound like you · How admissions readers evaluate AI-assisted essays · Teachers are using ChatGPT for rec letters · Do colleges use AI detectors? The Turnitin truth · Most colleges have no AI policy · ChatGPT vs. real college essays