
AI College Essays Sound Male and Privileged - Cornell Study

Cornell + Stanford analyzed ~170,000 essays: AI aligns with male, continuing-gen, and high-EC writing 65–92% of the time. GPT-4 is more skewed than GPT-3.5.

Nirmal Thacker, CS, Georgia Tech · Cerebras Systems AI · May 10, 2026 · 15 min read


Ask ChatGPT to write the same Common App essay for a first-generation Latina student and a continuing-generation student whose parents both went to Princeton, and the linguistic style of the output barely budges. That's because, with no demographic context at all, the model's default voice already resembles the writing of male, continuing-generation, and high-economic-connectedness applicants — anywhere from 65.5% to 92.0% of the time on significantly different LIWC features, depending on the comparison (Alvero et al., Cornell + Stanford, Sep 2024).

The twist most people don't see coming: GPT-4 is more skewed toward the privileged-sounding voice than GPT-3.5, not less. Paying for the premium model gets you an essay that reads as more privileged, not more like you. The predicted Economic Connectedness score of AI essays is 1.33 vs. an applicant-pool average of 0.84 — roughly the gap between a 50th-percentile ZIP code and a 90th-plus-percentile one (Alvero et al., Cornell + Stanford, Sep 2024).

Updated: May 10, 2026. We revise this post as new research emerges.

This post is the deep-dive on Paper 4 from our cross-study synthesis on AI in admissions research. For the companion piece on why telling ChatGPT your identity doesn't fix this, see why identity prompting fails. For the lexical-fingerprint angle on which words betray AI authorship, see the AI essay vocabulary fingerprint study.

What Cornell tested

The team — Alvero, Lee, Regla-Vargas, Kizilcec, and Joachims at Cornell, plus Antonio at Stanford — published "Large language models, social demography, and hegemony" in the Journal of Big Data (11:138) in September 2024 (DOI: 10.1186/s40537-024-00986-7). It's the largest linguistic-comparison study of AI vs. human college essays to date.

The corpus is built from three sources:

  • 35,789 Latinx in-state UC applicants from the 2016–2017 cycle, contributing 143,156 personal-insight essays. The University of California's pre-2020 four-prompt format meant most applicants wrote multiple essays.
  • 10,619 pre-ChatGPT essays from a single selective Northeastern engineering school, 2022–2023 cycles. These are the same applicant pool used in Paper 1 and Paper 2 from the same Cornell team — engineering applicants only.
  • Approximately 26,000 GPT-3.5 and GPT-4 essays generated on the same prompts, with no demographic context in the input.

Total: roughly 170,000 essays.

The method is dictionary-based stylometrics. The team ran LIWC (Linguistic Inquiry and Word Count), which scores text on 76 features — things like long-word usage, pronoun rates, emotional valence, and references to time, family, and the body. They used Kolmogorov-Smirnov tests to identify which features significantly differed between demographic groups (male vs. female, first-gen vs. continuing-gen, low- vs. high-EC ZIPs), then asked: on those significantly different features, which side does the AI output align with?
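To make that tally concrete, here is a minimal sketch of the alignment calculation in Python, assuming per-essay LIWC scores already live in pandas DataFrames. The closer-mean alignment criterion, the 0.05 threshold, and the DataFrame names are our simplifications, not the paper's exact procedure.

```python
# Minimal sketch of the feature-alignment tally, not the paper's exact procedure.
# Assumes each DataFrame holds per-essay LIWC scores with one column per feature.
import pandas as pd
from scipy.stats import ks_2samp

def alignment_rate(group_a: pd.DataFrame, group_b: pd.DataFrame,
                   ai: pd.DataFrame, alpha: float = 0.05) -> float:
    """On features where group_a and group_b differ significantly (two-sample
    KS test), return the share where the AI essays' mean sits closer to
    group_a's mean than to group_b's."""
    aligned_with_a = significant = 0
    for feature in group_a.columns:
        _, p = ks_2samp(group_a[feature], group_b[feature])
        if p >= alpha:
            continue  # the two human groups don't differ on this feature
        significant += 1
        dist_a = abs(ai[feature].mean() - group_a[feature].mean())
        dist_b = abs(ai[feature].mean() - group_b[feature].mean())
        aligned_with_a += dist_a < dist_b
    return aligned_with_a / significant if significant else float("nan")

# e.g. alignment_rate(continuing_gen_liwc, first_gen_liwc, gpt4_liwc)
# yields a number comparable in spirit to the 75.7%-81.1% figure below.
```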

They also ran a cosine-similarity "twin" analysis — for each AI essay, which human essay in the corpus is its closest lexical match? And they regressed essay style on Opportunity Insights' ZIP-level Economic Connectedness metric, getting R² = 0.57 — a strong signal that privilege is genuinely encoded in writing style.
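A bare-bones version of that twin lookup might look like the following. The TF-IDF vectorization is our assumption rather than the paper's exact featurization, and nearest_human_twins is a hypothetical helper name.

```python
# Sketch of the "twin" lookup: for each AI essay, find the most lexically
# similar human essay by cosine similarity. Vectorization choice is ours.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def nearest_human_twins(human_essays: list[str], ai_essays: list[str]) -> list[int]:
    vec = TfidfVectorizer()
    human_matrix = vec.fit_transform(human_essays)    # vocabulary fit on the human corpus
    ai_matrix = vec.transform(ai_essays)              # AI essays projected into the same space
    sims = cosine_similarity(ai_matrix, human_matrix) # shape: (n_ai, n_human)
    return sims.argmax(axis=1).tolist()               # index of each AI essay's closest human match

# Joining those indices back to applicant metadata (parental education, ZIP-level EC)
# is what produces comparisons like "95% of twins had a college-educated parent."
```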

The full public-school LIWC dataset is on Harvard Dataverse — a real public artifact you can re-analyze yourself if you want to verify the headline numbers.

Who AI sounds like

The cleanest summary of the findings is a table of percentages — on the LIWC features where two demographic groups significantly diverged, how often did the AI output land on the privileged-coded side?

Comparison | AI alignment with the more-privileged group
Male vs. female (significantly different features) | 65.5%–79.3% with the male side
First-generation vs. continuing-generation | 75.7%–81.1% with the continuing-gen side
Low-EC vs. high-EC ZIP code | 80.0%–92.0% with the high-EC side

The range reflects which subsample (UC Latinx vs. Northeastern engineering, GPT-3.5 vs. GPT-4) and which set of features is being tallied. Even the lower end of each range is well above chance.

Two specific texture findings sit underneath the headline percentages:

  • AI uses long words ("Sixltr" in LIWC — words of 6+ letters) far more than humans, with much lower variance. That's a single feature, but it's the one that does the heaviest lifting in distinguishing AI from human writing across the whole corpus. Long words skew formal, academic, and Latinate — which historically tracks with continuing-gen and high-EC English. A toy version of this feature appears in the sketch after this list.
  • The gender story is genuinely complex. While AI is meaningfully male-skewed on most of the significantly different gendered features, there's a counter-pattern in the public-school subsample where AI is more aligned with female-coded writing on a subset of features. We're not telling you AI "sounds like a man." We're telling you that on the features where men and women diverge significantly, AI usually picks the male-coded option. That's a narrower, more defensible claim.
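Sixltr itself is just a word-length ratio. Here is a toy approximation; LIWC's real tokenizer and dictionaries differ, so treat it as illustrative only.

```python
# Toy approximation of LIWC's "Sixltr" feature: share of words with 6+ letters.
import re

def sixltr_rate(text: str) -> float:
    words = re.findall(r"[A-Za-z]+", text)
    return sum(len(w) >= 6 for w in words) / len(words) if words else 0.0

# A plain sentence vs. a formal, Latinate one:
# sixltr_rate("My grandmother taught me to cook rice the slow way")                      -> 0.2
# sixltr_rate("Navigating institutional expectations demanded considerable resilience")  -> 1.0
```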

The Cornell team is careful not to call this "AI is sexist" or "AI is classist." They frame it as linguistic hegemony — the model's default register gravitates toward the styles of English most associated with social privilege, because those styles dominate the training data. It's a structural property of how LLMs learn to write, not a moral failing of any particular model.

"We wanted to find out what these patterns that we see in human-written essays look like in a ChatGPT world. The ways that we speak can encode and contain information about our past and who we are."

— AJ Alvero, Cornell

The twin essays: AI's closest human relatives

The cosine-similarity twin analysis is one of the most concrete findings in the paper. For each AI-generated essay, the team computed lexical similarity against every human essay in the relevant corpus and asked: who's the AI essay's closest match?

In the private-school engineering sample, roughly 95% of the human "twins" of AI essays were written by students with at least one college-educated parent. The baseline rate of at-least-one-college-educated-parent in the actual applicant pool was 82%. So the AI essays aren't just over-represented in the continuing-gen part of the pool — they're systematically pulled further into it than the pool's own demographics would predict.

The same direction held on the EC measure. AI essays' nearest human twins came from significantly higher-EC ZIP codes than the applicant pool's average. The predicted EC of AI essays from the regression model was 1.33 vs. 0.84 for the average applicant — a gap roughly equivalent to moving from the median US ZIP code to one in the 90th-plus percentile of economic connectedness (Alvero et al., Cornell + Stanford, Sep 2024).
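The predicted-EC comparison can be sketched the same way: fit a regression from human essays' LIWC features to their ZIP-level EC, then score the AI essays with it. A plain linear model is our stand-in here; the paper's actual regression specification may differ.

```python
# Sketch of the style-to-EC regression: learn EC from human essays' LIWC
# features, then predict EC for AI essays. Model choice is a simplification.
import numpy as np
from sklearn.linear_model import LinearRegression

def predicted_ec(human_liwc: np.ndarray, human_ec: np.ndarray,
                 ai_liwc: np.ndarray) -> np.ndarray:
    model = LinearRegression().fit(human_liwc, human_ec)
    print("R^2 on human essays:", model.score(human_liwc, human_ec))
    return model.predict(ai_liwc)

# predicted_ec(...).mean() vs. human_ec.mean() is the analogue of the
# 1.33-vs-0.84 comparison quoted above.
```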

This is the stat to carry around if you only carry one: when you ask AI to write your essay, the text it produces is closest, in lexical style, to applicants from college-educated, high-connectedness households. If you're not from that background, AI's voice is structurally far from yours by default.

Why the premium model is more skewed, not less

The most counterintuitive finding in the paper is the GPT-3.5 vs. GPT-4 comparison. The intuition almost every student starts with is that GPT-4 is "better" — more fluent, more sophisticated, harder to detect, closer to what a real person would write.

The data says GPT-4 is more skewed toward the privileged-coded style than GPT-3.5 is, on every demographic axis the team measured.

  • GPT-4 essays sit further into the high-EC end of the regression's predicted distribution than GPT-3.5 essays.
  • GPT-4's alignment percentages with the male, continuing-gen, and high-EC sides of the LIWC feature comparisons are at or near the top of each range cited above.
  • The long-word ("Sixltr") over-use is more extreme in GPT-4 output than in GPT-3.5 output.

The mechanism is straightforward in retrospect. Newer, larger models have been further RLHF-tuned toward what humans rate as "good writing" — and the humans doing the rating are disproportionately college-educated, often from high-resource backgrounds, working in English-language research and labeling pipelines. The model converges harder on the register those raters prefer. "Better writing" in the RLHF sense and "more privileged-sounding writing" are very close to the same direction in feature space.

The economic implication is sharp. Students who pay $20 a month for ChatGPT Plus get an essay that sounds more privileged than the free-tier version produces, not less. If you're an applicant on a fee waiver, paying for the premium model to "level the playing field" buys you a draft whose stylistic distance from your own voice is actually larger. The democratization story — AI as the cheap alternative to a $349/hour admissions coach — runs straight into the wall of "the cheap alternative doesn't write like you, and the expensive version of the cheap alternative writes even less like you."

Why this matters for applicants who aren't male, continuing-gen, or high-EC

This is the practical section, and we want to be useful rather than moralizing.

If you are a male, continuing-generation applicant from a high-EC ZIP code, the AI's default voice is already adjacent to yours. Brainstorming with ChatGPT, asking it for phrasing suggestions, even letting it draft a paragraph and editing — all of that lands in roughly your stylistic neighborhood. The voice cost of leaning on AI is lower for you.

If you're not — first-generation, lower-income, from a less-connected ZIP, or some combination — AI's default voice is furthest from yours. That means the homogenization layer AI adds to your draft is doing the most distortion exactly where authenticity matters the most for your reader. Admissions officers reading first-gen, low-income applications are already conditioned to look for the specific texture of your life as the strongest signal you can send. Letting AI smooth it over puts the heaviest filter on the only competitive advantage the form offers you.

That doesn't mean "never use AI." A few framings that survive the Cornell data:

  • Brainstorming and outlining are lower-risk. You're not asking the model to put words on the page — you're using it to surface questions, organize ideas, identify gaps. The text remains yours.
  • Specific paragraph polish is medium-risk. Asking the model to tighten a sentence you wrote is much safer than asking it to write the sentence. The structural privilege bias is in the draft, not in the proofread.
  • First-draft generation is highest-risk. The whole homogenization effect documented in this paper kicks in at the moment AI is doing the original composition. The earlier in your workflow you let AI write, the more the privileged-style default leaks into the finished product.

This is closely related to a separate finding we've covered: when readers know AI helped with an essay, perceived authenticity drops from 3.98 to 3.09 on a 5-point scale even when the content is identical (Foundry10, July 2024). The voice homogenization documented in this paper makes AI involvement easier to detect stylistically, which feeds the reader penalty in the Foundry10 vignette study. For more on which income brackets are using AI most — and the surprising shape of that adoption — see the income cliff in college-application AI use.

Why identity prompting doesn't rescue this

The obvious workaround is "I'll just tell ChatGPT I'm first-generation, low-income, from a Spanish-speaking household, and it'll write in that voice." The same Cornell team tested exactly that intuition in a follow-up paper. They generated essays twice, once with no identity information in the prompt (the default) and once with explicit demographic identity context, then measured whether the identity-prompted version moved toward how real applicants from those demographics actually write.

It didn't. A classifier separating identity-prompted from default LLM output ran at F1 = 0.816 — the dial moves a little but not much. And for one tested subgroup (Black applicants), identity prompting actively pushed the output further from real-applicant writing than the default version (t = 2.327, p = 0.020) (Cornell, Jan 2026). We unpacked the full mechanism in why "I'm First-Gen Latina" doesn't make ChatGPT sound like you.

The two papers together describe the same structural problem from different sides. The Sep 2024 paper measures where the model's default voice lives in stylistic space (in the male, continuing-gen, high-EC neighborhood). The Jan 2026 paper measures whether you can steer the model out of that neighborhood by labeling yourself in the prompt (mostly no, and sometimes it gets worse). The homogenization is a property of the model's writing distribution, not of how you ask.

Caveats

We want to flag every limitation the data has. None of these undermine the headline findings, but they shape how far you can generalize.

  • The UC sample is Latinx-only, in-state, 2016–2017. It's not "all UC applicants." It's a specific demographic subgroup at a specific moment, and the generalization to non-Latinx applicants or other UC cycles needs more work.
  • The private-school sample is engineering applicants only, at one school. Same constraint that runs through Papers 1 and 2 from this team. Liberal-arts applicants may write differently enough that some of these patterns shift.
  • Gender is recorded only as M/F binary. Non-binary, trans, and gender-non-conforming applicants are not measurable in this dataset.
  • Prompting was "out-of-the-box." The team did not supply demographic context to the LLM in the corpus generation for Paper 4. Paper 1 tested identity prompting later and found it doesn't fix the homogenization — so the absence of demographic priming here isn't a methodological gap, it's the relevant baseline.
  • Same Cornell team, with Stanford's Antonio. Three of the four papers in our research synthesis come from this lab. This is not independent corroboration of their other findings — it's a coherent research program from a single team, and that should temper how much weight any one institution's findings carry toward a consensus.
  • LIWC is dictionary-based with 76 features. Multiple-comparison concerns are real. The team controls for them, but the per-feature p-values shouldn't be read as if each one is an independent test.
  • GPT-3.5 and GPT-4 only. No Claude, no Llama, no Mistral, no Gemini. We don't yet know if the privileged-style default is a property of OpenAI's RLHF pipeline specifically or a general property of frontier LLM training.
  • The 95% private-school twins figure is from the engineering sample. It may not hold at schools with very different applicant pools.

What we'll update next

A few open questions we're actively tracking, and will revise this post as evidence comes in:

  • Does the privileged-style default hold for non-OpenAI models? Claude, Llama, Mistral, and Gemini have different RLHF pipelines and different training mixes. A replication of the LIWC alignment analysis across the major frontier-model families would tell us whether linguistic hegemony is a property of LLMs in general or of one company's product line.
  • Does the Latinx-only UC finding generalize? The UC corpus is the most demographically specific in the paper. Repeating the analysis on a broader applicant pool — non-Latinx UC applicants, or applicants to a non-UC system — would tell us whether the alignment patterns shift across applicant demographics.
  • Does fine-tuning the model on first-gen or low-EC writing shift the default voice? If you fine-tune a frontier model on a corpus of writing from underrepresented applicants, do its default essays move toward that corpus's stylistic center, or does the RLHF base layer still pull it back toward the privileged register?

If you're a researcher working on any of these, we'd love to read your draft. Email is in the footer.

The bottom line

ChatGPT's default voice already sounds like the most privileged applicants in the pool — male, continuing-generation, and from high-economic-connectedness ZIP codes — at rates between 65.5% and 92.0% on significantly different LIWC features (Alvero et al., Cornell + Stanford, Sep 2024). GPT-4 is more aligned with that voice than GPT-3.5, so paying for the better model widens the gap rather than closing it. The cosine-similarity twins of AI essays in the private-school engineering sample come, 95% of the time, from households with at least one college-educated parent.

None of this is a finding about LLM intent. It's a finding about where the model's default writing distribution sits in stylistic space — and where it sits is the part of the space historically occupied by social privilege. Identity prompting doesn't pull it out of that neighborhood. If you're an applicant whose voice isn't already adjacent to that default, the practical implication is that letting AI write — rather than brainstorm with you — puts the heaviest homogenization layer exactly where your essay most needs your own texture to come through.

Use AI for the parts of the workflow that survive the data. Brainstorm with it. Outline with it. Polish a paragraph you already wrote. But the first draft, the one that decides what the essay is about and what it sounds like — that one still has to come from you.


See also: What the Research Says About AI in College Admissions: 4 Major Studies · Why "I'm First-Gen Latina" Doesn't Make ChatGPT Sound Like You · The Words That Make a College Essay Sound AI-Written · The Income Cliff in College-Application AI Use · How Admissions Readers Evaluate AI-Assisted Essays · ChatGPT vs Real College Essays · The Dumbcrafting Epidemic
