Recruiting is one of the most consequential decisions a business makes. Who you hire shapes your culture, your capability, your trajectory. And it is also, at scale, one of the most volume-intensive and error-prone processes in any organization. When a single job posting generates 1,000 applications, the quality of your screening process determines not just who gets hired, but who gets a fair chance to be considered at all.

We have built and operated AI recruiting systems. We have processed tens of thousands of resumes through them. We have tracked outcomes. We have found things that worked, things that failed, and a few things that genuinely disturbed us enough to redesign significant portions of the system. This article is an honest accounting of all of it: not a sales pitch for AI recruiting and not a screed against it, but a practical assessment from people who have operated these systems with real candidates and real stakes.

The Initial Problem

The specific situation that drove the first system: a high-growth company with ongoing hiring across multiple roles was receiving 800-1,200 applications per open position. Their recruiting team of five people was doing initial screening manually, at roughly 50-75 applications reviewed per recruiter per day, about 250-375 per day across the team, which meant the average application waited 3-5 business days before a human saw it. Strong candidates who had applied to multiple companies were accepting other offers before anyone at this company had read their resume.

There was also a quality problem in the manual screening. Five different recruiters applying five different mental models of what constituted a qualified candidate produced inconsistent results. The same resume, reviewed by different recruiters on different days, could receive dramatically different assessments. This inconsistency was invisible in the process (no one was tracking it), but it was systematically reducing the quality of the candidate pool reaching hiring managers.

3-5 days
average wait for initial human review, meaning qualified candidates were frequently lost to faster-moving competitors before anyone made contact
Pre-deployment baseline measurement

How the System Works

The core screening function is semantic matching, not keyword matching. This is a critical distinction that is worth explaining carefully. Keyword-based screening filters for exact term matches: if the job description says "Python" and the resume says "Python," it matches; if the resume says "I built the data pipeline infrastructure" without using the word "Python," it may not. Keyword screening is fast and simple, and it systematically penalizes strong candidates who describe their experience in plain language rather than in the precise vocabulary of the job posting.

Semantic matching uses embedding models to understand the meaning of experience descriptions and compare them against the meaning of job requirements, not just the surface vocabulary. A candidate who "led migration from on-premises Oracle databases to cloud infrastructure" matches the semantic intent of "cloud data engineering experience" even without using the exact phrase. This change alone, before any other system improvement, materially increased the quality of candidates surfaced.
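To make the distinction concrete, here is a minimal sketch of the two approaches side by side. It assumes the open-source sentence-transformers library; the model choice, example texts, and output format are illustrative, not the production configuration described in this article.

```python
# Minimal sketch: keyword matching vs. semantic matching.
# Assumes the sentence-transformers library; the model and examples are
# illustrative, not the production system described in this article.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

requirement = "Python development experience building data pipelines"
resume_lines = [
    "I built the data pipeline infrastructure for our nightly ETL jobs",
    "Managed the retail store's scheduling and inventory",
]

# Keyword matching: an exact-term check misses the first candidate,
# who never uses the word "Python".
keyword_hits = ["python" in line.lower() for line in resume_lines]

# Semantic matching: embed both sides and compare meaning, not vocabulary.
requirement_vec = model.encode(requirement, convert_to_tensor=True)
resume_vecs = model.encode(resume_lines, convert_to_tensor=True)
similarities = util.cos_sim(requirement_vec, resume_vecs)[0]

for line, hit, score in zip(resume_lines, keyword_hits, similarities):
    print(f"keyword_match={hit}  semantic_score={float(score):.2f}  {line}")
```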

The system also extracts structured skills data from unstructured resume text: years of experience by skill category, seniority indicators, career trajectory patterns, and specific technical certifications. This extraction is not perfect (resume formats vary enormously and parsing errors occur), but it is consistent in a way that human screening is not, and consistency at scale has value even when per-item accuracy is imperfect.
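As a rough illustration of what "structured extraction" means in practice, the sketch below pulls years-by-skill estimates, a few certification patterns, and seniority cues out of raw resume text. The field names and regex heuristics are assumptions made for the example; a production parser is considerably more involved.

```python
# Illustrative structured extraction from raw resume text.
# Field names and regex heuristics are assumptions, not the production parser.
import re
from dataclasses import dataclass, field

@dataclass
class ExtractedProfile:
    skill_years: dict[str, float] = field(default_factory=dict)  # skill -> estimated years
    certifications: list[str] = field(default_factory=list)
    seniority_signals: list[str] = field(default_factory=list)

def extract_profile(text: str) -> ExtractedProfile:
    profile = ExtractedProfile()
    # Phrases like "5 years of Python" or "7+ years Java" -> years by skill.
    for years, skill in re.findall(r"(\d+)\+?\s+years?\s+(?:of\s+)?([A-Za-z+#.]+)", text):
        profile.skill_years[skill.lower()] = float(years)
    # A short, illustrative list of certification patterns.
    profile.certifications = re.findall(
        r"\b(?:PMP|CISSP|CKA|AWS Certified Solutions Architect)\b", text
    )
    # Seniority cues taken from title lines.
    for cue in ("Senior", "Staff", "Principal", "Lead"):
        if re.search(rf"\b{cue}\b", text):
            profile.seniority_signals.append(cue)
    return profile
```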

Red flag detection runs in parallel with positive matching: employment history anomalies, credential claims that don't align with described experience levels, and other patterns that experienced recruiters would flag manually. These are surfaced to human reviewers, not used to auto-reject candidates.
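One example of such a check, sketched below, flags unusually long gaps between consecutive roles for a reviewer to look at. The 12-month threshold and the input shape are assumptions for the sketch; the point is that the output is a note for a human, never a rejection.

```python
# One illustrative red-flag check: long gaps between consecutive roles.
# The 12-month threshold and input shape are assumptions for this sketch.
from datetime import date

def employment_gap_flags(roles: list[tuple[date, date]], max_gap_months: int = 12) -> list[str]:
    """Return human-readable flags for unusually long gaps between roles.

    Flags are surfaced to a reviewer; they are never used to auto-reject."""
    ordered = sorted(roles, key=lambda span: span[0])
    flags = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = (next_start.year - prev_end.year) * 12 + (next_start.month - prev_end.month)
        if gap > max_gap_months:
            flags.append(
                f"{gap}-month gap between role ending {prev_end} and role starting {next_start}"
            )
    return flags
```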

The Bias Problem

Let me be direct about this, because it is the most important thing in this article and the part that most AI recruiting vendors prefer to handle with marketing language rather than specific honesty.

AI recruiting systems reproduce and amplify human hiring biases unless you actively, specifically, and continuously work to counter them. This is not a hypothetical risk. It is a mathematical certainty rooted in how machine learning works. If you train a matching model on historical hiring decisions (who was hired, who was rejected) and those historical decisions reflected bias, consciously or not, the model learns to replicate that bias at massive scale. You have not automated recruiting. You have automated discrimination.

What we found in our first deployed system: the semantic matching model, trained on historical successful hires, had learned to prefer candidates from a set of roughly 15 universities, all of them expensive and predominantly attended by economically advantaged students. It had also learned a mild preference for names statistically associated with certain demographic groups, because those names had appeared more frequently in the historical "hire" category. Neither of these patterns was intentional. Both of them were real and both of them were wrong.

⚠
How We Found the Bias

We ran a structured audit: took 500 historical rejected resumes, stripped all identity signals (names, graduation years that could indicate age, school names), and re-ran them through the screening system. A statistically significant portion of previously rejected candidates scored in the "advance" range when identity signals were removed. This is not a comfortable finding. It means the previous screening process, both human and AI, had rejected qualified candidates for reasons unrelated to their qualifications. We redesigned the training data and the feature set, then ran the audit again. We run it quarterly now.
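A minimal sketch of that audit is below. The redaction rules are simplified, and score_resume stands in for whatever scoring function the screening system exposes; a real audit would also apply a significance test to the before-and-after advance rates rather than eyeballing the fraction.

```python
# Minimal sketch of the identity-stripped re-scoring audit described above.
# `score_resume` is a placeholder for the system's scoring function, and the
# redaction rules are simplified; a real audit adds a significance test.
import re

def redact_identity(text: str, names: list[str], schools: list[str]) -> str:
    """Remove names, school names, and graduation years before re-scoring."""
    for term in names + schools:
        text = re.sub(re.escape(term), "[REDACTED]", text, flags=re.IGNORECASE)
    # Strip four-digit years that could signal age.
    return re.sub(r"\b(?:19|20)\d{2}\b", "[YEAR]", text)

def flipped_advance_rate(rejected: list[dict], score_resume, advance_threshold: float = 0.7) -> float:
    """Fraction of previously rejected candidates that score in the 'advance'
    range once identity signals are removed."""
    flipped = sum(
        1 for r in rejected
        if score_resume(redact_identity(r["text"], r["names"], r["schools"])) >= advance_threshold
    )
    return flipped / len(rejected)
```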

The specific changes we made: removed institutional affiliation as a direct signal in the model (replaced with accreditation tier and field relevance), stripped name-based signals entirely from the screening layer, added explicit diversity auditing to the weekly reporting dashboard, and established a manual review sample of the bottom 15% of AI scores to catch cases where the model was confidently wrong. We did not eliminate bias. We reduced it and built ongoing monitoring to catch it when it re-emerges, because it will.
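The last of those changes, the manual review sample, can be as simple as the sketch below: rank by AI score, take the bottom 15%, and hand a random sample to human reviewers. The sample size and field names are placeholders, not the production configuration.

```python
# Illustrative bottom-15% manual review sample; sample size and field
# names are placeholders, not the production configuration.
import random

def manual_review_sample(scored_candidates: list[dict],
                         bottom_fraction: float = 0.15,
                         sample_size: int = 50) -> list[dict]:
    """Randomly sample from the lowest-scoring candidates so reviewers can
    catch cases where the model is confidently wrong."""
    ranked = sorted(scored_candidates, key=lambda c: c["score"])
    bottom = ranked[: max(1, int(len(ranked) * bottom_fraction))]
    return random.sample(bottom, min(sample_size, len(bottom)))
```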

💡
Bias Auditing Is Not Optional

If you deploy an AI recruiting system without a formal bias auditing program, you are not running an objective process. You are running a biased process at higher velocity than before, with less human visibility into what the bias is doing. The audit is not a nice-to-have. It is the control that makes the system responsible rather than harmful.

What AI Can and Cannot Assess

AI is genuinely good at assessing some things and genuinely unable to assess others. Conflating these two categories is where AI recruiting systems fail candidates and fail organizations.

AI can assess declared skills against job requirements with good reliability. It can identify consistent career progression patterns. It can extract and structure information from unformatted text accurately at scale. It can apply scoring criteria consistently across thousands of candidates in a way no team of human reviewers can. These capabilities have real value.

AI cannot assess culture fit. It cannot assess intellectual honesty: the difference between a candidate who genuinely understands something and a candidate who has learned the vocabulary without the understanding. It cannot assess how someone functions under pressure, how they treat colleagues they disagree with, or whether they will do what they say they will do. It cannot assess growth trajectory from within a role: the person who started as a junior and became indispensable in ways that never show up on a resume title line.

The failure mode we see most often is using AI screening scores as a proxy for all of these things. A high AI match score means "this person's declared experience aligns with our requirements." It does not mean anything else. Organizations that treat it as a holistic candidate quality score are making decisions based on a measure they do not understand, and they are excluding candidates who might be their best hires.

The Candidate Experience Angle

There is a dimension to AI recruiting that does not get enough attention: the experience of the candidates moving through the system. These are people, often anxious, often with real economic need, whose careers are affected by decisions the system makes. How the system treats them matters, not just ethically but practically, because candidate experience is a direct signal of employer brand.

The minimum standard we hold our systems to: every candidate gets a response within 48 hours of application. Rejected candidates get a specific, honest reason when possible (or an honest general category when specific feedback would expose the model's logic to gaming). Candidates who reach the interview stage and are rejected get a phone call, not a form email. Human beings are reachable at any point in the process for candidates who need them.
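The 48-hour standard is easy to monitor. A sketch of the check is below; the field names, and the idea of pulling application records from whatever applicant-tracking system sits upstream, are assumptions for the example.

```python
# Sketch of a 48-hour acknowledgment check; field names are assumptions
# for whatever applicant-tracking system sits upstream of the screener.
from datetime import datetime, timedelta, timezone

def overdue_acknowledgments(applications: list[dict],
                            sla: timedelta = timedelta(hours=48)) -> list[dict]:
    """Return applications that have not received any response within the SLA."""
    now = datetime.now(timezone.utc)
    return [
        app for app in applications
        if app.get("acknowledged_at") is None and now - app["applied_at"] > sla
    ]
```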

These are not expensive requirements to implement. They are choices about how much the organization values the time and dignity of people applying to work there. An AI system that processes 10,000 applications efficiently but leaves 9,800 people with no acknowledgment or feedback is not a good system. It is an efficient machine for communicating that the organization does not care about the people it does not hire.

Where Human Judgment Remains Irreplaceable

The final stage of any hiring decision should always be a human being who has actually talked to the candidate. No resume score, no AI assessment, no structured interview rubric replaces the judgment call that comes from a conversation, not because that judgment is always better, but because the accountability for the decision belongs with a human who can be held responsible for it.

"AI can tell you who looks qualified on paper. Only a human can tell you who you actually want working next to you, and only a human should be accountable for that answer."

Fred Lackey, DevThing LLC

AI recruiting systems, built and operated responsibly, can handle the volume problem and the consistency problem and give human recruiters more time to do the high-judgment work that actually determines hiring quality. That is a genuine and valuable outcome. But it requires discipline about where the AI's authority ends and the human's begins, and it requires ongoing honest assessment of whether the system is helping people get fair consideration or systematically preventing it.

Quarterly
bias audit cadence we recommend for any production AI recruiting system: not annually, not "when we have time," but quarterly, as a standing operational requirement
DevThing operational standard

The question is not whether to use AI in recruiting. At scale, the alternative is not careful human review; the alternative is overwhelmed humans making fast, biased, inconsistent decisions because they have no other choice. The question is whether to use AI with the rigor the stakes demand. That rigor is available. Most vendors do not offer it. You have to require it.