The Human Data Supply Chain Behind Frontier AI
Executive summary
A few years ago, "data labelling" mostly meant turning messy inputs into supervised-learning examples: draw the bounding box, classify the image, transcribe the audio, tag the document, clean the row.
That market still exists. It is just no longer where the leverage sits.
Current market signals point to a different workload, one that sits closer to frontier model improvement. A lab finds a model weakness. Someone has to turn that weakness into a task. Someone has to write the rubric. Someone has to find qualified contributors. Someone has to decide whether a response is merely fluent or actually correct. Someone has to review contested answers, prevent leakage, measure acceptance rates, and ship usable data back into post-training or evaluation loops quickly enough to affect the next model run.
That is not generic labelling. It is a human judgment supply chain.
The strongest companies in this market are less likely to be the cheapest labor vendors or the largest pools of workers. They are more likely to be operators that can convert scarce, trusted, domain-specific human judgment into accepted model improvement faster than many buyers can build the same system internally.
That changes the map. Scale AI remains the visible incumbent because it positioned early around data infrastructure for model builders through products such as Data Engine and RLHF: Scale: Data Engine and Scale: Rlhf. But the reported Meta stake in Scale exposed the core risk: neutrality is now part of the product. Reuters reported that Google planned to split from Scale after Meta's investment because rival labs worried about strategic and technical leakage: Reuters: Google Scale Ais Largest Customer Plans Split After Meta Deal Sources Say 06 13.
That creates room for challengers with different wedges. Mercor is trying to own expert supply and routing: Mercor. Handshake AI is trying to convert a student and graduate employment network into AI data operations: Joinhandshake: Introducing Handshake AI. Turing is using a technical-talent and services base to sell LLM training and synthetic-data workflows: Turing: LLM Training And Development. Surge AI appears to be a quality-managed operator in data labelling and RLHF, though public primary evidence is thinner than for some peers: Surgeai: Data Labeling. Labelbox, SuperAnnotate, Braintrust, and internal lab tools compete around workflow control, contributor management, and quality systems.
The market appears attractive because model teams continue to need better human signal. It is also fragile because the most valuable work may be internalized, tightly secured, or partly automated over time. The better profit pool is less about "hours of annotation" and more about task design, expert routing, quality control, security posture, neutrality, and measurable accepted-output economics.
What changed
The old data-labelling business was about producing labeled examples at scale.
The new human-data business is about producing trusted judgments that change model behavior.
That difference sounds semantic until you look at the workflow. In commodity annotation, the buyer can often define the task clearly: label the object, classify the text, tag the image, transcribe the clip. Quality matters, but the task is usually legible before the vendor starts.
In frontier-model work, the task itself is often the hard part. The buyer may start with a vague model failure: the model gives unsafe medical advice, misses edge cases in code review, reasons poorly about tax scenarios, over-refuses harmless requests, fails a multilingual domain, or performs well on public benchmarks but badly on internal evals.
Before any contributor can work, someone has to translate that weakness into a data program:
- What should the task ask?
- What does a good answer look like?
- Who is qualified to judge it?
- What disagreement is acceptable?
- Which examples are contaminated or too easy?
- Which outputs should become training data, eval data, red-team material, or discarded noise?
- What evidence proves the work improved the targeted failure class?
That is why the market moved up-stack. The scarce asset is not raw human time. It is reliable human judgment packaged into a repeatable operating system.
Why now
The timing comes from three reinforcing shifts:
- Frontier model teams now spend more energy on post-training quality and evaluation loops, where human judgment remains a practical bottleneck.
- AI product teams increasingly need continuous feedback pipelines rather than one-off dataset projects.
- Strategic trust risks around model-data vendors, highlighted by the reported Meta stake in Scale, made neutrality and security more central in buyer decisions.
The value chain
The human data supply chain starts with a model-improvement need and ends with accepted data that a buyer can use.
The first step is failure definition. A lab, AI product team, or enterprise identifies a behavior it wants to improve or measure, for example factuality in a domain, code correctness, refusal behavior, tool-use reliability, math reasoning, safety boundaries, multilingual nuance, or customer-specific accuracy.
The second step is task and rubric design. This is the most underestimated control point. A weak rubric creates noisy data even with strong contributors. A strong rubric narrows ambiguity, defines edge cases, gives reviewers a common language, and makes disagreement measurable.
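One way to make "disagreement measurable" concrete is a chance-corrected agreement score between two reviewers who apply the same rubric. The sketch below uses Cohen's kappa, which is a standard statistic chosen here for illustration and not a method the vendors discussed are known to use. The reviewer labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two reviewers on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty item set")
    n = len(labels_a)
    # Share of items where the two reviewers gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected if each reviewer labelled at random with their own label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical rubric verdicts from two reviewers on eight responses.
reviewer_a = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
reviewer_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

print(f"kappa = {cohens_kappa(reviewer_a, reviewer_b):.2f}")
```

A program could track this score per rubric version: if kappa rises after a rubric revision with the same reviewers, the revision probably reduced ambiguity.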
The third step is contributor sourcing. Different tasks need different people: generalist annotators, trained reviewers, software engineers, doctors, lawyers, finance analysts, STEM graduate students, bilingual specialists, safety testers, or domain experts. The vendor has to find, qualify, route, and retain them.
The fourth step is production. Contributors label, rank, critique, solve, red-team, write, compare, or evaluate model outputs.
The fifth step is quality control. This is often where margin is won or lost. Work may need automated checks, gold tasks, peer review, expert adjudication, calibration rounds, fraud detection, rework loops, and buyer-side acceptance gates.
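Two of these controls, gold tasks and expert adjudication, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the 0.75 agreement threshold, and the sample data are all hypothetical, and real programs would layer more checks on top.

```python
from collections import Counter


def gold_accuracy(submissions, gold):
    """Share of a contributor's answers that match known-correct gold tasks.

    submissions: {task_id: answer}; gold: {task_id: correct_answer}.
    Returns None when the contributor saw no gold tasks.
    """
    scored = [(task_id, answer) for task_id, answer in submissions.items() if task_id in gold]
    if not scored:
        return None
    return sum(gold[task_id] == answer for task_id, answer in scored) / len(scored)


def resolve(votes, min_agreement=0.75):
    """Accept the majority label on strong consensus, else escalate to an expert."""
    if not votes:
        raise ValueError("need at least one vote")
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label, "auto"
    return None, "adjudicate"


# Hypothetical data: one contributor's answers and a small gold set.
print(gold_accuracy({"t1": "A", "t2": "B", "t3": "A"}, {"t1": "A", "t2": "A"}))
print(resolve(["A", "A", "A", "B"]))
print(resolve(["A", "B", "A", "B"]))
```

The threshold is the design lever: raising it sends more items to expert adjudication, which costs more per item but reduces the chance that a contested answer ships as accepted data.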
The sixth step is delivery into training or evaluation systems. The output needs metadata, provenance, versioning, and enough structure to be usable downstream.
The seventh step is feedback. Accepted and rejected work should update contributor scores, reviewer rules, routing, rubrics, and future task design.
The real product is this loop. A vendor that mainly supplies labor is more exposed. A vendor that owns more of the loop can start to look like infrastructure.
Operating mechanics that decide outcomes
Depth in this market usually shows up in operating mechanics rather than top-line positioning.
Rubric quality is one example. A weak rubric often creates high disagreement and expensive rework, even with strong contributors. A stronger rubric usually defines boundary cases, escalation criteria, and explicit adjudication rules up front.
Reviewer calibration is another example. Teams that run periodic calibration rounds, track reviewer drift, and enforce contributor-to-reviewer routing rules tend to produce more stable accepted-output rates than teams that rely on one-pass throughput.
Fraud and contamination controls also matter. As expert-task pay rises, incentives for low-quality or misrepresented work rise as well. Vendors with stronger qualification checks, provenance tracking, and dispute workflows generally have a better chance of preserving buyer trust.
Finally, accepted-output measurement discipline is a practical differentiator. Programs that track acceptance, rework, adjudication load, and turnaround by task family can usually iterate operating design faster than programs that only report volume.
The economics: accepted output, not hourly cost
One common buyer mistake is comparing vendors only by hourly rate or gross task volume.
In many programs, a better unit is cost per accepted output.
A $40/hour contributor who produces accepted work quickly may be cheaper than a $12/hour contributor whose work requires rework, escalation, and review. A vendor with a higher headline price may be cheaper if its task design produces lower disagreement, higher acceptance, and faster delivery.
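The comparison can be worked through with a small calculation. In this sketch only the two hourly rates come from the example above. The throughput, acceptance rates, and review overhead are illustrative assumptions, not measured figures, and the `ContributorProfile` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ContributorProfile:
    hourly_rate: float           # contributor pay in dollars per hour
    tasks_per_hour: float        # assumed throughput (hypothetical)
    acceptance_rate: float       # assumed share of submitted tasks the buyer accepts (hypothetical)
    review_cost_per_task: float  # assumed review and rework overhead per submitted task (hypothetical)


def cost_per_accepted_output(p: ContributorProfile) -> float:
    """Cost of one submitted task, divided by the probability that it is accepted."""
    cost_per_submitted = p.hourly_rate / p.tasks_per_hour + p.review_cost_per_task
    return cost_per_submitted / p.acceptance_rate


expert = ContributorProfile(hourly_rate=40.0, tasks_per_hour=4.0,
                            acceptance_rate=0.8, review_cost_per_task=2.0)
generalist = ContributorProfile(hourly_rate=12.0, tasks_per_hour=2.0,
                                acceptance_rate=0.4, review_cost_per_task=6.0)

print(f"expert: ${cost_per_accepted_output(expert):.2f} per accepted task")
print(f"generalist: ${cost_per_accepted_output(generalist):.2f} per accepted task")
```

Under these assumed inputs the $40/hour contributor comes out at about $15 per accepted task and the $12/hour contributor at about $30. The ranking depends entirely on the inputs, which is the point: the hourly rate alone does not determine it.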
The operating metrics that matter are:
- Cost per accepted task by task family.
- Rework rate.
- Disagreement rate.
- Reviewer load per accepted item.
- Time to accepted output.
- Contributor acceptance rate by cohort.
- Fraud or low-quality work rate.
- Eval delta or model-lift proxy on the targeted failure class.
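Several of these metrics can be derived from a simple per-task log. The sketch below is one possible shape, assuming a hypothetical log schema; every field name and value is illustrative, and eval delta is left out because it requires model-side measurement.

```python
from collections import defaultdict

# Hypothetical task log; field names and values are illustrative only.
task_log = [
    {"family": "code_review", "cost": 20.0, "accepted": True, "rework_rounds": 0, "reviews": 1, "hours_to_accept": 6.0},
    {"family": "code_review", "cost": 26.0, "accepted": True, "rework_rounds": 1, "reviews": 2, "hours_to_accept": 30.0},
    {"family": "code_review", "cost": 18.0, "accepted": False, "rework_rounds": 2, "reviews": 3, "hours_to_accept": None},
    {"family": "medical_qa", "cost": 60.0, "accepted": True, "rework_rounds": 0, "reviews": 2, "hours_to_accept": 12.0},
]


def summarize(log):
    """Compute accepted-output metrics per task family."""
    by_family = defaultdict(list)
    for row in log:
        by_family[row["family"]].append(row)

    summary = {}
    for family, rows in by_family.items():
        accepted = [r for r in rows if r["accepted"]]
        n_accepted = len(accepted)
        summary[family] = {
            "acceptance_rate": n_accepted / len(rows),
            # Share of tasks that needed at least one rework round.
            "rework_rate": sum(r["rework_rounds"] > 0 for r in rows) / len(rows),
            # All spend, including rejected work, is charged to the accepted items.
            "cost_per_accepted": sum(r["cost"] for r in rows) / n_accepted if n_accepted else None,
            "reviews_per_accepted": sum(r["reviews"] for r in rows) / n_accepted if n_accepted else None,
            "mean_hours_to_accept": (
                sum(r["hours_to_accept"] for r in accepted) / n_accepted if n_accepted else None
            ),
        }
    return summary


for family, metrics in summarize(task_log).items():
    print(family, metrics)
```

Charging rejected work to accepted items is a deliberate choice in this sketch: it keeps the cost figure honest about waste, which a raw per-task average would hide.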
That is why the business can look like services on the surface but behave more like infrastructure underneath. The buyer is purchasing throughput through a quality system, not labor by itself.
For vendors, margins may depend less on worker count than on accepted-output efficiency. If expert compensation rises, reviewer load grows, and buyer rejection rates increase, a marketplace can drift toward thin-margin labor-broker economics. If task design, routing, and QA improve accepted-output yield, a vendor may sustain reliability-based pricing rather than time-based pricing.
That is a core margin question for Mercor, Handshake AI, Turing, Surge, Scale, and the rest of the market: can they turn human judgment into a managed system, or are they mainly reselling labor into a temporary demand spike?
Buyer and budget owner
The buyer is not one persona.
In frontier labs, budget can sit with post-training teams, data teams, eval teams, safety teams, research operations, or product-model teams. The work is close to model development, so the spend behaves like performance infrastructure.
In enterprises, the buyer may be AI platform, product engineering, data science, compliance, customer operations, or a business unit building domain-specific AI. These buyers often start with a narrow dataset or eval need, then discover that the harder problem is an ongoing human feedback loop.
In regulated or government contexts, procurement may care more about auditability, security posture, worker access, geography, and confidentiality than raw throughput.
This split matters because each buyer optimizes for something different. Frontier labs usually care about speed and quality under secrecy. Enterprise buyers usually care about domain reliability and integration. Regulated buyers usually care about controls. A single vendor can claim to serve all three, but the operating model is rarely identical.
Company archetypes
The companies in this market are often described as if they all compete in one bucket. That hides the real differences.
Incumbents and challengers
At a high level, Scale AI remains the incumbent center of gravity because of its footprint and infrastructure positioning. The main challenger lanes are:
- Managed quality specialists (for example, Surge AI).
- Expert-supply and routing networks (for example, Mercor and Handshake AI).
- Technical talent plus managed services hybrids (for example, Turing).
- Tooling and control-layer vendors (for example, Labelbox, SuperAnnotate, Braintrust, and internal stacks).
Scale AI: the infrastructure incumbent with neutrality risk
Scale has the strongest market memory. It is the company most associated with turning data operations into AI infrastructure. Its Data Engine and RLHF positioning show the right ambition: own the data loop around model improvement rather than operate as a task marketplace.
The problem is trust. Reuters reported that Meta's large stake in Scale raised concerns among competing labs, including Google: Reuters: Google Scale Ais Largest Customer Plans Split After Meta Deal Sources Say 06 13. Whether every reported customer move plays out exactly as described matters less than the lesson: when the buyer's model roadmaps, failure modes, safety work, and technical blueprints are embedded in a vendor workflow, neutrality becomes a product feature.
Scale's upside case is that it remains the trusted infrastructure partner for enough buyers and moves deeper into secure, integrated, high-quality data workflows. The risk is that ownership concerns turn an incumbent advantage into a customer-concentration and neutrality problem.
Surge AI: the quality-managed specialist
Surge is harder to evaluate from public primary material. Its site clearly positions around AI training data and data labelling: Surgeai and Surgeai: Data Labeling. Market perception often places it in the high-quality managed-operations camp. Its financial-scale claims, however, are mostly visible through reported coverage rather than audited disclosure, for example Reuters/U.S. News reporting on funding discussions: Usnews: Surge AI Seeks Up To 1 Billion In Funding After Revenue Tops 1 Billion Sources S.
The question is whether Surge has a quality system buyers can verify, or whether it is a strong services operator in a market where public proof points are limited. If Surge can prove better accepted-output economics, it can win even without the broadest narrative. If it cannot, it risks being grouped with other managed vendors.
Mercor: the expert-routing network
Mercor's wedge is expert supply. Its public positioning is about organizing human intelligence for AI work: Mercor. Reported third-party coverage frames the company as moving from recruiting toward expert data supply for AI labs, with valuation and pay claims in private-market reporting: Techcrunch: Mercor Quintuples Valuation To 10b With 350m Series C and Techcrunch: How AI Labs Use Mercor To Get The Data Companies Wont Share. Frontier AI work increasingly needs people who can judge hard tasks, not merely complete simple labels.
The upside is clear: if Mercor can source, qualify, and route scarce experts faster than labs can do internally, it becomes a critical supply layer.
The risk is also clear: expert marketplaces are expensive to operate. Quality can drift as supply scales. Fraud and credential inflation become real problems. Expert wages can rise faster than vendor pricing. The business only gets interesting if Mercor owns routing, quality, and buyer trust. Access to a list of people is not enough.
Handshake AI: distribution-first expert supply
Handshake AI is interesting because it starts with distribution. Its launch material says it uses its network to source graduate-level experts and manage quality/data production for AI labs: Joinhandshake: Introducing Handshake AI. Its expert program markets flexible AI work to experts: Joinhandshake: Expert AI. Additional support documentation and reported interviews reinforce the same positioning around graduate/STEM expert workflows: Support: Introduction To The Handshake AI Program and Business Insider: Handshake Ceo AI Training Evolving Generalists To Stem Experts Pay 7.
That gives Handshake a plausible supply advantage in student, graduate, and early-career expert labor. The question is whether that supply advantage translates into reliable output for hard model-improvement work. A large network does not automatically become a quality system. Handshake has to prove qualification, routing, review, and retention in domains where wrong answers are expensive.
Turing: technical talent plus managed AI services
Turing sits between expert network and managed services. It markets LLM training and synthetic-data services: Turing: LLM Training And Development and Turing: LLM Synthetic Data Training. Reuters reported that Turing tripled revenue to $300 million and had access to a large expert network including developers and PhDs: Reuters: AI Data Startup Turing Triples Revenue 300 Million 01 28.
Turing starts with technical labor credibility. It may be especially relevant where tasks require software engineering, code review, reasoning, or technical evaluation. Its risk is breadth. A company that sells many AI services has to show that human-data workflows are a repeatable operating system, not one more services line.
Labelbox, SuperAnnotate, Braintrust, and the control layer
Not every buyer wants a black-box managed service. Some want tooling, workflow control, data lineage, annotation interfaces, model-assisted review, or verified contributor networks.
That is where Labelbox, SuperAnnotate, Braintrust, and internal tooling matter. The control layer becomes more attractive when buyers want their own process, their own security boundaries, or their own contributor oversight. The risk for standalone tooling is that supply access and managed execution may matter more than the interface. The risk for managed vendors is the reverse: buyers may decide the highest-value workflows are too strategic to outsource blindly.
The tension: outsource, internalize, or automate
The market's biggest uncertainty is less about whether human data matters and more about which slices stay external.
Labs have strong reasons to outsource:
- faster access to specialized supply,
- burst capacity,
- operational complexity,
- lower fixed cost,
- outside workflow expertise,
- coverage across many domains.
They also have strong reasons to internalize:
- confidentiality,
- contamination control,
- direct feedback to researchers,
- protection of future model capabilities,
- better integration with internal evals,
- lower strategic dependency.
The likely outcome appears mixed. Commodity and bursty work can stay outsourced. Some expert work can stay outsourced when supply access matters. Crown-jewel workflows, especially those tied to frontier capabilities, safety policy, and unreleased model behavior, may move inside labs or into heavily controlled vendor environments.
Automation adds another pressure. Synthetic data, model-generated tasks, automated graders, and evaluator models will compress low-end work. But automation does not simply remove humans. It often moves humans upward: designing rubrics, auditing model-generated data, adjudicating hard examples, testing failures, and validating evals.
The vendor that survives automation is not the one clinging to manual tasks. It is the one that orchestrates humans and automation into a better quality system.
Where profit and control accrue
A practical view is that control may accrue in five places.
First, task design. The vendor that can turn a model failure into a useful data program has leverage before labor begins.
Second, expert qualification. If a vendor can prove who is actually qualified for a task, it can charge for trust.
Third, routing and review. Matching work to the right contributor and reviewer is where quality and margin are decided.
Fourth, security and neutrality. Buyers need to trust that their model outputs, roadmaps, customer data, and failure modes will not leak.
Fifth, integration into eval and post-training loops. The closer the vendor gets to the buyer's model-improvement cycle, the harder it is to replace.
The weakest profit pool is typically undifferentiated annotation labor. The strongest pool is more likely recurring, high-trust, high-complexity human-data infrastructure where accepted-output economics are visible and switching costs are operational.
Regulation and risk
There is no single "data labelling regulation" that defines the market. The risk surface is more practical:
- labor classification,
- contractor compliance,
- cross-border data transfer,
- privacy,
- IP ownership,
- confidentiality,
- export-control-adjacent sensitivity,
- regulated domain work,
- worker access to sensitive model outputs,
- provenance and contamination,
- auditability.
For frontier labs, the trust risk can be as important as formal compliance. If a vendor sees prompts, model failures, safety policies, evaluation targets, unreleased behaviors, or training priorities across competing labs, the vendor sits near sensitive competitive information.
That is why the Scale/Meta episode matters beyond Scale. It shows how quickly a human-data vendor can become a trust boundary.
Winners, losers, and archetypes
Likely winners:
- Neutral expert-data operators with strong QA, security, and accepted-output metrics.
- Expert networks that can actually verify and route scarce talent.
- Managed operators that prove better yield, not only more throughput.
- Tooling/control layers that help buyers keep sensitive workflows in-house.
- Hybrid systems that combine software, human operations, and automated checks.
Likely losers:
- Low-cost annotation shops without quality differentiation.
- Marketplaces that cannot police contributor quality.
- Vendors dependent on one major buyer or platform.
- Tools without supply, workflow gravity, or integration into post-training/eval loops.
- Any company selling "more labels" when the buyer needs better judgment.
Upside case
The upside case is that frontier AI turns human data into a lasting infrastructure market.
Models keep improving, but the evaluation frontier keeps moving. Harder models create harder failure modes. Enterprises need domain-specific accuracy. Labs need fresh evals, safety tests, preference data, expert reasoning, and adversarial review. Automation handles more simple work, but humans move into higher-value judgment, rubric design, and audit.
If that happens, the strongest vendors become part of the model-improvement stack. They are trusted operating partners for post-training and evaluation, not BPO providers with an AI wrapper.
Risks and constraints
The main risk is that the category is a demand spike with weak long-term margins.
Labs internalize the most valuable work. Automated evaluators reduce paid human volume. Expert wages rise. Buyers multi-source vendors. Low-end annotation gets compressed. Managed services start looking like staffing. Tooling gets bundled into broader AI platforms. Neutrality issues make buyers cautious about sharing sensitive workflows.
If that happens, some companies still survive, but the market is smaller and less defensible than current funding narratives suggest.
What would change the thesis
The external-vendor case gets stronger if major labs publicly commit to long-running third-party expert-data programs, especially with evidence of better quality-adjusted outcomes than internal teams.
The expert-network case gets stronger if Mercor, Handshake AI, Turing, or similar players publish credible acceptance-rate, quality, retention, or eval-impact proof points by domain.
The tooling/control-layer case gets stronger if more enterprises and labs choose to run human-data programs internally while buying software for routing, QA, provenance, and review.
The case gets weaker if automated evaluator systems repeatedly replace paid human loops in complex domains without quality regressions, or if labs materially reduce third-party human-data spend for several model cycles.
The case also gets weaker for any vendor that cannot explain its accepted-output economics. Worker count, GMV, and task volume are not enough.
What to watch next
Does Scale's neutrality issue become a one-time disruption, or a lasting opening for neutral vendors?
Can Mercor and Handshake AI maintain quality as they scale expert supply?
Does Turing turn technical talent into a repeatable post-training data system, or stay closer to a broad services bundle?
Can Surge offer stronger public proof of quality-managed execution?
Do labs build more internal data and eval operations? That is the biggest signal.
Do buyers start asking vendors for accepted-output metrics instead of staffing capacity?
Also watch whether the tooling layer becomes more important as sensitive workflows move inside the buyer's boundary.
The old label was data labelling. The newer market is closer to human judgment infrastructure. That may be a better market, but it is also harsher. It rewards trust, quality systems, and operating discipline. Vendors still selling labor volume will feel the pressure first.
Sources
- Scale Data Engine: Scale: Data Engine
- Scale RLHF: Scale: Rlhf
- Scale: Scale
- Reuters on Google planning to split from Scale after Meta deal: Reuters: Google Scale Ais Largest Customer Plans Split After Meta Deal Sources Say 06 13
- Reuters on Meta and Scale AI deal scrutiny: Reuters: Metas 148 Billion Scale AI Deal Latest Test AI Partnerships 06 13
- Mercor: Mercor
- Handshake AI launch: Joinhandshake: Introducing Handshake AI
- Handshake AI labs: Joinhandshake: Labs
- Handshake expert program: Joinhandshake: Expert AI
- Turing LLM training and development: Turing: LLM Training And Development
- Turing synthetic data training: Turing: LLM Synthetic Data Training
- Reuters on Turing revenue and expert network: Reuters: AI Data Startup Turing Triples Revenue 300 Million 01 28
- Surge AI: Surgeai
- Surge AI data labelling: Surgeai: Data Labeling
- Braintrust human data: Usebraintrust: Human Data
- Labelbox: Labelbox
- SuperAnnotate: Superannotate