$70 million is already chasing the least glamorous part of robotics: collecting enough robot training data to make physical AI work outside a demo room.

XDOF Wrings $70M From Dirty Robot Training Data Race
XOOMAR Intelligence
Analyst Take
That is the real signal behind XDOF, the startup emerging from stealth with backing from Thrive Capital, Spark Capital, a16z, Lux, and WndrCo, according to TechCrunch. The company is betting that the next bottleneck in robotics won’t be chips or model architecture. It will be the dirty data loop: collecting, cleaning, annotating, evaluating, and repeating physical interactions at scale.
The timing matters. Two weeks ago, OpenAI said it would relaunch the robotics program it shuttered in 2021. That move fits a broader push among frontier labs to make AI useful in the physical world. But unlike LLMs, which trained on oceans of text already sitting online, robots don’t have a web-scale archive of useful motion, force, manipulation, and embodied interaction data.
XOOMAR analysis: XDOF is selling the shovel in a race where everyone wants to mine “physical AI,” but few want to build the warehouse, hire the operators, calibrate the machines, and label the footage.
XDOF starts with 20 customers and a $70 million war chest
XDOF, pronounced “ecks-doff,” has about 60 employees and is already working with 20 customers, including several frontier AI labs, co-founder and CEO Philippe Wu told TechCrunch. He declined to name them.
“All of the top labs are trying to pursue robotics,” Wu said. “We’ve already seen some of the downfalls of falling a little bit behind in the language model race … you don’t want to be in this type of situation where you pursue this technology too late, and everyone is in this boat where physical AI is the next frontier.”
That quote says more than a launch announcement. AI labs don’t want to repeat the LLM race from behind. If robot foundation models become strategically important, access to high-quality robot training data could become as important as access to GPUs or research talent.
Wu’s own path explains the company’s thesis. As a PhD student at UC Berkeley, he worked on enabling robots to learn skills from large-scale datasets. The blocker was not theory. It was supply.
“We didn’t have large-scale data to work with,” Wu told TechCrunch. “There was this chicken-and-egg problem — we first needed to actually collect data before we could even ask how to train a foundation model for robotics.”
That gap became the business.
130,000 trajectories show how scarce good robotics data still is
XDOF is partnering with UC Berkeley’s AI Research lab to release ABC, a dataset the company believes is the largest collection of high-quality robot training data ever assembled.
It includes:
- 130,000 trajectories of robot manipulation data
- 300 hours of simulation
- 100 hours of evaluations
The team has already used the data to train robots on benchmark tasks such as folding T-shirts, flattening boxes, and loading AirPods into their cases.
That is meaningful scale for robotics. It also shows the brutal mismatch with language AI. Text and image models benefited from data that already existed in public or semi-public digital form. Robot data has to be produced through physical action. Every example may involve hardware, space, objects, cameras, human operators, maintenance, calibration, and later annotation.
| Data source | Why it scales | Why robotics is harder |
|---|---|---|
| Text and images | Large stores already existed online | Physical interaction is not naturally archived at useful fidelity |
| YouTube and gig footage | Easy to collect in volume | TechCrunch says it can be low-fidelity and hard to reconcile with the physical world |
| Robot teleoperation | Produces targeted demonstrations | Requires robots, operators, calibration, and task setup |
| Simulation | Can generate variation | Still needs real-world grounding and evaluation |
XOOMAR analysis: the numbers around ABC are not just a flex. They expose the supply chain problem. If 130,000 trajectories is release-worthy scale, then the industry is still early in building the equivalent of a serious physical data layer.
That echoes a broader pattern in AI infrastructure. Investors don’t only fund glamorous models. They also fund the systems that make models testable, auditable, and usable, a theme we covered in “$27M Bet Pushes Pramaana Labs to Make AI Prove Itself.”
Clean robot demos don’t solve messy physical work
Robotics demos are persuasive because they compress difficulty into a clean clip. But XDOF’s business exists because deployment requires more than a robot completing a task once under controlled conditions.
TechCrunch reports that Wu and XDOF co-founder and CTO Fred Shentu previously worked on GELLO, a low-cost teleoperation system that lets a human operator control a robotic arm to generate training data. Wu said the paper became influential because “a lot of people had similar needs and bottlenecks.”
The bottleneck is not just data volume. It’s data fit.
XDOF plans to operate across three tiers of a data pyramid:
- Robot-specific teleoperation: data collected on the actual robot being deployed
- General teleoperated robot data: systems like GELLO collecting broader manipulation examples
- Egocentric human data: humans performing everyday tasks, captured through wearable sensors XDOF plans to build
Wu’s point on hardware choice is sharp because it cuts against the idea that any footage will do.
“Your camera choice is going to affect the quality of your data — which is going to affect how your hand-tracking algorithm performs,” Wu said. “If you don’t design the hardware well from the start, the data you collect might have very specific problems that you didn’t anticipate.”
XOOMAR analysis: this is where XDOF’s value moves beyond “data vendor.” If the company can shape collection tools, annotation systems, and evaluation workflows together, it can sell a feedback loop, not just a dataset.
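To make the three-tier data pyramid described above concrete, here is a minimal Python sketch of how collected episodes might be tagged by tier and summarized. It is an illustration under stated assumptions, not XDOF's schema: the `Tier`, `Episode`, and `hours_by_tier` names, the field layout, and the sample durations are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Hypothetical labels for the three tiers named in the article."""
    ROBOT_SPECIFIC_TELEOP = 1  # collected on the actual robot being deployed
    GENERAL_TELEOP = 2         # broader teleoperated manipulation examples
    EGOCENTRIC_HUMAN = 3       # humans doing tasks, captured by wearables


@dataclass(frozen=True)
class Episode:
    """One recorded demonstration. Fields are assumptions for illustration."""
    episode_id: str
    tier: Tier
    task: str
    duration_s: float


def hours_by_tier(episodes):
    """Sum recorded hours per tier so the data mix is visible at a glance."""
    totals = Counter()
    for ep in episodes:
        totals[ep.tier] += ep.duration_s / 3600.0
    return dict(totals)


# Made-up sample episodes; the task names echo the benchmark tasks in the article.
episodes = [
    Episode("ep-001", Tier.ROBOT_SPECIFIC_TELEOP, "fold_tshirt", 1800.0),
    Episode("ep-002", Tier.GENERAL_TELEOP, "flatten_box", 3600.0),
    Episode("ep-003", Tier.EGOCENTRIC_HUMAN, "load_airpods", 5400.0),
    Episode("ep-004", Tier.EGOCENTRIC_HUMAN, "fold_tshirt", 1800.0),
]

mix = hours_by_tier(episodes)
for tier, hours in mix.items():
    print(f"{tier.name}: {hours:.1f} h")
```

Tagging each episode with its tier at collection time is one simple way to keep "data fit" auditable later, since a trainer can check how much of a dataset actually came from the robot being deployed.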
Robotics wants a shared-dataset moment without an internet-sized shortcut
David McAllister, a Berkeley PhD student who helped organize the ABC release, framed the academic upside directly.
“We’ve seen in language, image generation, and other fields, that when models and data are released, the community achieves things that you wouldn’t necessarily have expected,” McAllister told TechCrunch.
That is the optimistic case for ABC. Shared data can create unexpected research gains. It can also make benchmarks less anecdotal and more comparable.
But robotics faces a harsher scaling curve than software-native AI. The web did not accidentally record enough high-quality robot manipulation data. Companies have to manufacture it. That means people operating robots, people wearing sensors, people maintaining equipment, and people deciding what counts as useful training material.
Wu is blunt about why major labs may outsource this work.
“You need a warehouse of hundreds of thousands of square feet with hundreds of robots,” Wu said. “You need to maintain these robots, calibrate their physical parameters, and properly train operators.”
That is not a typical research lab function. It sounds closer to logistics, workforce management, and data operations. The physical AI race may be won partly by whoever can turn that operational grind into repeatable infrastructure.
A related labor question is already visible in other automation pushes. As we reported in “500 Bowls an Hour Pits Wonder Robot Kitchen Against Labor,” robotics stories often become labor stories once machines leave the demo floor.
AI labs, robot makers, workers, and customers are pulling on the same data pipeline
For frontier labs, outsourced robot training data offers speed. They can pursue robotics without first building a giant physical data operation from scratch.
For robot companies, better data could make models less brittle across tasks. TechCrunch notes that XDOF is not focused only on data provision. It is also building data cleaning, tooling, and annotation systems, which are meant to create a self-reinforcing loop for robot trainers.
For workers, XDOF’s model points to a new labor layer in AI. The company plans to hire and train armies of teleoperators and egocentric data operators around the world. That work may be repetitive. It may also become essential, much like labeling work became essential to earlier AI systems.
For customers, the issue is trust. If outsourced datasets shape how robots behave, buyers will eventually care about where the data came from, how it was evaluated, and whether the model’s performance translates to their own environment. That is XOOMAR analysis, but it follows directly from XDOF’s focus on collection quality, hardware design, annotation, and evaluation.
The name XDOF captures the ambition. It plays on “degrees of freedom,” the robotics term for the number of independent ways a robot’s joints can move. TechCrunch notes that a human arm from shoulder to wrist has seven degrees of freedom, while Figure.AI’s latest robot has 30.
Wu said the “X” means: “Arbitrary degrees of freedom, unlimited degrees of freedom.”
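Degrees of freedom also set the width of every data point a robot produces. As a rough sketch, a trajectory can be modeled as one row per timestep and one column per degree of freedom. The 7 and 30 figures come from the article; the 50 Hz control rate and the function names are assumptions chosen only for illustration.

```python
ARM_DOF = 7        # human arm, shoulder to wrist (figure cited in the article)
HUMANOID_DOF = 30  # Figure.AI's latest robot (figure cited in the article)
CONTROL_HZ = 50    # ASSUMPTION: illustrative control rate, not from the source


def empty_trajectory(dof, seconds, control_hz=CONTROL_HZ):
    """Zero-filled trajectory: one row per timestep, one column per joint."""
    steps = int(seconds * control_hz)
    return [[0.0] * dof for _ in range(steps)]


def values_per_second(dof, control_hz=CONTROL_HZ):
    """Joint values recorded each second at the assumed control rate."""
    return dof * control_hz


arm = empty_trajectory(ARM_DOF, seconds=2)
humanoid = empty_trajectory(HUMANOID_DOF, seconds=2)

print(len(arm), "timesteps x", len(arm[0]), "joints")
print(len(humanoid), "timesteps x", len(humanoid[0]), "joints")
print(values_per_second(ARM_DOF), "vs", values_per_second(HUMANOID_DOF), "values per second")
```

Under these assumptions the same two-second demonstration carries more than four times as many joint values on the 30-DOF robot as on the 7-DOF arm, before counting camera or force data.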
Proprietary robot training data may become the moat
Hardware alone is unlikely to be enough if competitors can buy similar components and train similar models. XOOMAR analysis: the harder moat may be proprietary embodied data, specialized collection systems, and evaluation loops tied to real tasks.
XDOF appears built around that logic. It is not merely collecting footage. It wants to own the pipeline around data collection tools, data cleaning, annotation, and feedback for model trainers.
That matters because physical AI has less room for vague claims. A chatbot can fail softly. A robot fails in physical space, around objects, people, equipment, and time-sensitive workflows. The source material does not give safety incident data, so we should not overstate the risk. But the operational stakes are plainly different when the model controls hardware.
The risk for smaller robotics teams is also clear. If the best data pipelines are expensive, labor-intensive, and tied up by frontier labs or large robotics companies, smaller players may face a harder path. They may depend more on limited datasets, simulation, or narrow in-house collection.
That does not mean XDOF wins by default. It means the market is moving toward a harder question: who controls the physical data layer?
XDOF-style data factories could decide which robots leave the demo floor
The next phase to watch is not whether robotics labs can produce better videos. It is whether they can build or buy repeatable data systems that improve models across real tasks.
Evidence that would strengthen XDOF’s thesis includes more named frontier lab customers, broader adoption of ABC, measurable gains on benchmark tasks, and proof that its three-tier data pyramid improves model performance beyond isolated demonstrations.
Evidence that would weaken it would be just as important: if simulation reduces the need for real-world collection faster than expected, if labs decide to build giant internal data operations, or if XDOF’s datasets fail to generalize beyond the environments where they were collected.
For now, the signal is clear. XDOF is turning unglamorous physical work into AI infrastructure. The companies that master that grind may shape embodied AI more than the ones with the slickest launch clips.
The Bottom Line
- Robotics may be entering a new race where proprietary physical-world data becomes a key advantage.
- XDOF’s early customer traction suggests major AI labs are outsourcing the hardest parts of robot training.
- OpenAI’s robotics relaunch shows frontier labs increasingly see physical AI as the next major battleground.
LLM Training Data vs. Robot Training Data
| Area | LLM Training | Robot Training |
|---|---|---|
| Data availability | Large volumes of text already existed online | Useful motion, force, manipulation, and embodied interaction data is scarce |
| Core workflow | Train on digital text datasets | Collect, clean, annotate, evaluate, and repeat physical interactions |
| Main bottleneck | Model and compute race | Real-world data collection at scale |
Written by
XOOMAR Insights Team
Research and Editorial Desk
The XOOMAR Insights Team pairs automated research with human editorial judgment. We track hundreds of sources across technology, fintech, trading, SaaS, and cybersecurity, cross-check the facts, and explain what happened, why it matters, and what to watch next. We do not just rewrite headlines. Every article is fact-checked and scored for reliability before it goes live, and we link back to the original sources so you can verify anything yourself.