XDOF works on robot training data for physical AI: collecting, cleaning, annotating, and evaluating physical interaction data at scale, then repeating the loop.

How many customers does XDOF have?

XDOF has 20 customers, including several frontier AI labs, according to co-founder and CEO Philippe Wu.

What is the ABC robotics dataset?

ABC is a robot training dataset XDOF is releasing with UC Berkeley’s AI Research lab. It includes 130,000 robot manipulation trajectories, 300 hours of simulation, and 100 hours of evaluations.

Why is robot training data harder to scale than LLM training data?

Unlike text used for LLMs, robot training data does not already exist at web scale. It must be produced through physical actions involving hardware, objects, cameras, operators, calibration, and annotation.

What robot tasks has XDOF’s data been used for?

The data has been used to train robots on benchmark tasks such as folding T-shirts, flattening boxes, and loading AirPods into their cases.

XDOF Wrings $70M From Dirty Robot Training Data Race

XDOF is emerging from stealth with a $70 million war chest and backing from Thrive Capital, Spark Capital, a16z, Lux, and WndrCo, according to TechCrunch. The company is betting that the next bottleneck in robotics won’t be chips or model architecture. It will be the dirty data loop: collecting, cleaning, annotating, evaluating, and repeating physical interactions at scale.

The timing matters. Two weeks ago, OpenAI said it would relaunch the robotics program it shuttered in 2021. That move fits a broader push among frontier labs to make AI useful in the physical world. But unlike LLMs, which trained on oceans of text already sitting online, robots don’t have a web-scale archive of useful motion, force, manipulation, and embodied interaction data.

XOOMAR analysis: XDOF is selling the shovel in a race where everyone wants to mine “physical AI,” but few want to build the warehouse, hire the operators, calibrate the machines, and label the footage.

XDOF starts with 20 customers and a $70 million war chest

XDOF, pronounced “ecks-doff,” has about 60 employees and is already working with 20 customers, including several frontier AI labs, co-founder and CEO Philippe Wu told TechCrunch. He declined to name them.

“All of the top labs are trying to pursue robotics,” Wu said. “We’ve already seen some of the downfalls of falling a little bit behind in the language model race … you don’t want to be in this type of situation where you pursue this technology too late, and everyone is in this boat where physical AI is the next frontier.”

That quote says more than a launch announcement. AI labs don’t want to repeat the LLM race from behind. If robot foundation models become strategically important, access to high-quality robot training data could become as important as access to GPUs or research talent.

Wu’s own path explains the company’s thesis. As a PhD student at UC Berkeley, he worked on enabling robots to learn skills from large-scale datasets. The blocker was not theory. It was supply.

“We didn’t have large-scale data to work with,” Wu told TechCrunch. “There was this chicken-and-egg problem — we first needed to actually collect data before we could even ask how to train a foundation model for robotics.”

That gap became the business.

130,000 trajectories show how scarce good robotics data still is

XDOF is partnering with UC Berkeley’s AI Research lab to release ABC, a dataset the company believes is the largest collection of high-quality robot training data ever assembled.

It includes:

130,000 trajectories of robot manipulation data
300 hours of simulation
100 hours of evaluations

The team has already used the data to train robots on benchmark tasks such as folding T-shirts, flattening boxes, and loading AirPods into their cases.

That is meaningful scale for robotics. It also shows the brutal mismatch with language AI. Text and image models benefited from data that already existed in public or semi-public digital form. Robot data has to be produced through physical action. Every example may involve hardware, space, objects, cameras, human operators, maintenance, calibration, and later annotation.

Data source	Why it scales	Why robotics is harder
Text and images	Large stores already existed online	Physical interaction is not naturally archived at useful fidelity
YouTube and gig footage	Easy to collect in volume	TechCrunch says it can be low-fidelity and hard to reconcile with the physical world
Robot teleoperation	Produces targeted demonstrations	Requires robots, operators, calibration, and task setup
Simulation	Can generate variation	Still needs real-world grounding and evaluation

XOOMAR analysis: the numbers around ABC are not just a flex. They expose the supply chain problem. If 130,000 trajectories is release-worthy scale, then the industry is still early in building the equivalent of a serious physical data layer.

That echoes a broader pattern in AI infrastructure. Investors don’t only fund glamorous models. They also fund the systems that make models testable, auditable, and usable, a theme we covered in “$27M Bet Pushes Pramaana Labs to Make AI Prove Itself.”

Clean robot demos don’t solve messy physical work

Robotics demos are persuasive because they compress difficulty into a clean clip. But XDOF’s business exists because deployment requires more than a robot completing a task once under controlled conditions.

TechCrunch reports that Wu and XDOF co-founder and CTO Fred Shentu previously worked on GELLO, a low-cost teleoperation system that lets a human operator control a robotic arm to generate training data. Wu said the paper became influential because “a lot of people had similar needs and bottlenecks.”

The bottleneck is not just data volume. It’s data fit.

XDOF plans to operate across three tiers of a data pyramid:

Robot-specific teleoperation: data collected on the actual robot being deployed
General teleoperated robot data: systems like GELLO collecting broader manipulation examples
Egocentric human data: humans performing everyday tasks, captured through wearable sensors XDOF plans to build

Wu’s point on hardware choice is sharp because it cuts against the idea that any footage will do.

“Your camera choice is going to affect the quality of your data — which is going to affect how your hand-tracking algorithm performs,” Wu said. “If you don’t design the hardware well from the start, the data you collect might have very specific problems that you didn’t anticipate.”

XOOMAR analysis: this is where XDOF’s value moves beyond “data vendor.” If the company can shape collection tools, annotation systems, and evaluation workflows together, it can sell a feedback loop, not just a dataset.

Robotics wants a shared-dataset moment without an internet-sized shortcut

David McAllister, a Berkeley PhD student who helped organize the ABC release, framed the academic upside directly.

“We’ve seen in language, image generation, and other fields, that when models and data are released, the community achieves things that you wouldn’t necessarily have expected,” McAllister told TechCrunch.

That is the optimistic case for ABC. Shared data can create unexpected research gains. It can also make benchmarks less anecdotal and more comparable.

But robotics faces a harsher scaling curve than software-native AI. The web did not accidentally record enough high-quality robot manipulation data. Companies have to manufacture it. That means people operating robots, people wearing sensors, people maintaining equipment, and people deciding what counts as useful training material.

Wu is blunt about why major labs may outsource this work.

“You need a warehouse of hundreds of thousands of square feet with hundreds of robots,” Wu said. “You need to maintain these robots, calibrate their physical parameters, and properly train operators.”

That is not a typical research lab function. It sounds closer to logistics, workforce management, and data operations. The physical AI race may be won partly by whoever can turn that operational grind into repeatable infrastructure.

A related labor question is already visible in other automation pushes. As we reported in “500 Bowls an Hour Pits Wonder Robot Kitchen Against Labor,” robotics stories often become labor stories once machines leave the demo floor.

AI labs, robot makers, workers, and customers are pulling on the same data pipeline

For frontier labs, outsourced robot training data offers speed. They can pursue robotics without first building a giant physical data operation from scratch.

For robot companies, better data could make models less brittle across tasks. TechCrunch notes that XDOF is not focused only on data provision. It is also building data cleaning, tooling, and annotation systems, which are meant to create a self-reinforcing loop for robot trainers.

For workers, XDOF’s model points to a new labor layer in AI. The company plans to hire and train armies of teleoperators and egocentric data operators around the world. That work may be repetitive. It may also become essential, much like labeling work became essential to earlier AI systems.

For customers, the issue is trust. If outsourced datasets shape how robots behave, buyers will eventually care about where the data came from, how it was evaluated, and whether the model’s performance translates to their own environment. That is XOOMAR analysis, but it follows directly from XDOF’s focus on collection quality, hardware design, annotation, and evaluation.

The name XDOF captures the ambition. It plays on “degrees of freedom,” the robotics term for independent motions a robot can perform. TechCrunch notes that a human arm from shoulder to wrist has seven degrees of freedom, while Figure.AI’s latest robot has 30.

Wu said the “X” means: “Arbitrary degrees of freedom, unlimited degrees of freedom.”

Proprietary robot training data may become the moat

Hardware alone is unlikely to be enough if competitors can buy similar components and train similar models. XOOMAR analysis: the harder moat may be proprietary embodied data, specialized collection systems, and evaluation loops tied to real tasks.

XDOF appears built around that logic. It is not merely collecting footage. It wants to own the pipeline around data collection tools, data cleaning, annotation, and feedback for model trainers.

That matters because physical AI has less room for vague claims. A chatbot can fail softly. A robot fails in physical space, around objects, people, and equipment, and inside time-sensitive workflows. TechCrunch’s reporting does not include safety incident data, so the risk should not be overstated. But the operational stakes are plainly different when the model controls hardware.

The risk for smaller robotics teams is also clear. If the best data pipelines are expensive, labor-intensive, and tied up by frontier labs or large robotics companies, smaller players may face a harder path. They may depend more on limited datasets, simulation, or narrow in-house collection.

That does not mean XDOF wins by default. It means the market is moving toward a harder question: who controls the physical data layer?

XDOF-style data factories could decide which robots leave the demo floor

The next phase to watch is not whether robotics labs can produce better videos. It is whether they can build or buy repeatable data systems that improve models across real tasks.

Evidence that would strengthen XDOF’s thesis includes more named frontier lab customers, broader adoption of ABC, measurable gains on benchmark tasks, and proof that its three-tier data pyramid improves model performance beyond isolated demonstrations.

Evidence that would weaken it would be just as important: if simulation reduces the need for real-world collection faster than expected, if labs decide to build giant internal data operations, or if XDOF’s datasets fail to generalize beyond the environments where they were collected.

For now, the signal is clear. XDOF is turning unglamorous physical work into AI infrastructure. The companies that master that grind may shape embodied AI more than the ones with the slickest launch clips.

The Bottom Line

Robotics may be entering a new race where proprietary physical-world data becomes a key advantage.
XDOF’s early customer traction suggests major AI labs are outsourcing the hardest parts of robot training.
OpenAI’s robotics relaunch shows frontier labs increasingly see physical AI as the next major battleground.

LLM Training Data vs. Robot Training Data

Area	LLM Training	Robot Training
Data availability	Large volumes of text already existed online	Useful motion, force, manipulation, and embodied interaction data is scarce
Core workflow	Train on digital text datasets	Collect, clean, annotate, evaluate, and repeat physical interactions
Main bottleneck	Model and compute race	Real-world data collection at scale


XOOMAR Insights Team
