EgoGuide: Egocentric Guidance for Efficient Robot-Free Demonstration Collection and Learning

Yue Xu¹, Mingtao Nie¹, Tianle Li¹, Hong Li¹, Yibo Luo¹, Siyuan Huang³, Yong-Lu Li¹,²*

¹Shanghai Jiao Tong University, ²Shanghai Innovation Institute, ³Beijing Institute for General Artificial Intelligence

Paper | Demo | Code (coming later) | Assembly Guide (coming later)

Overview of EgoGuide online data curation, hardware, and gated egocentric residual policy

EgoGuide improves UMI-style robot-free data collection with AR-based coverage guidance and learns to use egocentric context through a gated residual policy.

Abstract

Robot learning from real-world demonstrations is currently constrained by data scaling. Universal Manipulation Interface (UMI) provides an efficient robot-free data collection interface, yet current UMI-style pipelines often collect redundant demonstrations and lack global scene context. To improve data efficiency, we present EgoGuide, a collection interface that records synchronized wrist and head/egocentric observations and couples them with online visual-geometric data quality guidance. We also introduce a Gated Egocentric Residual Policy for robust learning from a viewpoint-varying egocentric camera, allowing head/egocentric context to correct ambiguous local observations while preserving stable wrist-view control. Real-world experiments show that EgoGuide reduces the required number of data episodes and improves data efficiency. The residual policy further improves robustness under visual occlusion.

TL;DR

#1: Guided Data Collection

EgoGuide tells the collector, inside AR, whether the current wrist view, ego view, and wrist pose are already covered by the dataset. In plain terms: it nudges people away from recording another near-duplicate demo and toward useful new states.
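The coverage check described above can be pictured with a small sketch. This is not the EgoGuide implementation: it assumes each channel (wrist view, ego view, wrist pose) has already been reduced to a fixed-length feature vector, and it treats a state as "covered" when its nearest recorded neighbour lies within a per-channel distance threshold. The function names, channel keys, and thresholds are all hypothetical.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def nearest_distance(query: Vector, bank: List[Vector]) -> float:
    """Euclidean distance from `query` to its nearest neighbour in `bank`.

    Returns infinity for an empty bank, so nothing counts as covered yet.
    """
    return min((math.dist(query, item) for item in bank), default=math.inf)


def coverage_report(
    current: Dict[str, Vector],
    dataset: Dict[str, List[Vector]],
    thresholds: Dict[str, float],
) -> Dict[str, bool]:
    """Per channel: True if the current state is already close to recorded data."""
    return {
        name: nearest_distance(current[name], dataset[name]) <= thresholds[name]
        for name in current
    }


def should_record(report: Dict[str, bool]) -> bool:
    """Assumed rule: a demo is worth recording if any channel is not yet covered."""
    return not all(report.values())


# Toy features (hypothetical): two recorded wrist views, one ego view, one wrist pose.
dataset = {
    "wrist_view": [[0.0, 0.0], [1.0, 0.0]],
    "ego_view": [[0.5, 0.5]],
    "wrist_pose": [[0.1, 0.2, 0.3]],
}
thresholds = {"wrist_view": 0.2, "ego_view": 0.2, "wrist_pose": 0.05}
current = {
    "wrist_view": [0.05, 0.0],
    "ego_view": [0.5, 0.55],
    "wrist_pose": [0.4, 0.2, 0.3],
}

report = coverage_report(current, dataset, thresholds)
print(report, should_record(report))
```

In this toy case both views are near recorded data but the wrist pose is new, so the sketch would encourage the collector to record. A real system would likely use learned image features and an approximate nearest-neighbour index instead of a linear scan.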

# 2: Gated Ego-residual Policy

GERP (Gated Egocentric Residual Policy) keeps the wrist-camera policy as the stable default, then lets the egocentric camera apply a gated correction when the local wrist view is ambiguous, occluded, or missing broader task context.
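One way to read "gated correction" is as an additive residual scaled by a gate between 0 and 1. The sketch below illustrates that reading only; the page does not specify GERP's architecture, so the sigmoid gate, the additive form, and every name here are assumptions.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a gate value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual_action(
    wrist_action: Sequence[float],
    ego_residual: Sequence[float],
    gate_logit: float,
) -> List[float]:
    """Combine a wrist-view action with an egocentric correction.

    Assumed form: action = wrist_action + gate * ego_residual.
    A gate near 0 leaves the wrist policy untouched; a gate near 1
    applies the full egocentric correction.
    """
    gate = sigmoid(gate_logit)
    return [a + gate * r for a, r in zip(wrist_action, ego_residual)]


wrist_action = [1.0, 0.0, -0.5]   # hypothetical output of the wrist-view policy
ego_residual = [0.2, 0.4, 0.0]    # hypothetical correction from the ego branch

closed = gated_residual_action(wrist_action, ego_residual, gate_logit=-50.0)
opened = gated_residual_action(wrist_action, ego_residual, gate_logit=50.0)
print(closed, opened)
```

With a strongly negative gate logit the output stays at the wrist action, which mirrors the idea of preserving stable wrist-view control; with a strongly positive logit the egocentric correction is applied in full. In a learned policy the gate logit would presumably be predicted from the observations rather than set by hand.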

10% -> 50%

EgoGuide raises the Pepper Sorting success rate (200 demos).

2X Speed

Pepper Sorting reaches comparable success using only half as many demonstrations.

+5% to +10%

GERP improves Pepper Sorting success and task progress over wrist-only policies under more challenging perception conditions.

System

EgoGuide-UMI synchronizes the wrist camera image, wrist pose, gripper state, head/egocentric image, and head pose. The workstation estimates data coverage online and sends simple feedback to the AR interface before recording.
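As a rough illustration of what synchronizing these streams can involve, the sketch below aligns streams by nearest timestamp against a reference stream and drops reference frames that lack a close enough match. This is a common generic approach, not a description of the EgoGuide-UMI pipeline; the choice of the wrist camera as reference, the `max_skew` tolerance, and all names are assumptions.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple


def nearest_index(timestamps: List[float], t: float) -> int:
    """Index of the entry in a sorted, non-empty list that is closest to `t`."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align_streams(
    reference: List[float],
    streams: Dict[str, List[float]],
    max_skew: float,
) -> List[Tuple[float, Dict[str, int]]]:
    """For each reference timestamp, find the nearest sample index in every stream.

    A reference frame is dropped if any stream is empty or has no sample
    within `max_skew` seconds.
    """
    aligned = []
    for t in reference:
        row: Dict[str, int] = {}
        for name, stamps in streams.items():
            if not stamps:
                break
            j = nearest_index(stamps, t)
            if abs(stamps[j] - t) > max_skew:
                break
            row[name] = j
        else:
            aligned.append((t, row))
    return aligned


# Hypothetical timestamps in seconds: wrist camera as reference, head camera slower.
wrist_stamps = [0.000, 0.033, 0.066, 0.100]
head_stamps = [0.001, 0.034, 0.099]

aligned = align_streams(wrist_stamps, {"head": head_stamps}, max_skew=0.01)
print(aligned)
```

Here the third wrist frame has no head sample within 10 ms, so it is dropped, while the other three are paired with head samples 0, 1, and 2. Pose and gripper streams could be added as further entries in `streams`.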

Results

Data scaling comparison between unguided collection and EgoGuide-guided collection

Across standard UMI-style tasks and challenging occlusion cases, EgoGuide-guided data scales better than unguided collection, while the gated residual policy uses egocentric context without replacing the stable wrist-view controller.

Demos

Data collection and evaluation take place in different scenes more than 100 km apart, to demonstrate the policy's generalization and robustness.