We aim to build a useful, reproducible, democratized benchmark for learning household robotic manipulation from human videos. Realizing this goal requires a diverse, high-quality human video dataset curated specifically for robots. To evaluate learning progress, a simulated twin environment that resembles the appearance and dynamics of the physical world lets roboticists and AI researchers validate their algorithms convincingly and efficiently before testing on a real robot. We introduce RoboTube, a benchmark platform that lowers the barrier to robotics research while facilitating reproducible research in the community.


We build a diverse and high-quality human video demonstration dataset with multiple functionalities.

Construction Overview


To benchmark the baseline methods, we construct a suite of simulated twin environments, RT-sim. With RT-sim, researchers can make a fair comparison of their approaches with the baseline methods and can validate their algorithms convincingly and efficiently before conducting more complex experiments on real robots.

Paper & Code

Latest Paper Version: OpenReview

GitHub (Coming Soon)




To cite this work, please use the following BibTeX entry:
title={RoboTube: Learning Household Manipulation from Human Videos with Simulated Twin Environments},
author={Haoyu Xiong and Haoyuan Fu and Jieyi Zhang and Chen Bao and Qiang Zhang and Yongxi Huang and Wenqiang Xu and Animesh Garg and Cewu Lu},
booktitle={6th Annual Conference on Robot Learning},
If you have any questions, please feel free to contact Haoyu Xiong.