【SLAI Seminar】Session 23: Bridging the Virtual-Physical Gap: The Path to Generalization of Embodied Intelligence Driven by Foundational Models (Jan 19, 2:30pm)
The 23rd session of the SLAI Seminar will discuss "Bridging the Virtual-Physical Gap: The Path to Generalization of Embodied Intelligence Driven by Foundational Models" from 2:30pm to 4:00pm on Monday, January 19, at the B411 Lecture Hall. Online participation is welcome (Tencent Meeting ID: 250-582-805).
Topic: Bridging the Virtual-Physical Gap: The Path to Generalization of Embodied Intelligence Driven by Foundational Models
Time: 2:30pm-4:00pm, Monday, January 19, 2026
Venue: B411 Lecture Hall, Hetao College, Shenzhen
Online participation: Tencent Meeting ID 250-582-805
About the Speaker:
Hang Su is an associate research fellow in the Department of Computer Science and Technology at Tsinghua University and a Young Top-Notch Talent of the National "Ten Thousand Talents Program". His research focuses on robust machine learning and embodied decision-making. He has published more than 100 papers in CCF-recommended Class-A conferences and journals, with over 15,000 Google Scholar citations. He serves as an editorial board member of the top-tier artificial intelligence journals IEEE TPAMI and Artificial Intelligence, and chairs the IEEE Generative AI Security Working Group. His honors include the First Prize of the Wu Wenjun Artificial Intelligence Natural Science Award, the ICME Platinum Best Paper Award, the MICCAI Young Scientist Award, and the AVSS Best Paper Award, and he has led teams to championships in several international academic competitions, including the NeurIPS 2017 Adversarial Attack and Defense Challenge. He is currently an executive committee member of the Youth Working Committee of the China Society of Image and Graphics, and previously served as Chair of the VALSE Executive AC Committee, Area Chair for NeurIPS 2021, and Workshop Co-Chair for AAAI 2022.
Abstract:
Insufficient generalization is the core bottleneck preventing embodied intelligence from moving beyond the laboratory and adapting to complex real-world environments. With the breakthroughs of foundation models in language and vision, building foundational models for embodied intelligence has become a key pathway to cross-task and cross-platform transfer. This talk proposes a systematic, data-driven, capability-evolution-oriented strategy that leverages three kinds of data—real-world data, simulated data, and video data—to advance the generalization ability of embodied foundational models in stages. First, starting from high-quality real robot data, we integrate physical priors with cross-embodiment multimodal diffusion pretraining to build a unified action-space model; in dual-arm manipulation tasks, this model exhibits strong robustness and good transfer performance, markedly improving adaptation to real physical environments and control consistency. Building on this, we introduce medium-scale simulated data and, within the ManiBox framework, propose a bounding-box-guided policy distillation technique that effectively narrows the Sim2Real transfer gap. Finally, we explore the potential of large-scale, loosely structured video data under weak supervision, designing a video-action model that combines diffusion pretraining with masked action modeling to drive cross-modal knowledge transfer from visual input to embodied control, further strengthening perceptual generalization and cross-platform deployment flexibility. Overall, this data-evolution path from "high-quality, small-scale" to "low-quality, large-scale" provides systematic support for the layered advancement of embodied foundational model capabilities, laying a theoretical foundation and technical route for their evolution toward generality and industrial deployment.
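The three stages in the abstract map onto fairly standard training objectives. As a rough illustration of the first stage, the following is a minimal PyTorch sketch of diffusion-based pretraining over a unified action space: a denoiser conditioned on an observation embedding learns to predict the noise added to an action chunk. All dimensions, module names, and the simple MLP backbone are illustrative assumptions, not the speaker's actual architecture.

```python
# Minimal sketch of diffusion-based action pretraining: a denoiser predicts
# the noise injected into an action chunk, conditioned on an observation
# embedding. Sizes below (14-DoF dual-arm action, 16-step chunk) are assumed.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, HORIZON, T = 64, 14, 16, 1000

class ActionDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM * HORIZON + 1, 256), nn.ReLU(),
            nn.Linear(256, ACT_DIM * HORIZON),
        )

    def forward(self, obs, noisy_actions, t):
        # Predict the noise added to the flattened action chunk.
        x = torch.cat([obs, noisy_actions.flatten(1),
                       t.float().unsqueeze(1) / T], dim=1)
        return self.net(x).view(-1, HORIZON, ACT_DIM)

betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model, obs, actions):
    # Standard DDPM objective: noise the clean chunk at a random timestep
    # and regress the injected noise.
    t = torch.randint(0, T, (obs.shape[0],))
    noise = torch.randn_like(actions)
    a_bar = alphas_bar[t].view(-1, 1, 1)
    noisy = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise
    return nn.functional.mse_loss(model(obs, noisy, t), noise)
```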
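For the second stage, here is a hedged sketch of bounding-box-guided policy distillation in the spirit of ManiBox: a teacher policy with privileged simulator state supervises a student that sees only object bounding boxes and proprioception—inputs that transfer to the real robot. The dimensions and network shapes are assumptions for illustration, not the published ManiBox design.

```python
# Sketch of privileged-teacher distillation: the student must reproduce the
# teacher's actions from Sim2Real-transferable inputs (bounding box + joints).
import torch
import torch.nn as nn

STATE_DIM, BBOX_DIM, PROPRIO_DIM, ACT_DIM = 32, 4, 14, 14

teacher = nn.Sequential(nn.Linear(STATE_DIM, 128), nn.ReLU(),
                        nn.Linear(128, ACT_DIM))  # trained in simulation
student = nn.Sequential(nn.Linear(BBOX_DIM + PROPRIO_DIM, 128), nn.ReLU(),
                        nn.Linear(128, ACT_DIM))

def distill_step(sim_state, bbox, proprio, optimizer):
    # The teacher's action on the privileged sim state is the target.
    with torch.no_grad():
        target = teacher(sim_state)
    pred = student(torch.cat([bbox, proprio], dim=1))
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```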
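For the third stage, an illustrative sketch of masked action modeling: random action tokens in a trajectory are hidden, and the model reconstructs them from video features plus the visible actions. The feature dimensions and the small Transformer are again assumed for the example.

```python
# Sketch of masked action modeling over weakly supervised video trajectories.
import torch
import torch.nn as nn

FEAT_DIM, ACT_DIM, SEQ = 64, 14, 16

class MaskedActionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(FEAT_DIM + ACT_DIM, 128)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(128, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(128, ACT_DIM)

    def forward(self, video_feats, actions, mask):
        # Zero out masked action tokens; the encoder must recover them from
        # visual context and the unmasked actions.
        visible = actions * (~mask).unsqueeze(-1)
        h = self.encoder(self.proj(torch.cat([video_feats, visible], dim=-1)))
        return self.head(h)

def mam_loss(model, video_feats, actions, mask_ratio=0.5):
    mask = torch.rand(actions.shape[:2]) < mask_ratio  # (B, SEQ) bool mask
    pred = model(video_feats, actions, mask)
    return nn.functional.mse_loss(pred[mask], actions[mask])

# Example usage on random tensors standing in for video features / actions:
model = MaskedActionModel()
loss = mam_loss(model, torch.randn(8, SEQ, FEAT_DIM),
                torch.randn(8, SEQ, ACT_DIM))
```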