2:20 PM - 2:40 PM
[4O3-OS-16e-02] Vision-Language-Conditioned Diffusion Policies for Robotic Control
[[Online]]
Keywords: Robotics, Diffusion Model
Achieving robots that can understand human language and autonomously determine their actions from it is a significant research challenge in robotics and machine learning. If robots can accurately grasp the intentions embedded in abstract human instructions and execute appropriate controls, both assistance to humans and task-execution efficiency are expected to improve greatly.
In this paper, we propose an imitation learning method for robot control that autonomously determines actions from human language instructions and goal images, named Vision-Language-conditioned Diffusion Policy (VLDP). Traditional language-based robot control methods have not fully modeled the inherent ambiguity and polysemy of human language. VLDP addresses this issue by extracting semantics from language instructions and goal images with a vision-language model and conditioning a Diffusion Policy on the resulting embeddings. This enables the robot to generate multiple valid actions in response to linguistically ambiguous instructions.
Experiments evaluate the success rate of action generation based on language instructions, the ability to adapt to unseen language instructions, and the multimodality of actions generated by the proposed method.
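The conditioning scheme described above, namely encoding a language instruction and goal image into an embedding and using it to guide iterative denoising of an action, can be illustrated with a minimal sketch. This is not the paper's implementation: the encoder and denoiser below are toy stand-ins, and all function names, dimensions, and the noise schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50                                 # number of denoising steps (assumed)
betas = np.linspace(1e-4, 0.02, T)     # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def encode_condition(instruction: str, goal_image: np.ndarray) -> np.ndarray:
    """Toy stand-in for the vision-language encoder: hashes a few words
    and pools a few image pixels into one conditioning embedding."""
    words = instruction.split()[:4]
    text_feat = np.array([hash(w) % 1000 / 1000.0 for w in words])
    text_feat = np.pad(text_feat, (0, 4 - len(text_feat)))
    img_feat = goal_image.reshape(-1)[:4]
    return np.concatenate([text_feat, img_feat])   # 8-dim embedding

def eps_model(x: np.ndarray, t: int, cond: np.ndarray) -> np.ndarray:
    """Toy stand-in for the learned denoiser: predicts the noise from the
    noisy action, the timestep, and the conditioning embedding."""
    w = np.sin(np.arange(x.size) + t * 0.1) + 0.01 * cond[: x.size].sum()
    return np.tanh(0.5 * x + w)

def sample_action(cond: np.ndarray, action_dim: int = 4) -> np.ndarray:
    """DDPM-style ancestral sampling: start from Gaussian noise and
    iteratively denoise, conditioned on the embedding at every step."""
    x = rng.standard_normal(action_dim)
    for t in reversed(range(T)):
        eps = eps_model(x, t, cond)
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])
        if t > 0:                       # inject noise except on the last step
            x = x + np.sqrt(betas[t]) * rng.standard_normal(action_dim)
    return x

cond = encode_condition("pick up the red block", np.ones((8, 8)))
action = sample_action(cond)
print(action.shape)  # (4,)
```

Because sampling starts from fresh noise each time, repeated calls with the same conditioning embedding yield different actions, which is the mechanism behind the multimodal action generation the abstract describes.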