Framework for Automatic Building Construction Based on Large Multimodal Models for Autonomous Agents

Soma Watanabe

9:00 AM - 9:20 AM

[4J1-GS-5-01] Framework for Automatic Building Construction Based on Large Multimodal Models for Autonomous Agents

〇Soma Watanabe¹, Shiyao Ding¹, Takayuki Ito¹ (1. Kyoto University)

Keywords: Autonomous Agent, Simulation, Multimodal Processing, Computer Vision

In recent years, Large Language Model (LLM) agents operating in virtual environments have continued to advance. However, in Minecraft building tasks, some agents rely on human feedback, which incurs substantial cost. This research proposes a framework that reproduces target buildings without human intervention by using Large Multimodal Models (LMMs) for automated feedback. In this framework, the agent inputs images of both the target building and the current building to the LMM, receives feedback, and iteratively modifies the current building. To improve the framework's performance, we tried the following enhancements: (1) part-by-part recognition, (2) reuse of past instructions, (3) introduction of intermediate steps, (4) multi-agent implementation, and (5) input from multiple viewpoints. Performance improved with methods (1), (2), (3), and (4), while challenges remained in implementing method (5).
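The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Block`, `Instruction`, `stub_feedback`, and `build_loop` are hypothetical, buildings are represented as sets of block coordinates instead of rendered images, and the LMM call is replaced by a deterministic stub that compares the two builds directly. Only the control flow (compare target and current build, receive one correction, apply it, repeat) reflects the abstract.

```python
from typing import Callable, Optional, Set, Tuple

# Hypothetical representations, chosen only for this sketch.
Block = Tuple[int, int, int]        # (x, y, z) position of one block
Instruction = Tuple[str, Block]     # ("place" or "remove", position)
Feedback = Callable[[Set[Block], Set[Block]], Optional[Instruction]]


def stub_feedback(target: Set[Block], current: Set[Block]) -> Optional[Instruction]:
    """Stand-in for the LMM: return one correction, or None if the builds match.

    A real system would send images of both buildings to a multimodal model
    and parse its textual reply; that step is assumed away here.
    """
    missing = sorted(target - current)
    if missing:
        return ("place", missing[0])
    extra = sorted(current - target)
    if extra:
        return ("remove", extra[0])
    return None


def build_loop(target: Set[Block], feedback: Feedback,
               max_rounds: int = 50) -> Tuple[Set[Block], int]:
    """Apply feedback until the build matches or the round budget runs out.

    Returns the final build and the number of corrections applied.
    """
    current: Set[Block] = set()
    for rounds in range(max_rounds):
        instruction = feedback(target, current)
        if instruction is None:          # no further corrections requested
            return current, rounds
        action, position = instruction
        if action == "place":
            current.add(position)
        else:
            current.discard(position)
    return current, max_rounds


if __name__ == "__main__":
    target = {(0, 0, 0), (1, 0, 0), (0, 1, 0)}
    built, rounds = build_loop(target, stub_feedback)
    print(built == target, rounds)
```

Because the feedback source is passed in as a function, the stub could in principle be swapped for a call to an actual multimodal model without changing the loop. The `max_rounds` budget is an assumption added so the loop terminates even if the feedback never converges.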

Presentation information

[4J1-GS-5] Agents:
