https://arxiv.org/pdf/2411.02359
https://github.com/yueyang130/DeeR-VLA.
Multimodal Large Language Models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data.
These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks.
However, developing MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
Meanwhile, MLLM inference requires storing billions of parameters and performing extensive computation, imposing significant hardware demands.
In our paper, we seek to address this challenge by leveraging an intriguing observation: relatively easy situations make up the bulk of the process of controlling robots to fulfill diverse tasks, and they generally require far smaller models to obtain the correct robotic actions.
Motivated by this observation, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand.
The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once an appropriate portion of the model has been activated for a specific situation, thus avoiding further redundant computation.
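The early-exit idea can be illustrated with a minimal sketch. Note that `exit_blocks`, `action_head`, and the feature-stabilization criterion below are hypothetical stand-ins, not the paper's actual architecture or learned termination criterion:

```python
import math

def early_exit_forward(x, exit_blocks, action_head, threshold):
    """Run exit blocks sequentially; stop once features stabilize.

    Hypothetical sketch: each element of `exit_blocks` is a group of
    backbone layers ending in an exit point, and `action_head` maps the
    exit features to a robotic action.
    """
    feat, prev, used = x, None, 0
    for block in exit_blocks:
        feat = block(feat)
        used += 1
        if prev is not None:
            # Relative change between consecutive exit features: a small
            # change suggests running a deeper sub-network would add little.
            num = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat, prev)))
            den = math.sqrt(sum(b * b for b in prev)) or 1.0
            if num / den < threshold:
                break  # easy situation: skip the remaining blocks
        prev = feat
    return action_head(feat), used  # action and number of blocks activated
```

In easy situations the relative change drops below the threshold after a few blocks and the remaining (most expensive) layers are never executed.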
Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage.
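As a rough illustration of calibrating a criterion against an average-compute budget, the grid search below picks the smallest (most compute-permissive) exit threshold whose expected cost on validation data stays within budget. The inputs `val_deltas` and `exit_flops` are hypothetical, and the paper's actual criterion-setting procedure differs:

```python
def calibrate_threshold(val_deltas, exit_flops, budget):
    """Smallest threshold whose average validation compute fits the budget.

    Hypothetical sketch: `val_deltas[t][i]` is the exit-criterion value at
    exit i for validation sample t; `exit_flops[i]` is the cumulative cost
    of running through exit i; `budget` bounds the average cost per sample.
    """
    candidates = sorted({d for sample in val_deltas for d in sample})
    for tau in candidates:  # ascending: smaller tau -> later exits -> more compute
        total = 0.0
        for sample in val_deltas:
            k = len(sample) - 1  # default: run through every exit
            for i, d in enumerate(sample):
                if d < tau:  # same exit rule applied at deployment
                    k = i
                    break
            total += exit_flops[k]
        if total / len(val_deltas) <= budget:
            return tau  # loosest compute use that still meets the constraint
    return None  # budget unreachable even when always exiting earliest
```

Peak-compute and memory constraints can be handled analogously by capping the deepest exit that is ever allowed to run, rather than averaging over samples.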
These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance.
Moreover, we design a tailored training method for integrating temporal information on top of such multi-exit architectures to predict actions reasonably.
On the CALVIN robot-manipulation benchmark, DeeR reduces the computational cost of the LLM by 5.2-6.5x and its GPU memory usage by 2-6x without compromising performance.
3.1 Multi-exit Architecture for Robot