Authors: Natasha Nejat, Aaron Nguyen, Aria Ramsinghani, and Arjun Ravichandran
Mentor: Kyle Fogarty. Kyle is a doctoral candidate in the Department of Engineering at the University of Cambridge with research specialization in computer graphics, computer vision, and machine learning.
Abstract
Vision-Language-Action (VLA) models represent a significant advancement in robotics, integrating visual perception, natural language understanding, and action execution to enable robots to perform complex tasks in unstructured environments. While current state-of-the-art (SOTA) robotics methods rely primarily on reinforcement learning (RL), which learns from trial-and-error interaction with an environment to maximize rewards, VLA models move beyond RL's limitations by allowing robots to generalize to new tasks through multimodal inputs. This paper reviews recent advances in VLA models, focusing on how these technologies converge to enable sophisticated autonomous robotic systems. It also considers the ethical implications associated with robotics, including consequences for labor markets and data privacy. Finally, the review highlights existing challenges and outlines potential directions for future research.
Introduction
Robotic and autonomous systems have dramatically transformed our world, taking on tasks that are often deemed too dangerous, tedious, or time-consuming for humans to perform efficiently. Their evolution is rooted in decades of technological advancements, beginning with early automation and culminating in today’s sophisticated AI-driven systems. These systems, which can now often operate at near human capabilities, have become integral to a wide range of sectors, including manufacturing (Yuan & Lu, 2023), space exploration (Pedersen et al., 2019), healthcare (Morgan et al., 2022), and agriculture (Cheng et al., 2023). Despite their impressive capabilities, robots still face significant challenges, particularly when it comes to adapting to unpredictable environments and executing tasks that require complex decision-making and task planning (Jain et al., 2016).
Historically, robots required explicit instructions for each individual movement and were unable to adapt to novel or unforeseen circumstances. However, advances in modern technologies, particularly in Artificial Intelligence (AI), have improved robots' abilities to recognize patterns, make predictions, and modify their actions based on real-time data, enabling them to operate in complex and dynamic environments with greater precision and autonomy. The difficulty of achieving robust robotic intelligence arises from several key challenges. One of the primary difficulties is enabling robots to operate effectively in unpredictable environments, such as self-driving cars navigating complex traffic conditions, robotic explorers on other planets, or underwater drones conducting deep-sea exploration. These scenarios require precise coordination of numerous joints and actuators, where the increasing degrees of freedom make real-time control computationally intensive. Quick decision-making is also essential, particularly in dynamic environments, and overloading sensors with excessive data or noise can further complicate processing, leading to incomplete or inaccurate information.
Vision-Language-Action (VLA) models potentially mark a shift in robotics research, which had previously relied on specialized models for specific tasks. In contrast, VLA models leverage "open vocabulary" foundation models, which are trained on vast amounts of internet data, allowing robots to understand diverse natural language commands, perceive complex environments, break down complex tasks via chain-of-thought reasoning, and operate in the wild. This general approach may enable robots to adapt to unstructured, real-world environments, offering far greater flexibility than specialized models.
This paper explores the frontiers of research in Vision-Language-Action (VLA) models. It is organised as follows: Sections 2 and 3 provide the necessary technical background to situate VLA models within the broader field of robotics and machine learning. Section 4 offers a review of recent literature, divided into four key categories: Task Understanding and Execution, Perception and Manipulation, Architecture and Integration, and Real-World Applications. Section 5 discusses key insights from the reviewed literature, and Section 6 presents the conclusions and potential directions for future research.
Background
Robotic Intelligence
Over the past 45 years, robotics research has focused on addressing the technical needs of applied robotics, driven largely by human necessities. In the 1960s, industrial robots were introduced to factories to free workers from dangerous tasks. As robotics expanded into various production processes, the demand for greater flexibility and intelligence in robots grew. Intelligence in robotics refers to the ability of a robot to perceive, reason, and make decisions in order to perform tasks autonomously and adaptively. Today, emerging markets and societal needs, such as cleaning, demining, construction, agriculture, and elderly care, are driving the development of field and service robots (Garcia et al., 2007).
Robots function through a combination of sensors, actuators, and control systems. Sensors gather information about the robot's environment or internal state, while actuators execute physical actions, such as movement or manipulation. The challenge of autonomous control lies in determining the best course of action based on these sensor inputs, which involves defining a policy — a set of actions for each possible state the robot may encounter.
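To make the notion of a policy concrete, the following minimal Python sketch (using hypothetical state and action labels, not drawn from any system cited here) illustrates a rule-based policy as an explicit state-to-action table; the limitations of this style are discussed below.

```python
# A minimal, hypothetical illustration of a policy: a mapping from
# observed states to actions. Real robotic policies are far richer,
# but the structural idea is the same.

def rule_based_policy(state: str) -> str:
    """Return an action for each anticipated state (hypothetical labels)."""
    table = {
        "obstacle_ahead": "turn_left",
        "path_clear": "move_forward",
        "goal_reached": "stop",
    }
    # Rule-based systems have no principled fallback for states
    # outside the table; here we simply halt.
    return table.get(state, "halt_and_wait")

if __name__ == "__main__":
    for s in ["path_clear", "obstacle_ahead", "unknown_terrain"]:
        print(s, "->", rule_based_policy(s))
```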
Robotic systems have evolved significantly since their inception, transitioning from simple, rule-based frameworks to sophisticated, learning-based models. This evolution reflects the growing need for robots to perform increasingly complex tasks in ever-changing environments. Traditional rule-based systems, while effective for well-defined and predictable tasks, have proven inadequate for scenarios requiring adaptability and nuanced decision-making (Leusmann et al., 2024). The adoption of learning-based approaches, especially those using machine learning (ML) and artificial intelligence (AI), marks a significant advancement in robotics, but also presents complex challenges such as job displacement, misuse of technology, unequal access, and potentially life-threatening malfunctions.
Early robotic systems were predominantly rule-based, relying on explicit programming to perform tasks. The primary advantage of rule-based systems lies in their simplicity and predictability. Because every possible state and corresponding action is predefined, these robots can execute tasks reliably within their designated operational parameters (Hayes-Roth, 1985). However, the rigidity of this approach is also its most significant limitation. Rule-based systems struggle to cope with environments that are unpredictable or with tasks that cannot be easily codified into discrete rules. When faced with novel situations, these robots typically fail, as they lack the capability to generalize from past experiences or adapt to new conditions (Kwasny & Faisal, 1990).
Learning-based systems offer several advantages over their rule-based predecessors. First, they can generalize from past experiences, allowing them to tackle novel situations with a degree of flexibility that rule-based systems cannot achieve. Second, these systems can continuously improve their performance through iterative learning processes, making them well-suited for tasks where optimal solutions are not readily apparent. For instance, in robotic path planning, a learning-based system can explore and adapt to different terrains, optimizing its route over time rather than following a rigidly predefined path.
Neural Networks
Neural networks, particularly deep learning models, have become pivotal in advancing robotic capabilities. In general, neural networks are function approximators that learn to map input data (text, images, audio, etc.) to some form of output (actions, in the case of robotics). They are composed of connected layers of nodes (loosely analogous to biological neurons) that transform inputs through weights learnt from training data (LeCun et al., 2015). Here, we focus on several key applications of neural networks in robotics.
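As a concrete illustration of this idea, the following NumPy sketch shows a small two-layer network acting as a function approximator; the layer sizes and random inputs are illustrative assumptions rather than details of any model discussed in this paper.

```python
# A minimal sketch of a neural network as a function approximator.
# All sizes and data here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Two-layer network: maps a 4-dimensional "sensor reading" to a
# 2-dimensional "action" vector via weight matrices that would
# normally be learnt from training data.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

def forward(x):
    hidden = relu(x @ W1 + b1)   # weighted sum followed by nonlinearity
    return hidden @ W2 + b2      # output layer (e.g. action parameters)

sensor_reading = rng.normal(size=4)
print(forward(sensor_reading))
```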
For perception, neural networks, especially Convolutional Neural Networks (CNNs), enable robots to interpret visual data, such as recognizing objects and understanding scenes. These capabilities are critical in applications like self-driving vehicles, where CNNs process camera images to detect pedestrians, road signs, and other vehicles (Krizhevsky et al., 2012).
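The following minimal NumPy sketch illustrates the 2D convolution operation at the heart of a CNN; the hand-specified edge-detection kernel is an illustrative assumption, whereas real CNN kernels are learnt from data.

```python
# A minimal NumPy sketch of 2D convolution, the core operation of a CNN.
# The kernel is a crude, hand-chosen edge filter for illustration only.
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output cell is the sum of an image patch weighted by the kernel.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.default_rng(0).random((8, 8))   # stand-in for a camera frame
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)    # crude vertical-edge filter
print(conv2d(image, edge_kernel).shape)           # (6, 6) feature map
```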
In control, neural networks are combined with Reinforcement Learning (RL) to optimize robotic movement and decision-making. RL enables robots to learn by interacting with their environment and receiving feedback in the form of rewards or penalties. Instead of following predefined instructions, the robot explores different actions, observes their outcomes, and adjusts its behavior to maximize cumulative rewards. Over time, the robot develops an optimal policy—a strategy for selecting actions that lead to the best possible outcomes. Neural networks are often used in RL to approximate complex relationships between the robot's sensory inputs and its actions, allowing the system to handle diverse and unpredictable situations with improved performance. Deep Q-Networks (DQN), for instance, have been used to train robots in complex tasks like manipulation and navigation (Mnih et al., 2015). More recently, algorithms like Proximal Policy Optimization (PPO) have further refined robotic control by improving stability and efficiency in learning from interactions with the environment (Schulman et al., 2017).
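To illustrate the reward-driven update that underlies these methods, the sketch below implements tabular Q-learning on a made-up two-state environment; it is intentionally simpler than DQN or PPO, which replace the table with a neural network and use more sophisticated update rules.

```python
# Toy tabular Q-learning on a hypothetical two-state environment.
# This illustrates the reward-driven update at the heart of RL; it is
# not DQN or PPO, and the environment is not from any cited benchmark.
import random

states, actions = [0, 1], [0, 1]
Q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration

def step(state, action):
    # Hypothetical dynamics: action 1 taken in state 1 yields a reward.
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    next_state = 1 if action == 1 else 0
    return next_state, reward

state = 0
for _ in range(500):
    # Epsilon-greedy exploration: mostly exploit, occasionally explore.
    if random.random() < epsilon:
        action = random.choice(actions)
    else:
        action = max(actions, key=lambda a: Q[(state, a)])
    next_state, reward = step(state, action)
    # Temporal-difference update toward reward + discounted future value.
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

print(Q)
```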
Finally, the Transformer architecture, first introduced by Vaswani et al. (2017), has become the foundational model for processing sequential data, particularly text. Transformers are the underlying technology behind many recent chatbots such as GPT-4 (Achiam et al., 2023). By utilizing self-attention mechanisms, Transformers capture global dependencies within input sequences while enabling more efficient parallelization. This scalability and effectiveness have driven their widespread adoption across natural language processing tasks, and the architecture has been adapted for modalities beyond text, including vision. Because Transformers can handle multiple modalities, they have also become the model of choice for jointly modelling vision and language in Vision-Language Models (VLMs).
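The following NumPy sketch shows the scaled dot-product self-attention computation described by Vaswani et al. (2017); the sequence length, dimensionality, and random weights are arbitrary illustrative choices.

```python
# A minimal NumPy sketch of scaled dot-product self-attention.
# Dimensions and weights are illustrative, not from any cited model.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise token similarities
    weights = softmax(scores, axis=-1)        # each token attends over all tokens
    return weights @ V                        # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
X = rng.normal(size=(seq_len, d_model))       # e.g. embedded input tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (5, 16)
```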
Vision-Language Models
Vision-Language Models (VLMs) are a class of machine learning models designed to process both visual and language data, enabling jointly learnt representations. Integrating vision, language, and action models allows for the transfer of web-scale knowledge to robotic control, enabling robots to perform complex tasks based on visual and language inputs (Brohan et al., 2023). This integration mirrors how humans process information, using language to describe what they see and to guide their actions. By aligning visual inputs with language descriptions, VLMs allow robots to perform tasks such as object recognition, spatial reasoning, and task planning.
Vision-language models have become a powerful tool for addressing robotics challenges, especially in scenarios that demand zero-shot generalization (Wang et al., 2024). These models excel at interpreting visual data and correlating it with language-based instructions, enabling robots to perform tasks that require both comprehension and action.
The development of vision-language models has progressed from simple image description tasks to complex systems capable of multimodal reasoning and action. Early models focused on generating textual descriptions from visual data, but these often fell short in contextual understanding, limiting their practical applications. The challenge of interpreting instructions for everyday tasks has driven the evolution of vision-language models, leading to benchmarks like ALFRED, which assess a model's ability to understand and act on multimodal instructions (Shridhar et al., 2020).
A core concept in advancing vision-language models for robotics is chain-of-thought reasoning. This method involves guiding the model to decompose tasks into sequential reasoning steps, thereby improving its ability to handle complex operations. In the context of robotic control, chain-of-thought reasoning facilitates embodied reasoning, where the robot systematically reasons through a series of actions by integrating both visual and language inputs (Zawalski et al., 2024). This approach enhances the model’s capacity to generalize across varying task complexities. For example, in autonomous navigation, a vision-language-action model could employ chain-of-thought prompting to first identify key objects in its environment, plan a safe route around obstacles, and execute navigation.
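As a hedged illustration of how such prompting might be structured for an embodied task, the sketch below builds a chain-of-thought prompt and extracts a low-level action; the `query_vlm` stub is a hypothetical placeholder rather than a real model API, and the prompt wording and action set are assumptions for illustration.

```python
# A sketch of chain-of-thought prompting for an embodied task.
# `query_vlm` is a hypothetical placeholder: a real system would send the
# prompt (and camera image) to a vision-language model backend.

COT_TEMPLATE = """You control a mobile robot. The camera shows: {scene}.
Instruction: {instruction}
Think step by step:
1. List the objects relevant to the instruction.
2. Plan a safe route or action sequence around obstacles.
3. Output the next low-level action as one of: move_forward, turn_left, turn_right, stop.
"""

def query_vlm(prompt: str) -> str:
    # Placeholder response for illustration; no specific VLM API is implied.
    return "1. door, chair  2. go around the chair, then to the door  3. turn_left"

def next_action(scene_description: str, instruction: str) -> str:
    prompt = COT_TEMPLATE.format(scene=scene_description, instruction=instruction)
    reasoning = query_vlm(prompt)
    # The final token of the structured response is taken as the action.
    return reasoning.split()[-1]

print(next_action("a chair blocking a doorway", "go to the door"))
```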
Literature Review
This literature review explores the recent advancements in VLMs and their applications in robotics, focusing on four key areas: task understanding and execution, perception and manipulation, architecture and integration, and evaluation and real-world applications.
Task Understanding and Execution
Task understanding in robotics encompasses a model's capacity to perceive its environment, interpret instructions, and formulate action plans to achieve specific objectives. This ability is paramount for autonomous systems to function effectively in dynamic, real-world environments, particularly for long-horizon tasks.
Most robotic planning systems are designed to operate in controlled environments where they possess complete knowledge of their surroundings. However, in real-world applications, robots frequently encounter unexpected situations that these systems are not equipped to handle (Haslum et al., 2019). Zhang et al. (2023) explore the potential of enhancing conventional task planning systems through the integration of a large language model. In their study, the authors propose a novel framework, COWP, aimed at open-world task planning and situation handling. This system improves the robot’s understanding of actions by incorporating contextually relevant information from the language model. Compared to other baseline approaches (Huang et al., 2022; Jiang et al., 2019; Singh et al., 2023), COWP achieved higher success rates in service tasks, demonstrating the effectiveness of language models in dynamic and unpredictable environments.
A related concept is operating a robot in the zero-shot setting, where the robot functions without explicit training for a specific task or scene. Recent research in robotic planning and manipulation has explored few-shot and zero-shot methods, as evidenced by studies from Huang et al. (2022a, 2022b, 2022c) and Liang et al. (2023). These approaches aim to handle unfamiliar situations without prior training, primarily focusing on high-level planning. However, their adaptability in complex or dynamic environments is often limited by their reliance on predefined programs or external control modules. One recent work (Wang et al., 2024) partly addressed these limitations by demonstrating that a single off-the-shelf VLM can autonomously manage all aspects of a robotics task, from high-level planning to low-level location extraction and action execution. This approach reduces dependence on predefined programs or external modules, offering greater flexibility and adaptability in dynamic and complex environments.
Perception and Manipulation
Perception and manipulation are fundamental to robots being able to interact with their environment. Recent developments in VLAs, and their underlying components, have enhanced robots' capabilities in object recognition and spatial reasoning, which are fundamental to autonomous decision-making.
Manipulation planning leverages visual input to generate trajectories for moving objects between locations, a process that is particularly challenging due to the need for complex spatial reasoning, real-time environmental awareness, and precise motor control. Successfully navigating objects through cluttered and dynamic environments requires careful integration of these elements. An early effort to unify vision and language for such tasks is presented in the Gato model (Reed et al., 2022). Gato combines target objectives, visual data, and contextual information to plan and execute tasks like assembling components or reorganizing objects in complex settings. Although Gato demonstrated potential in robotic control, its high computational demands posed challenges for real-time operation. Additionally, while vision and language data at web scale were readily available, the authors pointed out that "a web-scale dataset for control tasks is not currently available." As a result, they had to rely on data from simulated training of reinforcement learning (RL) agents for many control tasks.
In more recent work, RT-2 (Robotics Transformer 2) built upon these ideas by directly incorporating large vision-language models into end-to-end robotic control, leveraging the rich semantic knowledge from internet-scale pretraining to enhance generalization and emergent capabilities in robotic tasks (Brohan et al., 2023). While this closer integration of vision and language relaxed the need for reinforcement learning, the system instead relied on behavior cloning from expert demonstrations. RT-2 still faced significant challenges with respect to real-time control. Despite these computational hurdles, RT-2 demonstrated impressive generalization to novel objects, environments, and instructions, showcasing the potential of transferring knowledge from large-scale vision-language models to robotic control tasks.
Robot manipulation tasks can span a wide range of required skills and task specifications, including instruction following, one-shot imitation, and object rearrangement. Among these, grasping is particularly complex, as it involves both meticulous planning and precise execution. Recent advancements, such as the VIMA (VisuoMotor Attention) model, have shown that grasping can be framed as a multimodal problem under the VLM framework, where visual and text prompts are integrated to generate effective grasping strategies (Jiang et al., 2022). Experimental results presented in the work suggest that VIMA's ability to process both visual and language inputs simultaneously allows for more accurate and stable execution of grasping tasks, with strong scalability and generalization capabilities.
Architecture and Integration
The field has seen rapid advancements through various strategies aimed at optimizing VLM performance in robotic applications. Combining different models, using multiple forms of data, and pretraining models on large datasets all act to improve vision-language models in the context of robotics (Lyu et al., 2023; Zhou et al., 2024).
Pretraining models on large datasets has proven to be a powerful strategy for improving VLM performance in robotics. The Physically Grounded model (Gao et al., 2023) incorporates an understanding of the physical properties of an environment, significantly enhancing reasoning and planning capabilities in robotic systems. Similarly, the Robot Vision model (Li, Li, & Han, 2019) leverages multiple neural networks to jointly improve object recognition and environmental adaptation, showcasing how pretraining on diverse datasets can lead to more robust and versatile VLMs.
Data augmentation techniques have also proven to be valuable in enhancing the performance of VLA models in robotics. However, unlike in vision or text, where basic augmentations are often straightforward to apply, augmenting robotics training data presents unique challenges due to the more complex and dynamic nature of the variables involved. In robotics, training data typically includes interactions with physical environments, which are influenced by factors such as real-world physics, object manipulation, sensor feedback, and environmental variability. To address these complexities, Tan et al. (2019) introduced the concept of environment dropout, which modifies known training environments to resemble unseen ones, alongside an instruction generator that produces corresponding pseudo-instructions. Building on this approach, Parvaneh et al. (2020) proposed a model for generating counterfactuals, effectively simulating unseen environments based on observed ones. In contrast, Li et al. (2022) utilized Generative Adversarial Networks to generate new environmental styles and transfer them to the training environments, thereby increasing the diversity of training environments. These approaches underscore the difficulty in collecting and simulating accurate robotics training data for VLA models, and the creative approaches required to overcome this.
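As an illustration of the intuition behind environment dropout (not the actual implementation of Tan et al., 2019), the sketch below randomly masks features of an observed environment so that the training environment partially resembles an unseen one; the feature shapes and drop probability are assumptions made purely for illustration.

```python
# Illustrative sketch of the environment-dropout intuition: randomly mask
# parts of an observed environment's features so the training environment
# looks partially "unseen". Not the actual method of Tan et al. (2019).
import numpy as np

rng = np.random.default_rng(0)

def environment_dropout(env_features: np.ndarray, drop_prob: float = 0.3) -> np.ndarray:
    """Zero out a random subset of environment feature dimensions."""
    mask = rng.random(env_features.shape) > drop_prob
    return env_features * mask

observed_env = rng.normal(size=(10, 64))   # e.g. 10 viewpoints x 64 features
augmented_env = environment_dropout(observed_env)
print(np.count_nonzero(augmented_env == 0.0), "features dropped")
```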
Evaluation and Real-World Applications
Simulators play a crucial role in benchmarking robotic systems due to their ability to consistently reproduce experimental setups. This consistency allows for reliable comparison between different models, which is often infeasible in real-world settings due to environmental variability. As a result, the majority of benchmarks in robotics are simulator-based, providing a standardized platform for evaluating and comparing VLA models (Song et al., 2024).
Within VLM-based research, Embodied Question Answering (EQA) has emerged as a significant area of focus for benchmarking, in which an AI agent navigates and observes a 3D environment to answer questions about it, combining visual perception, language understanding, and decision-making. Pioneering works like EmbodiedQA (Das et al., 2018) and IQUAD (Gordon et al., 2018) laid the foundation, which has since expanded to include more specialized benchmarks. For instance, MT-EQA (Yu et al., 2019) focuses on questions involving multiple targets, while MP3D-EQA (Wijmans et al., 2019) tests 3D perception capabilities using point clouds. EgoVQA (Fan, 2019) and EgoTaskQA (Jia et al., 2022) shift the focus to egocentric perspectives and complex reasoning tasks, respectively.
The HomeRobot OVMM Benchmark (Yenamandra et al., 2023) represents a novel approach in robotic evaluation by proposing a sim-to-real benchmark. This initiative aims to bridge the gap between simulated environments and real-world scenarios, potentially offering more realistic assessments of robotic performance. However, the consistency and effectiveness of this benchmark are still under observation, highlighting the ongoing challenges in developing reliable sim-to-real evaluation methods (Song et al., 2024).
Although VLA models have shown remarkable progress in robotics, especially in navigating complex environments, their real-world application remains relatively limited. For instance, recent models like RT-2 (Brohan et al., 2023) have been deployed outside simulation environments, but these deployments are typically confined to laboratory settings and the models function without real-time capabilities.
Discussion
Compare and Contrast
The integration of Vision-Language Models (VLMs) has significantly advanced robotics in both low-level perception and manipulation, as well as high-level task understanding. Perception and manipulation address the challenges of object recognition and physical interaction, while task understanding focuses on interpreting instructions and planning in dynamic environments. VLMs enhance both areas in distinct ways: they provide contextual knowledge and adaptability for task comprehension, and improve object recognition and spatial reasoning for perception and manipulation. This dual application demonstrates VLMs' versatility in bridging the gap between conceptual understanding and practical interaction within robotic systems.
The integration of Vision-Language Models (VLMs) into robotic systems has been implemented in different ways, each with its own pros and cons. Some methods, like Zhang et al.'s (2023) COWP, use VLMs to improve existing planning systems, combining the strengths of traditional robotics with new AI models. Others, such as Wang et al. (2024), aim to use a single VLM to handle all parts of a robotics task, from high-level planning to detailed execution. This divide highlights an important debate in the field: whether to enhance specialized systems or to develop more unified, flexible approaches that may offer greater adaptability but at the expense of specialized performance.
Finally, the evaluation and real-world use of Vision-Language Models (VLMs) in robotics reveal another area of contrast. Most assessments are done in simulated environments, which offer consistency and reproducibility. However, the real-world application of these models is still limited, often restricted to controlled lab settings. Efforts like the HomeRobot OVMM Benchmark (Yenamandra et al., 2022) aim to narrow the gap between simulation and real-world use, but significant challenges remain in achieving comparable performance in real environments. This difference emphasizes the need for more robust, transferable models and evaluation methods that can accurately predict and assess real-world effectiveness.
Ethical and Societal Implications
The integration of visual language models (VLMs) in robotics introduces significant ethical considerations. While these models offer substantial advantages, they also pose serious risks that must be addressed. Key concerns include bias in training data, potential misuse, and the broader impact on employment and societal structures. To ensure these technologies are deployed responsibly and for the benefit of society as a whole, these issues require careful examination and regulation.
Bias in Training Data
One major problem is that VLMs can learn from biased data. If the data used to train these models reflects societal prejudices, the robots might also show these biases. For example, if a VLM is trained on images that mostly show certain groups of people or reinforce stereotypes, the robots might act with bias. Image processing models can be affected by biased training data, causing their results to be skewed or leading to unfair outcomes, such as biased customer service interactions or discriminatory surveillance practices. Addressing these issues requires identifying and mitigating bias during the model development process to ensure that robotic systems act impartially and equitably.
Potential for Misuse
Another significant ethical concern is the potential misuse of VLMs in harmful applications. These models, with their ability to process and respond to visual data, could be leveraged in dangerous ways, such as in weaponized drones or invasive surveillance systems. For example, many of the VLA models could be repurposed to track individuals or guide autonomous weapons, raising severe ethical and security risks. To prevent misuse, there must be strong ethical guidelines and regulations to control how these technologies are used.
Impact on Jobs and Society
As robots continue to advance in their capacity to perform tasks traditionally carried out by humans, they have the potential to replace workers in multiple industries. For example, robots could assume responsibilities in manufacturing, administrative functions, and even domestic work. Many VLA models are already equipped to manage complex tasks that would otherwise demand human labor. Without careful consideration, this technological shift may lead to widespread job displacement and exacerbate economic inequality.
Scope for Future Work - Vision and Language Models for Robots
A major challenge in using Vision-Language Models (VLMs) in robotics is managing the trade-off between computational complexity and the need for real-time performance. Models like Gato (Reed et al., 2022) and RT-2 (Brohan et al., 2023) show impressive abilities, but they face difficulties with real-time execution because of their high computational demands. This tension between the deep, contextual understanding offered by large models and the quick response required in robotics remains a central research issue. Future progress should likely focus on improving these models' speed without compromising their advanced reasoning skills.
As VLA models in robotics continue to evolve, ensuring safety becomes increasingly critical due to robots' direct interaction with the physical environment. The growing complexity of models like RT-2 (Brohan et al., 2023) introduces challenges in both decision-making transparency and real-time control, highlighting the urgent need for robust safety protocols and risk management strategies. However, as the ongoing evaluation of sim-to-real benchmarks such as HomeRobot OVMM (Yenamandra et al., 2023; Song et al., 2024) illustrates, developing reliable safety assessment methods for real-world robotic applications remains a significant challenge. Future work is needed to address these safety concerns as VLA models transition from the laboratory to the real world.
Conclusion
Vision-language-action models represent a significant leap in robotics, combining visual perception, natural language understanding, and action execution to create more versatile and autonomous systems. This review has explored the evolution of these models, highlighting their advancements in task understanding, perception, manipulation, and architectural integration. By incorporating vision and language, robots are now better equipped to process and respond to multimodal inputs, enhancing their ability to generalize across various tasks and environments. However, challenges remain, particularly in achieving real-time control and addressing complex spatial reasoning in real-world settings. Additionally, the ethical considerations of deploying these technologies, such as their impact on labor markets and data privacy, demand thorough scrutiny. Looking ahead, the future of VLA models lies in refining their ability to interact with the physical world, increasing their resilience in unpredictable conditions, and ensuring they adhere to ethical guidelines.
References
Achiam, J., Adler, S., Ahmad, L., Aleman, F. L., Altenschmidt, J., Altman, S., Balaji, S., Bao, H., Bernadett-Shapiro, G., Bogdonoff, L., Campbell, R., Cann, A., Carlson, C., Chen, D., Chen, S., Chen, J., Cho, C., Chu, C., Chung, H. W., … Zheng, T. (2023). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., Florence, P., Fu, C., Arenas, M. G., Gopalakrishnan, K., Han, K., Hausman, K., Herzog, A., Hsu, J., Ichter, B., … Zitkovich, B. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv preprint arXiv:2307.15818.
Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., & Batra, D. (2018). Embodied Question Answering. IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-10).
Fan, C. (2019). EgoVQA - An Egocentric Video Question Answering Benchmark Dataset. IEEE/CVF International Conference on Computer Vision Workshop (pp. 4359-4366).
Gao, J., Sarkar, B., Xia, F., Xiao, T., Wu, J., Ichter, B., Majumdar, A., & Sadigh, D. (2023). Physically Grounded Vision-Language Models for Robotic Manipulation. IEEE International Conference on Robotics and Automation (pp. 12462-12469).
Garcia, E., Jimenez, M. A., De Santos, P. G., & Armada, M. (2007). The Evolution of Robotics Research. IEEE Robotics & Automation Magazine, 14(1), 90-103.
Gordon, D., Kembhavi, A., Rastegari, M., Redmon, J., Fox, D., & Farhadi, A. (2018). IQA: Visual Question Answering in Interactive Environments. IEEE Conference on Computer Vision and Pattern Recognition (pp. 4089-4098).
Haslum, P., Lipovetzky, N., Magazzeni, D., Muise, C., Brachman, R., Rossi, F., & Stone, P. (2019). An Introduction to the Planning Domain Definition Language (Vol. 13). Morgan & Claypool.
Hayes-Roth, F. (1985). Rule-Based Systems. Communications of the ACM, 28(9), (pp. 921-932).
Huang, W., Abbeel, P., Pathak, D., & Mordatch, I. (2022). Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. International Conference on Machine Learning (pp. 9118-9147).
Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., & Hausman, K. (2023). Inner Monologue: Embodied Reasoning through Planning with Language Models. Conference on Robot Learning (pp. 1769-1782).
Jia, B., Lei, T., Zhu, S. C., & Huang, S. (2022). Egotaskqa: Understanding Human Tasks in Egocentric Videos. Advances in Neural Information Processing Systems, 35, (pp. 3343-3360).
Jiang, Y., Gupta, A., Zhang, Z., Wang, G., Dou, Y., Chen, Y., & Fan, L. (2022). VIMA: General Robot Manipulation with Multimodal Prompts. International Conference on Machine Learning, 40, (pp. 14975-15022).
Jiang, Y., Walker, N., Hart, J., & Stone, P. (2019). Open-World Reasoning for Service Robots. International Conference on Automated Planning and Scheduling (Vol. 29, pp. 725-733).
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet Classification with Deep Convolutional Neural Networks. Advances in neural information processing systems, 25 (pp. 84-90).
Kwasny, S. C., & Faisal, K. A. (1990). Overcoming Limitations of Rule-Based Systems: An Example of a Hybrid Deterministic Parser. Konnektionismus in Artificial Intelligence und Kognitionsforschung: 6. Österreichische Artificial-Intelligence-Tagung (KONNAI) Salzburg, Österreich, 18.–21. September 1990 Proceedings (pp. 48-57).
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep Learning. Nature, 521(7553), (pp. 436-444).
Leusmann, J., Wang, C., & Mayer, S. (2024). Comparing Rule-Based and LLM-Based Methods to Enable Active Robot Assistant Conversations. Workshop@CHI 2024: Building Trust in CUIs – From Design to Deployment.
Li, J., Tan, H., & Bansal, M. (2022). CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations. Findings of the Association for Computational Linguistics: NAACL 2022 (pp. 633-649).
Li, H., Li, J., & Han, X. (2019). Robot Vision Model Based on Multi-Neural Network Fusion. IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (pp. 2571-2577).
Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., & Zeng, A. (2023). Code as Policies: Language Model Programs for Embodied Control. IEEE International Conference on Robotics and Automation (pp. 9493-9500).
Lyu, C., Wu, M., Wang, L., Huang, X., Liu, B., Du, Z., & Tu, Z. (2023). Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration. arXiv preprint arXiv:2306.09093.
Ma, Y., Song, Z., Zhuang, Y., Hao, J., & King, I. (2024). A Survey on Vision-Language-Action Models for Embodied AI. arXiv preprint arXiv:2405.14093.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human Level Control through Deep Reinforcement Learning. Nature. 518(7540), (pp. 529-533).
Parvaneh, A., Abbasnejad, E., Teney, D., Shi, J. Q., & Van den Hengel, A. (2020). Counterfactual Vision-and-Language Navigation: Unravelling the Unseen. Advances in Neural Information Processing Systems, 33, (pp. 5296-5307).
Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Giménez, M., Sulsky, Y., Kay, J., Springenberg, J. T., Eccles, T., Valkov, L., Quillen, D., Hofer, L., Brandao, M., Hessel, M., & Küttler, H. (2022). A Generalist Agent. Transactions on Machine Learning Research. arXiv preprint arXiv:2205.06175.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347.
Shridhar, M., Thomason, J., Gordon, D., Bisk, Y., Han, W., Mottaghi, R., & Fox, D. (2020). Alfred: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10740-10749).
Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., & Garg, A. (2023). Progprompt: Generating Situated Robot Task Plans Using Large Language Models. IEEE International Conference on Robotics and Automation (pp. 11523-11530).
Tan, H., Yu, L., & Bansal, M. (2019). Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (pp. 2610-2621).
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems, 30, (pp. 6000-6010).
Wang, Z., Shen, R., & Stadie, B. (2024). Solving Robotics Problems in Zero-Shot with Vision-Language Models. arXiv preprint arXiv:2407.19094.
Wijmans, E., Datta, S., Maksymets, O., Das, A., Gkioxari, G., Lee, S., & Batra, D. (2019). Embodied Question Answering in Photorealistic Environments with Point Cloud Perception. IEEE/CVF Conference on Computer Vision and Pattern Recognition, (pp. 6659-6668).
Yenamandra, S., Ramachandran, A., Yadav, K., Wang, A. S., Khanna, M., Gervet, T., & Paxton, C. (2023). HomeRobot: Open-Vocabulary Mobile Manipulation. Conference on Robot Learning (pp. 1975-2011).
Yu, L., Chen, X., Gkioxari, G., Bansal, M., Berg, T. L., & Batra, D. (2019). Multi-Target Embodied Question Answering. IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 6309-6318).
Zawalski, M., Chen, W., Pertsch, K., Mees, O., Finn, C., & Levine, S. (2024). Robotic Control via Embodied Chain-of-Thought Reasoning. 8th Annual Conference on Robot Learning.
Zhang, X., Ding, Y., Amiri, S., Yang, H., Kaminski, A., Esselink, C., & Zhang, S. (2023). Grounding Classical Task Planners via Vision-Language Models. arXiv preprint arXiv:2304.08587.
Zhou, G., Hong, Y., Wang, Z., Wang, X. E., & Wu, Q. (2024). NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models. European Conference on Computer Vision (pp. 260-278).