
Robots learn without code, just by “surfing” on the Internet

Google has announced a new artificial intelligence system that could significantly help train robots to understand everyday tasks, such as garbage disposal. Dubbed Robotic Transformer 2 (RT-2), the model is trained on information and images from the internet, which can be “translated” into actions for robots, according to the tech giant.

While tasks like collecting garbage are relatively simple for humans, a robot has to work through a series of steps to carry them out, including recognizing which objects are trash, picking them up, and disposing of them.

Rather than programming a robot to perform these specific tasks, RT-2 allows the robot to “extract” knowledge from the internet to understand how to complete the job, even if it hasn’t been explicitly trained in the precise steps.

According to Google, the new model nearly doubles the performance of robots compared with its predecessor, RT-1.

As reported by The New York Times, the company is not directly targeting widespread adoption or sales of robots with this new technology. However, these developments could be used in warehouses or even as household assistants.

A visual-language model (VLM) pre-trained on web-scale data is learning from RT-1 robotics data to become RT-2, a visual-language-action (VLA) model that can control a robot.

Robots: What we know about RT-2

Robotic Transformer 2 (RT-2) represents an innovative vision-language-action (VLA) model that leverages data from the web and robotics to generate generalized instructions for controlling robots.

High-capacity vision-language models (VLMs) are typically trained on extensive web datasets, which makes them remarkably good at recognizing visual and language patterns and at working across different languages. For robots to reach a similar level of competence, however, they would need to collect firsthand robot data covering a wide range of objects, environments, tasks, and scenarios.

In an online announcement, Google introduced Robotic Transformer 2 (RT-2), a groundbreaking VLA model that learns from both web and robotics data and converts this knowledge into universal directives for robot control, all while preserving its web-scale capabilities.

“This work builds upon Robotic Transformer 1 (RT-1), a model trained on multi-task demonstrations, which can learn combinations of tasks and objects seen in the robotic data. More specifically, our work used RT-1 robot demonstration data that was collected with 13 robots over 17 months in an office kitchen environment,” Google pointed out.

“RT-2 shows improved generalization capabilities and semantic and visual understanding beyond the robotic data it was exposed to. This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions.”

RT-2 architecture and training: We co-fine-tune a pre-trained VLM model on robotics and web data. The resulting model takes in robot camera images and directly predicts actions for a robot to perform.
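
One way to picture what “directly predicts actions” could mean is to treat each robot action as a short string of integers that the model emits like ordinary text. The sketch below is a minimal illustration under that assumption, not Google's implementation. The eight-field layout (a terminate flag, three position deltas, three rotation deltas, and a gripper value) and the 256-bin discretization follow the scheme described in the RT-2 paper as far as we understand it; the value ranges, the `RobotAction` class, and the function names are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Tuple

NUM_BINS = 256  # assumption: each continuous dimension is split into 256 uniform bins

# Hypothetical value ranges for each continuous dimension (illustrative only).
POS_RANGE = (-0.1, 0.1)     # end-effector translation per step
ROT_RANGE = (-0.5, 0.5)     # end-effector rotation per step
GRIPPER_RANGE = (0.0, 1.0)  # gripper extension


@dataclass
class RobotAction:
    terminate: bool
    delta_pos: Tuple[float, float, float]
    delta_rot: Tuple[float, float, float]
    gripper: float


def to_bin(value: float, low: float, high: float) -> int:
    """Map a continuous value to an integer bin in [0, NUM_BINS - 1]."""
    clipped = min(max(value, low), high)
    return round((clipped - low) / (high - low) * (NUM_BINS - 1))


def from_bin(index: int, low: float, high: float) -> float:
    """Map a bin index back to the continuous value it stands for."""
    return low + index / (NUM_BINS - 1) * (high - low)


def encode_action(action: RobotAction) -> str:
    """Turn an action into a space-separated string of eight integers."""
    tokens = [int(action.terminate)]
    tokens += [to_bin(v, *POS_RANGE) for v in action.delta_pos]
    tokens += [to_bin(v, *ROT_RANGE) for v in action.delta_rot]
    tokens.append(to_bin(action.gripper, *GRIPPER_RANGE))
    return " ".join(str(t) for t in tokens)


def decode_action(text: str) -> RobotAction:
    """Parse a string of eight integers (e.g. model output) into an action."""
    tokens = [int(t) for t in text.split()]
    if len(tokens) != 8:
        raise ValueError(f"expected 8 action tokens, got {len(tokens)}")
    if tokens[0] not in (0, 1) or any(not 0 <= t < NUM_BINS for t in tokens[1:]):
        raise ValueError("action token out of range")
    return RobotAction(
        terminate=bool(tokens[0]),
        delta_pos=tuple(from_bin(t, *POS_RANGE) for t in tokens[1:4]),
        delta_rot=tuple(from_bin(t, *ROT_RANGE) for t in tokens[4:7]),
        gripper=from_bin(tokens[7], *GRIPPER_RANGE),
    )


if __name__ == "__main__":
    # A made-up model output: do not terminate, move and rotate slightly, open gripper.
    action = decode_action("0 128 91 241 5 101 127 217")
    print(action)
    print(encode_action(action))
```

Because actions are plain strings in this picture, the same model that answers questions about web images can, in principle, be fine-tuned to answer “what should the robot do next?” in the same output format.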

The tech giant ran a series of qualitative and quantitative experiments on its RT-2 models, spanning more than 6,000 robotic trials.

According to Google, “RT-2 shows that vision-language models (VLMs) can be transformed into powerful vision-language-action (VLA) models, which can directly control a robot by combining VLM pre-training with robotic data. With two instantiations of VLAs based on PaLM-E and PaLI-X, RT-2 results in highly-improved robotic policies, and, more importantly, leads to significantly better generalization performance and emergent capabilities, inherited from web-scale vision-language pre-training.”

RT-2 is more than a straightforward improvement on existing VLMs; it also demonstrates the potential to build a versatile physical robot capable of reasoning, problem-solving, and interpreting information across a diverse range of real-world tasks.

George Mavridis is a journalist currently conducting his doctoral research at the Department of Journalism and Mass Media at Aristotle University of Thessaloniki (AUTH). He holds a degree from the same department, as well as a Master’s degree in Media and Communication Studies from Malmö University, Sweden, and a second Master’s degree in Digital Humanities from Linnaeus University, Sweden. In 2024, he completed his third Master’s degree in Information and Communication Technologies: Law and Policy at AUTH. Since 2010, he has been professionally involved in journalism and communication, and in recent years, he has also turned to book writing.