This is my 4th Substack post in the "Understanding AI" series for non-technical founders. Over the past weeks, a handful of people have reached out wanting to learn more about how they can leverage AGI in their day-to-day work. It's exciting to see so many people looking to bring AGI capabilities into their value chain, but sometimes it feels like the chatbot is the only use case we reach for. My first thought about the conversational bot being the most adopted AGI use case is: yeah, sure. But what about the robots? I didn't watch the Terminator saga for nothing.
Probably one of my first geeky Silicon Valley moments was in early 2017, when I spotted a Waymo driving itself around. Earlier this year I had the chance to be among the first Waymo riders to take one onto a freeway, and I can tell you the car drives itself better than 90% of the people whose cars I've ever ridden in. Today I want to unpack a little of the technology behind Waymo and relate it to Mobility VLA: Multimodal Instruction Navigation with Long-Context VLMs and Topological Graphs, a paper presented today by Google, Waymo's sister company. A VLM (Vision-Language Model) is an AI model that understands images and text together; Mobility VLA uses one so a robot can follow voice instructions grounded in its surroundings.
Waymo’s Stack
Waymo's technology is built around a suite of sensors that gives the vehicle a comprehensive view of its surroundings. Lidar, a laser-based detection system, paints a detailed 3D picture of the environment, while cameras and radar add information about objects, their distance, and their speed. All of this sensor data is processed by a powerful onboard computing system based on Intel CPUs and GPUs, enabling real-time decision-making.

Waymo's custom software interprets the sensor data and plans safe routes for the vehicle. Machine learning algorithms handle perception, prediction, and planning. The perception system identifies pedestrians, cyclists, vehicles, and other objects accurately. Prediction models anticipate the possible paths of other road users, drawing on millions of miles of real-world driving and billions of miles in simulation. The planning system then determines the trajectory, speed, lane, and steering maneuvers needed to reach the destination safely.

To further refine its autonomous driving capabilities, Waymo has developed a lightweight, hardware-accelerated simulator called Waymax. This tool lets the company test its AI systems in realistic driving scenarios, leveraging data from the Waymo Open Motion Dataset to model the behavior of other road users. And as part of the Internet of Things (IoT) ecosystem, Waymo's vehicles are connected, communicating with each other and the infrastructure around them. That integration enables intelligent mobility with higher levels of safety for passengers and pedestrians.
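To make the perceive → predict → plan loop more concrete, here is a minimal, hypothetical Python sketch of those three stages. None of the names or numbers below come from Waymo's actual software; they are placeholders to illustrate the flow: detect objects, extrapolate where they are headed, then pick a safe speed.

```python
from dataclasses import dataclass

# Toy sketch of the perceive -> predict -> plan loop described above.
# Every name and number here is an illustrative placeholder, not Waymo code.

@dataclass
class TrackedObject:
    kind: str     # "pedestrian", "cyclist", "vehicle", ...
    x: float      # meters ahead of the car
    y: float      # lateral offset from our lane center, meters
    vx: float     # forward speed, m/s
    vy: float     # lateral speed, m/s

def predict_position(obj: TrackedObject, horizon_s: float) -> tuple:
    """Constant-velocity prediction: where the object will be in horizon_s seconds."""
    return obj.x + obj.vx * horizon_s, obj.y + obj.vy * horizon_s

def plan_speed(objects: list, cruise_speed: float = 15.0) -> float:
    """Pick a target speed: ease off if anything is predicted to cut into our lane."""
    for obj in objects:
        future_x, future_y = predict_position(obj, horizon_s=2.0)
        in_our_lane = abs(future_y) < 1.5    # assumed lane half-width, meters
        close_ahead = 0.0 < future_x < 30.0  # within 30 m in front of us
        if in_our_lane and close_ahead:
            return min(cruise_speed, max(0.0, future_x / 3.0))
    return cruise_speed

# A cyclist drifting toward our lane forces the planner to slow down.
scene = [TrackedObject(kind="cyclist", x=20.0, y=3.0, vx=-1.0, vy=-1.0)]
print(plan_speed(scene))  # 6.0 m/s instead of the 15.0 m/s cruise speed
```

A real stack would of course fuse lidar, camera, and radar detections, model many candidate futures per object, and optimize full trajectories rather than a single speed, but the shape of the loop is the same.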
Exploring Robotics and AI: Mobility VLA
The paper discusses a fascinating new development in robot navigation called Mobility VLA, which stands for Vision-Language-Action navigation. This innovative approach integrates advanced AI models with practical navigation techniques to create robots that can understand and follow complex instructions in real-world environments. Mobility VLA is designed to tackle what researchers call Multimodal Instruction Navigation (MINT). Imagine a robot that not only understands spoken instructions but also processes visual cues from its environment. For instance, if you ask, "Where should I return this?" while holding an object, the robot can interpret your request using both your words and the visual context, and then guide you to the correct location.
Here's how it works:
High-Level Understanding: The robot uses a sophisticated Vision-Language Model (VLM) to interpret a demonstration tour video of the environment. This tour can be as simple as a video you take while walking around your office or home. The VLM processes this video along with your multimodal instructions (both verbal and visual) to identify the goal location.
Low-Level Navigation: Once the high-level model identifies the goal, the robot relies on a topological graph created from the demonstration tour. This graph acts as a map, guiding the robot's movements step-by-step to reach the destination. The robot makes real-time decisions at every turn, ensuring it navigates accurately and efficiently.
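To ground the two levels above in something concrete, here is a minimal, hypothetical sketch in Python: a long-context VLM is asked which frame of the demonstration tour best matches the user's words and camera image (the "goal frame"), and a topological graph built from the tour is then searched for a step-by-step route to it. The `query_vlm` stub and the graph format are my assumptions for illustration, not the paper's actual implementation.

```python
from collections import deque

def query_vlm(instruction_text, user_image, tour_frames):
    """Stand-in for a long-context VLM call: given the user's words, what their
    camera sees, and every frame of the demonstration tour, return the index of
    the tour frame that best matches the requested goal."""
    raise NotImplementedError("replace with a real VLM API call")

def shortest_path(graph, start, goal):
    """Breadth-first search over the topological graph built from the tour.
    `graph` maps a frame index to the neighboring frames the robot can reach
    directly; the returned list is the sequence of waypoints to traverse."""
    queue, parents = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return list(reversed(path))
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # goal not reachable from the robot's current position

def navigate(instruction_text, user_image, tour_frames, graph, current_frame):
    # High level: the VLM is queried once per instruction to pick the goal frame.
    goal_frame = query_vlm(instruction_text, user_image, tour_frames)
    # Low level: the graph supplies the step-by-step route; at each waypoint the
    # robot localizes itself against the tour and issues the next motion command.
    return shortest_path(graph, current_frame, goal_frame)
```

The nice property of this split is that the expensive VLM call happens only once per instruction, while the cheap graph search can rerun as often as the robot needs to replan on its way there.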
In practical terms, this means significant improvements in how robots can assist us. Imagine a robot in a busy office that not only finds its way to a designated desk but also adapts to changes and obstacles in the environment. Or think about a home assistant that helps you find misplaced items based on your descriptions and the current layout of your rooms.
One of my hunches is that Google isn't making much noise in the AGI narrative, perhaps because their early attempts failed. But Google has been a key player in robotics for a long time. Is there a chance that Alphabet's end game is to become the largest robot manufacturer and software company?
P.S. Here's a Waymo from today.