[Forward Thinking] Foundation Models – Connecting AI to the Physical World
2023.12.06
Generative AI is changing the world. To be more precise, it is changing the ‘digital world.’
What are these changes we have recently been facing? There are chatbots that provide us with natural, human-like answers to almost any question, 24 hours a day, 7 days a week. There are image-generation AIs that push the boundaries of creativity for artists and designers, while opening up entirely new experiences for people who are less talented when it comes to drawing. Thanks to AI-powered avatars, we are now capable of expressing ourselves even more realistically in the virtual world. These are three examples of how AI is changing the digital world by significantly extending our own capabilities and enriching the way we operate and communicate.
There is, however, another world that we should be paying attention to: the physical world we live in. Generative AI can be applied in the physical world as well – and it might even hold greater potential here. Let me introduce our research on this today.
AI for the Physical World
NAVER LABS’ main research target has been connecting AI to the physical world. The research fields we are primarily focusing on are: “action” - robots performing tasks in various environments; “vision” - robots perceiving and understanding different environments; and “interaction” - robots interacting and collaborating with humans.
One of the most notable achievements from this work is NAVER’s second headquarters, “1784.” Here, 100 robots provide employees with a variety of services, such as serving coffee or delivering packages. AI technology plays an essential role in these services, as it gives robots the intelligence and versatility they need.
Going beyond buildings, we have also been expanding our technology to cover entire cities. Our digital twin technology allows us to replicate vast physical spaces in the digital world. AI’s ability to comprehend the complex physical world is critical for digital twins, and we use this ability everywhere, from service robots, city simulators and self-driving vehicles, to AR navigation.
As one can tell, AI is already playing a crucial part in enabling a better understanding of the physical world. To further accelerate this, about 2 years ago, we began focusing on a new approach to AI: foundation models.
The Transition to Foundation Models
In 2021, at NAVER LABS Europe (NLE), we made an important decision. We decided to focus our projects on the creation and utilization of foundation models. A foundation model is a comprehensive model with broad knowledge, trained on a massive amount of data, that can be adapted or fine-tuned towards a specific task. While we were confident that this methodology would become mainstream, it was still a difficult decision to make considering the substantial transition costs.
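To make "adapted or fine-tuned towards a specific task" more concrete, here is a minimal sketch of one common way of adapting a pretrained model: the pretrained part stays frozen and only a small task-specific head is trained on top of its features. Everything in it is an assumption made for illustration. `pretrained_features` is a hypothetical stand-in for a real foundation model, and the toy task, learning rate, and epoch count are arbitrary; it does not describe the models used at NAVER LABS.

```python
def pretrained_features(x):
    """Hypothetical stand-in for a frozen foundation model.

    A real model would map an image or sentence to a learned feature vector;
    here a number is mapped to three fixed features so the example runs anywhere.
    """
    return [x, x * x, 1.0]


def fit_task_head(inputs, labels, lr=0.05, epochs=500):
    """Train only a small linear head on top of the frozen features."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in zip(inputs, labels):
            feats = pretrained_features(x)  # frozen: never updated
            pred = sum(w * f for w, f in zip(weights, feats))
            err = pred - y
            # Gradient step on the squared error, applied to the head only.
            weights = [w - lr * err * f for w, f in zip(weights, feats)]
    return weights


def predict(weights, x):
    """Apply the frozen features followed by the trained head."""
    return sum(w * f for w, f in zip(weights, pretrained_features(x)))


# Toy downstream task with only five labelled examples: y = 2 * x^2 + 1.
train_x = [-1.0, -0.5, 0.0, 0.5, 1.0]
train_y = [2 * x * x + 1 for x in train_x]
head = fit_task_head(train_x, train_y)
```

The point of the sketch is the division of labour: the broad knowledge lives in the frozen part, so a new task needs only a handful of examples and a small number of trainable parameters.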
Why did we push through? To explain the reason for this transition, we must discuss the limitations of traditional approaches to AI.
Traditional approaches to AI typically involve identifying a specific problem, collecting relevant data, and training a neural network to find a solution. As effective as this is, the resulting AI models are difficult to apply to real-world situations. First, model performance often decreases when shifting from training to deployment. Second, from a business perspective, it can be difficult to develop a model for the diverse needs of individual users.
To better illustrate this challenge, let’s take AI developed for robots as an example. Technologically speaking, it is extremely challenging to get a robot to autonomously perform a task in the physical (that is, the real) world, because it is a highly uncontrolled environment. “Uncontrolled” here refers to an environment that is impossible to accurately predict. The more complex the task that a robot performs in an unpredictable environment, the more training data the traditional approach to AI requires to ensure that the model for that task covers all the possible outcomes. In practice, this significantly limits the complexity of the tasks that robots can execute autonomously if they are trained with the traditional approach to AI.
Foundation models are a promising solution to the problems of uncontrolled environments, as they significantly reduce the effort required to support new tasks and because they can cope with a wide range of situations thanks to their broad knowledge.
The decision to focus on foundation models helped to increase the performance of our models, but more importantly, research on foundation models enabled researchers from different domains to create and leverage synergies.
AI, Overcoming Complexities of the Real World
Unfortunately, the complexity of the physical world still persists and poses challenges to the creation of foundation models. First, it is not easy to obtain the necessary data for such models in the physical world. Second, even if we somehow acquire such data, the real world constantly changes, so the data must be updated quickly to keep up. The latter requires mechanisms that allow fast transfer between training and deployment, which is a challenge for very large models. For these and other reasons, research on foundation models for the physical world seems to remain at an early stage compared with the numerous accomplishments of foundation models in the digital world.
Luckily for us, we have a decent head start, with robots constantly operating in everyday spaces in our massive testbeds (e.g. 1784), and considerable advantages in dataset creation and updating. Owing to these strengths, we work on multiple projects related to foundation models in the physical world, one of which is CROCO, our 3D vision foundation model for robots and digital twins.
Short for “cross-view completion,” CROCO learns to understand the world in 3D by looking at millions of image pairs, each showing the same scene from two different viewpoints. CROCO builds its broad knowledge by learning how to reconstruct one image from the other, a task that is only possible if the model understands 3D. Once we have taught CROCO to understand the world in 3D, we can fine-tune it for downstream tasks.
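The cross-view completion idea described above can be sketched as a data-preparation step plus a loss. This is a simplified illustration under stated assumptions, not the actual CROCO implementation: it assumes that parts of the first view are hidden as patches, that the second view is given in full as context, and that a model (not shown) is scored only on the hidden patches. The function names, the patch size, and the masking ratio are all hypothetical.

```python
import random


def split_into_patches(image, patch):
    """Split a 2D image (list of rows) into flattened patch x patch tiles."""
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            tiles.append([image[top + r][left + c]
                          for r in range(patch) for c in range(patch)])
    return tiles


def make_training_example(view_a, view_b, patch=2, mask_ratio=0.75, seed=0):
    """Hide a random subset of patches in view A; keep view B complete.

    Returns (visible patches of A, all patches of B, hidden patches of A).
    The hidden patches are what a model would be asked to reconstruct.
    """
    rng = random.Random(seed)
    patches_a = split_into_patches(view_a, patch)
    patches_b = split_into_patches(view_b, patch)
    n_masked = int(len(patches_a) * mask_ratio)
    masked = set(rng.sample(range(len(patches_a)), n_masked))
    visible_a = {i: p for i, p in enumerate(patches_a) if i not in masked}
    targets = {i: patches_a[i] for i in sorted(masked)}
    return visible_a, patches_b, targets


def reconstruction_loss(predicted, targets):
    """Mean squared error, computed over the hidden patches only."""
    total, count = 0.0, 0
    for index, target in targets.items():
        for p, t in zip(predicted[index], target):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0
```

Because most of the first view is hidden, the visible patches alone are not enough to fill in the rest. The model has to work out how the second viewpoint relates to the first, which is where the 3D understanding comes from.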
Using CROCO, we can improve robots’ adaptability to the complex physical world. Think about multiple service robots connected to the cloud, and as such, having access to foundation models. By easily adapting to changes in the environment, robots can move beyond familiar areas to provide services in many different places, potentially even in places they have never seen before.
In addition, foundation models like CROCO can be a great way to increase the efficiency and performance of large-scale robot mobility, where problems such as multi-robot routing, scheduling, logistics, navigation, and so on cannot be solved using traditional AI approaches.
Ultimately, we aim to enable robots to explore spaces like humans do, and not by solely relying on precise maps.
We can also expect huge shifts in the way robots interact with humans. For robots, we humans are extremely complicated objects to understand. Imagine if we could get robots to comprehend human behavior and intent. This would allow the creation of robot services that better understand people and interact in a safer manner.
Needless to say, this research is critical for the popularization of robots.
From a Single Solution to the Bigger Future
We still have a number of problems to solve to ensure that robots operate better in the physical world, but through foundation models, the fundamental approach to these challenges has changed. Instead of a single model addressing a single problem, the foundation model approach allows a model to become a good starting point to quickly and effectively address other problems.
What does this entail for the future? A world where thousands of robots operate in our physical space, solving thousands of tasks in parallel.
With NAVER LABS’ broad AI and robotics knowledge, combined with our pipeline of real-world robots operating in 1784 and the 2nDC, our technology goes beyond the research lab.
We have been working and will continue to work on making robots useful to everyone in their everyday lives.
▶︎ Martin Humenberger is the scientific leader of NAVER LABS Europe, which is the largest industrial AI research lab in France. NLE is continuing its research on foundation models to enable robots to understand real-world spaces and interact effectively with humans.
▶︎ Forward Thinking is a recurring online publication focusing on today’s major technological trends, such as AI, robotics, autonomous driving, and metaverses, containing stories of outstanding researchers collaborating with NAVER LABS. www.naverlabs.com/en/forwardthinking