Robots Begin to Develop Common Sense Knowledge
MIT engineers are working to give robots a bit of “common sense” when faced with situations that push them off their trained paths.
Technology Briefing

Transcript


In other robot-related news, MIT engineers are working to give robots a bit of “common sense” when faced with situations that push them off their trained paths. To do so, they’ve developed a method that connects robot motion data with “common sense knowledge” contained in large language models, or LLMs.

Their approach enables a robot to logically parse a given household task into subtasks, and to physically adjust to disruptions within a subtask, so that the robot can move on without having to go back and start the task from scratch, and without engineers having to explicitly preprogram fixes for every possible failure along the way. This is particularly valuable for imitation learning, which is becoming a mainstream approach for training household robots: if a robot blindly mimics a human’s motion trajectories, tiny errors can accumulate and eventually derail the rest of the execution. With MIT’s new method, a robot can self-correct execution errors and improve overall task success.

The new approach is detailed in a study the researchers will present at the International Conference on Learning Representations in May 2024. They illustrate the new approach with a simple chore: scooping marbles from one bowl and pouring them into another. To accomplish this task, engineers would typically move a robot through the motions of scooping and pouring, as one fluid trajectory. They might do this multiple times, to give the robot a number of human demonstrations to mimic. However, the team realized that, while a human might demonstrate a single task in one go, the task actually depends on a sequence of subtasks.

For instance, the robot has to reach into a bowl before it can scoop, and it must scoop up marbles before moving to the empty bowl, and so forth. If the robot is pushed or nudged into a mistake during any of these subtasks, its only recourse is to stop and start from the beginning, unless engineers explicitly label each subtask and program the robot to recover from each possible failure, enabling it to self-correct in the moment. That level of planning is very tedious. Instead, the team found that some of this work could be done automatically by an LLM.
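To see why that kind of hand-authored planning scales poorly, consider a rough Python sketch of what an engineer would otherwise have to write by hand for this single chore; the subtask names, success checks, and helper functions here are hypothetical illustrations, not part of the MIT system.

# Hypothetical, hand-written recovery table for a single chore.
# Every subtask needs its own success check, and the whole table
# must be rewritten for every new task the robot is taught.

def marbles_in_spoon(state):
    # Placeholder sensor check; a real system would inspect camera or
    # force readings to decide whether the scoop actually succeeded.
    return state.get("marbles_in_spoon", 0) > 0

HAND_CODED_PLAN = [
    {"subtask": "reach",     "done_when": lambda s: s["gripper_in_bowl"]},
    {"subtask": "scoop",     "done_when": marbles_in_spoon},
    {"subtask": "transport", "done_when": lambda s: s["over_target_bowl"]},
    {"subtask": "pour",      "done_when": lambda s: s["marbles_in_target"]},
]

def run_hand_coded(plan, execute, read_state):
    # Re-run each subtask until its success check passes. Workable for
    # one task, but tedious to author and maintain across many chores.
    for step in plan:
        while not step["done_when"](read_state()):
            execute(step["subtask"])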

These deep learning models process immense libraries of text, which they use to establish connections between words, sentences, and paragraphs. Through these connections, an LLM can then generate new sentences based on what it has learned about which words are likely to follow the ones before them. Beyond sentences and paragraphs, the researchers found that an LLM can also be prompted to produce a logical list of the subtasks involved in a given task.

For instance, if queried to list the actions involved in scooping marbles from one bowl into another, an LLM might produce a sequence of verbs such as “reach,” “scoop,” “transport,” and “pour.” LLMs have a way to tell you how to do each step of a task, in natural language. A human’s continuous demonstration is the embodiment of those steps, in physical space. And the researchers wanted to connect the two, so that a robot would automatically know what stage it is at in a task and be able to replan and recover on its own.
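As a rough illustration of that prompting step, the Python sketch below asks a language model to break a task description into an ordered list of subtask verbs; call_llm is a hypothetical stand-in for whatever LLM client is used, and the prompt wording is ours, not the researchers'.

def call_llm(prompt: str) -> str:
    # Hypothetical wrapper around an LLM API (e.g., a chat-completion
    # endpoint); swap in whichever client your project uses.
    raise NotImplementedError

def decompose_task(task_description: str) -> list[str]:
    """Ask the LLM for an ordered list of subtask verbs, one per line."""
    prompt = (
        f"Task: {task_description}\n"
        "List the subtasks needed to complete this task, in order, "
        "one short verb per line."
    )
    reply = call_llm(prompt)
    return [line.strip().lower() for line in reply.splitlines() if line.strip()]

# For example, decompose_task("scoop marbles from one bowl and pour them into another")
# might return ["reach", "scoop", "transport", "pour"].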

For their new approach, the team developed an algorithm to automatically connect an LLM’s natural language label for a particular subtask with a robot’s position in physical space or an image that encodes the robot state. Mapping a robot’s physical coordinates, or an image of the robot state, to a natural language label is known as “grounding.”

The team’s new algorithm is designed to learn a “grounding classifier,” meaning that it learns to automatically identify what semantic subtask a robot is in, given its physical coordinates or an image view. In the marble example, the subtasks might be “reach” versus “scoop.” The grounding classifier facilitates this dialogue between what the robot is doing in the physical space and what the LLM knows about the subtasks, as well as the constraints it needs to pay attention to within each subtask.
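Conceptually, a grounding classifier is just an ordinary classifier over robot state features. The toy Python sketch below, using scikit-learn and made-up two-dimensional gripper positions, illustrates the idea; it is not the model or the training procedure from the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

SUBTASKS = ["reach", "scoop", "transport", "pour"]

# Made-up robot states: here just (x, z) gripper positions; a real system
# would use full joint coordinates and/or image features.
states = np.array([
    [0.10, 0.30], [0.12, 0.18],   # moving toward / into the source bowl
    [0.12, 0.05], [0.13, 0.04],   # low in the bowl, scooping
    [0.40, 0.30], [0.60, 0.32],   # carrying marbles toward the target bowl
    [0.80, 0.20], [0.82, 0.15],   # tilting over the target bowl
])
labels = [0, 0, 1, 1, 2, 2, 3, 3]  # indices into SUBTASKS

grounding_classifier = LogisticRegression(max_iter=1000).fit(states, labels)

def current_subtask(state):
    """Map a physical robot state to the LLM's semantic subtask label."""
    return SUBTASKS[grounding_classifier.predict([state])[0]]

print(current_subtask([0.11, 0.06]))  # likely "scoop"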

After a few demonstrations, the team used a pretrained LLM to list the steps involved in scooping marbles from one bowl into another. The researchers then used their new grounding algorithm to connect the LLM’s defined subtasks with the robot’s motion trajectory data. The algorithm automatically learned to map the robot’s physical coordinates in the trajectories, and the corresponding image views, to a given subtask.
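Put together, that training pipeline can be outlined roughly as follows; the helper names are hypothetical, decompose_task refers to the earlier sketch, and the actual algorithm learns the alignment between demonstration frames and subtasks rather than being handed it.

def train_grounding(demonstrations, task_description, align_frames, fit_classifier):
    # Rough outline: LLM-defined subtasks + human demonstrations -> grounding classifier.
    # demonstrations: list of trajectories, each a list of (coordinates, image) frames.
    # align_frames:   hypothetical routine assigning each frame to a subtask label
    #                 (the researchers' method learns this mapping automatically).
    # fit_classifier: any supervised learner, e.g. the scikit-learn model sketched above.

    # 1. Ask the LLM for the ordered subtasks of the demonstrated chore.
    subtasks = decompose_task(task_description)  # e.g. ["reach", "scoop", "transport", "pour"]

    # 2. Associate every demonstration frame with one of those subtasks.
    frames, frame_labels = [], []
    for trajectory in demonstrations:
        for frame, label in align_frames(trajectory, subtasks):
            frames.append(frame)
            frame_labels.append(label)

    # 3. Fit the grounding classifier on (robot state, subtask) pairs.
    return subtasks, fit_classifier(frames, frame_labels)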

The team then let the robot carry out the scooping task on its own, using the newly learned grounding classifiers. As the robot moved through the steps of the task, the experimenters pushed and nudged the robot off its path and knocked marbles off its spoon at various points.

Rather than stop and start from the beginning again, or continue blindly with no marbles on its spoon, the robot was able to self-correct, and completed each subtask before moving on to the next. For instance, it would make sure that it successfully scooped marbles before transporting them to the empty bowl.
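In code, that self-correcting behavior might look something like the sketch below; the helper functions (execute_step, read_state, task_done) are hypothetical placeholders, and the real controller replans motions rather than simply re-running a labeled step.

def execute_with_recovery(subtasks, current_subtask, execute_step, read_state,
                          task_done, max_steps=1000):
    """Run subtasks in order, letting the grounding classifier track progress.

    subtasks:        ordered labels from the LLM, e.g. ["reach", "scoop", ...]
    current_subtask: grounding classifier, robot state -> subtask label
    execute_step:    hypothetical low-level controller for one subtask
    task_done:       hypothetical end-of-task check (e.g. marbles in target bowl)
    """
    for _ in range(max_steps):
        state = read_state()
        if task_done(state):
            return True

        # Ask the grounding classifier where the robot actually is. A nudge
        # or dropped marbles can push it back to an earlier subtask; in that
        # case the robot simply redoes that subtask instead of restarting
        # the whole chore.
        stage = current_subtask(state)
        execute_step(stage, state)
    return False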
