Unlike human communication, which is full of nuance and subtlety, today's robots understand only the literal. They can learn by repetition, but for machines language means direct commands, and they fare poorly with complex requests. Even the seemingly minor difference between "pick up the red apple" and "pick it up" can be too much for a robot to decipher.
However, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) want to change that. They think they can help robots process these complex requests by teaching machine systems to understand context.
In a paper they presented at the International Joint Conference on Artificial Intelligence (IJCAI) in Australia last week, the MIT team showcased ComText — short for “commands in context” — an Alexa-like system that helps a robot understand commands that involve contextual knowledge about objects in its environment.
Essentially, ComText allows a robot to visualize and understand its immediate environment and infer meaning from that environment by developing what's called an "episodic memory." These memories are more "personal" than semantic memories, which are generally just facts, and they can include information about an encountered object's size, shape, and position, and even whether it belongs to someone.
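To make that distinction concrete, here is a minimal Python sketch of the idea, not ComText's actual (probabilistic) model: semantic memory holds general facts, while episodic memory records specific objects the robot has seen along with their size, shape, position, and owner, and a contextual command like "pick it up" is resolved against that episodic record. All names and the resolution rule are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ObservedObject:
    """One entry in the robot's episodic memory: a specific thing it has seen."""
    name: str
    size_cm: float
    shape: str
    position: tuple                 # (x, y) on the table, arbitrary units
    owner: Optional[str] = None     # ownership is contextual knowledge, not a general fact


@dataclass
class Memory:
    # Semantic memory: general facts, true regardless of the current scene.
    semantic_facts: dict = field(default_factory=lambda: {"apple": "a kind of fruit"})
    # Episodic memory: what the robot has actually observed, most recent last.
    episode: list = field(default_factory=list)

    def observe(self, obj: ObservedObject) -> None:
        self.episode.append(obj)

    def resolve(self, command: str) -> Optional[ObservedObject]:
        """Crude contextual resolution of a command against episodic memory."""
        if " it " in f" {command} ":
            # "Pick it up": fall back to the most recently observed object.
            return self.episode[-1] if self.episode else None
        for obj in reversed(self.episode):
            if obj.name in command:
                return obj
        return None


memory = Memory()
memory.observe(ObservedObject("apple", 8.0, "round", (0.2, 0.5), owner="Alice"))
memory.observe(ObservedObject("mug", 10.0, "cylinder", (0.6, 0.1)))

print(memory.resolve("pick up the apple"))  # matches the remembered apple
print(memory.resolve("pick it up"))         # "it" resolves to the last-seen object, the mug
```

The point of the toy example is only to show why episodic memory matters: without the record of what was just observed, a pronoun like "it" has nothing to bind to.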
When they tested ComText on a two-armed humanoid robot called Baxter, the researchers observed that the bot was able to perform 90 percent of the complex commands correctly.
“The main contribution is this idea that robots should have different kinds of memory, just like people,” explained lead researcher Andrei Barbu in a press release. “We have the first mathematical formulation to address this issue, and we’re exploring how these two types of memory play and work off of each other.”
BETTER COMMUNICATION, BETTER BOTS
Of course, ComText still has a great deal of room for improvement, but eventually, it could be used to narrow the communication gap between humans and machines.
“Where humans understand the world as a collection of objects and people and abstract concepts, machines view it as pixels, point-clouds, and 3-D maps generated from sensors,” noted Rohan Paul, one of the study’s lead authors. “This semantic gap means that, for robots to understand what we want them to do, they need a much richer representation of what we do and say.”
Ultimately, a system like ComText could allow us to teach robots to quickly infer an action’s intent or to follow multi-step directions.