PhD Project Abstract

There is growing interest in artificially intelligent technology that surrounds us and makes decisions on our behalf. This creates the need for such technology to communicate with humans and to understand natural language and non-verbal behaviour that may carry information about our complex physical world. Artificial agents today still have little knowledge about the physical space that surrounds us and about the objects or concepts within our attention. We still lack computational methods for understanding the context of human conversation that involves objects and locations around us. Can we use multimodal cues from human perception of the real world as an example of language learning for robots? Can artificial agents and robots learn about the physical world by observing how humans interact with it, how they refer to it, and what they attend to during their conversations? This PhD project focuses on combining spoken language and non-verbal behaviour extracted from multi-party dialogue in order to increase context awareness and spatial understanding for artificial agents.

Multimodal Human-Robot Interaction


Multimodal Reference Resolution

Modelling visual attention and natural language references to objects. We developed a likelihood model of object saliency given the proportion of eye gaze directed at each object during referring expressions (top right corner).
Eye gaze in 3D space is automatically detected and mapped to the available targets in the room. The speech transcriptions and the syntactic parse of the natural language are also displayed in real time (top left corner).
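
To make the idea of a gaze-based likelihood concrete, here is a minimal sketch in Python. It is not the model used in the project: the function names, the additive smoothing, the uniform prior and the toy data are all assumptions made for illustration. It treats the smoothed share of gaze samples that land on each object during a referring expression as a likelihood and normalises it into a distribution over candidate referents.

```python
from collections import Counter


def gaze_proportions(gaze_targets, objects, smoothing=1.0):
    """Share of gaze samples on each object, with additive smoothing.

    gaze_targets: per-sample gaze labels recorded during one referring
    expression (hypothetical format; None means no known target was hit).
    """
    counts = Counter(t for t in gaze_targets if t in objects)
    total = sum(counts.values()) + smoothing * len(objects)
    return {obj: (counts[obj] + smoothing) / total for obj in objects}


def referent_posterior(gaze_targets, objects, prior=None, smoothing=1.0):
    """Combine gaze proportions (used as a likelihood) with a prior."""
    likelihood = gaze_proportions(gaze_targets, objects, smoothing)
    if prior is None:
        # Assumption: with no other evidence, all objects are equally likely.
        prior = {obj: 1.0 / len(objects) for obj in objects}
    unnormalised = {obj: likelihood[obj] * prior[obj] for obj in objects}
    z = sum(unnormalised.values())
    return {obj: p / z for obj, p in unnormalised.items()}


# Toy example: gaze samples collected while a speaker says "that cup".
objects = ["cup", "box", "lamp"]
gaze = ["cup", "cup", "box", "cup", None, "lamp", "cup"]
posterior = referent_posterior(gaze, objects)
best = max(posterior, key=posterior.get)
```

In this toy example the cup receives most of the gaze samples, so it ends up as the most probable referent. A non-uniform prior, for instance one derived from the spoken description, could be passed in through the `prior` argument.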

Humans use pragmatic feedback to disambiguate references to objects in the shared space of attention. How do robots disambiguate object references to resolve word-referent mapping? We use humans’ eye-gaze direction and deictic expressions to obtain knowledge about the world surrounding the robot.

Multimodal Reference Resolution in Collaborative Assembly Tasks



The focus of FACT is on providing safe and flexible feedback in unforeseen situations, enhancing human-robot cooperation and learning from experience. The project will develop a robot that intelligently assists workers during production, with interaction based on visual feedback and natural communication.


The crowning achievement of human communication is our unique ability to share intentionality and to create and execute joint plans. Using this paradigm, we model human-robot communication as a three-step process: sharing attention, establishing common ground and forming shared goals. Prerequisites for successful communication are the ability to decode the cognitive state of the people around us (intention-reading) and the ability to build trust.