Cognitive systems have always been a strong theme of the computing research community in general, and of Idiap in particular. Until the last decade, such research tended to be placed under headings such as speech recognition or image processing. The scenarios were typically unimodal, with tight constraints on how a human could behave with respect to the computer. Over time, however, our systems have produced better and better results on evaluation databases, and we are both able and obliged to move the goalposts. For example, speech recognition has to be able to deal with spontaneity, background noise, and adaptation to the environment and task, as well as with multilingual aspects (too often underestimated, with the main emphasis placed on English only). In robotic vision, also covered by the present project, computers have to be able to adapt to changing environments and extract relevant semantic information.
This project thus encompasses fundamental research aimed at the development of advanced techniques for Interactive Cognitive Systems (computers and robots) that process and interpret complex audio and visual scenes. While oriented towards fundamental research, its core objective is the study of methods applied to the domains of activity of the Idiap Research Institute.
In the present proposal, we briefly describe four research projects that embody some of the challenges described above:
ICS-1: Robust privacy-sensitive audio features for interaction modeling. On the one hand, advances in cognitive systems raise more and more privacy concerns. On the other hand, it is also interesting to see how much information about human-computer and human-human interaction can be extracted using audio features that fully preserve the privacy of the users (typically avoiding the extraction of lexical and identity information). This project therefore investigates how to detect and model interaction, and how it relates to other aspects of natural human behaviour, based on privacy-preserving features only.
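To make the notion of a privacy-preserving audio feature concrete, the sketch below computes per-frame energy and zero-crossing rate. These low-level features indicate speaking activity and coarse prosody but carry essentially no lexical or speaker-identity content. This is an illustrative toy only; the frame length, sampling rate, and feature choice are assumptions, not the project's actual feature set.

```python
import math

def frame_features(samples, frame_len=160):
    """Per-frame (energy, zero-crossing rate) over non-overlapping frames.

    Energy reflects speech activity; zero-crossing rate roughly tracks
    spectral content. Neither allows recovery of the words spoken.
    frame_len=160 corresponds to 20 ms at an assumed 8 kHz sampling rate.
    """
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        # Count sign changes between consecutive samples.
        zcr = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        ) / (frame_len - 1)
        feats.append((energy, zcr))
    return feats

# Synthetic example: 0.2 s of a 100 Hz tone followed by 0.2 s of silence.
signal = [math.sin(2 * math.pi * 100 * t / 8000) for t in range(1600)]
signal += [0.0] * 1600
feats = frame_features(signal)
```

On this synthetic signal, the tone frames show high energy and non-zero crossing rate, while the silent frames show neither, which is exactly the kind of activity cue such features can provide without revealing content.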
ICS-2: Multilingual speech recognition. The goal of this sub-project is to investigate extensively how to extrapolate Idiap’s leading edge in (English) speech recognition to multiple languages, including at least the Swiss national languages. In this context, we are looking for principled approaches towards the definition and training of shared multilingual phone sets, fast adaptation of mono-lingual systems, and composition of multiple (mono-lingual) systems.
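The idea of a shared multilingual phone set can be illustrated with a toy mapping from language-specific phone labels onto a common (IPA-like) inventory. The inventories and mappings below are hand-made assumptions for illustration only, not Idiap's actual phone sets; in practice such mappings are defined by phonetic expertise or learned from data.

```python
# Hypothetical mono-lingual phone inventories, each mapped onto a
# shared IPA-like phone set (all names are illustrative assumptions).
PHONE_MAP = {
    "en": {"AA": "a", "IY": "i", "SH": "ʃ", "TH": "θ"},
    "de": {"a:": "a", "i:": "i", "sch": "ʃ", "ç": "ç"},
    "fr": {"a": "a", "i": "i", "ch": "ʃ", "u": "y"},
}

def shared_phone_set(phone_map):
    """Union of all mapped phones: the shared multilingual inventory."""
    return sorted({p for lang in phone_map.values() for p in lang.values()})

def coverage(lang, phone_map):
    """Fraction of a language's phones also used by other languages,
    i.e. how much cross-lingual acoustic-model sharing is possible."""
    own = set(phone_map[lang].values())
    others = {p for l, m in phone_map.items() if l != lang
              for p in m.values()}
    return len(own & others) / len(own)
```

A high coverage value suggests that a mono-lingual acoustic model can bootstrap a new language cheaply by reusing shared phone models, with only the uncovered phones needing language-specific training data.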
ICS-3: Learning semantic spatial concepts for mobile robots. In this sub-project, we investigate how a robot can adapt itself to a possibly changing environment. Rather than sticking to static outdoor environments, we focus on an indoor home or office environment, where furniture and people move around. Although we initially focus on a computer vision modality, the work has the potential to extend to audio-based cognition.
ICS-4: Conversation analysis based on speaker diarization. Idiap has always been at the leading edge in the area of speaker diarization (“Who spoke when?”). ICS-4 proposes a novel speaker diarization approach that is adaptive to its context, taking cues not only from the speakers themselves, but also from the higher-level semantic context available from dialogue and turn-taking.
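At the heart of any diarization system is a clustering step: each speech segment is summarised by a feature vector and segments are grouped into speakers. Real systems use richer features (e.g. cepstral statistics) and more sophisticated clustering; the minimal sketch below runs a two-cluster 1-D k-means on made-up per-segment mean pitch values, purely to show the grouping idea, and is not the approach proposed in ICS-4.

```python
def kmeans_1d(values, iters=20):
    """Two-cluster 1-D k-means; returns one 0/1 speaker label per value.

    Centers are initialised at the min and max of the data, then
    alternately reassigned and re-estimated for a fixed number of
    iterations (enough for this toy data to converge).
    """
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each segment to its nearest center.
        labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Re-estimate each center as the mean of its members.
        for k in (0, 1):
            members = [v for v, l in zip(values, labels) if l == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels

# Made-up per-segment mean pitch (Hz): two speakers taking turns.
segments = [118.0, 121.0, 205.0, 119.0, 210.0, 208.0, 117.0]
speakers = kmeans_1d(segments)  # one speaker label per segment
```

The resulting label sequence is the “who spoke when” output; ICS-4's contribution would then lie in letting dialogue and turn-taking context inform this grouping rather than relying on acoustic cues alone.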
The above sub-projects span the traditional cognitive spectrum of audio and video, but also include the emerging field of social cognition, and should provide strong potential for interaction. This interaction will be encouraged through the use of common tasks, databases, and software.