Most efforts in robot learning from demonstration focus on developing algorithms for the acquisition of specific skills from training data. While such developments are important, they often do not take into account the social structure of the process, in particular the fact that the interaction with the user, and the selection of the different interaction steps, directly influence the quality of the collected data. Similarly, skill acquisition encompasses a wide range of social and self-refinement learning strategies, including mimicking (reproducing a movement without understanding its objective), goal-level emulation (discovering the objective while discarding the specific way in which the task is achieved), and exploration guided by self-assessed rewards or by feedback from the users. Each of these strategies requires the design of dedicated algorithms, but the ways in which they can be organized and combined have so far been overlooked.
In ROSALIS, we propose to rely on natural interactions for skill learning, defined as an open-ended sequence of pragmatic frames: recurrent, naturally negotiated protocols that have emerged over time, involving queries about the skills and answers to them, including demonstrations made both by the human and by the robot to show what it has learned. The research will advance on several fronts. First, regarding skill representation, the robot learners will require an appropriate level of plasticity, allowing them to adapt and refine a skill primitive currently being learned, and to freeze a skill primitive as part of a repertoire once the skill is mastered. Learning plasticity will be explored and mathematically formalized as a statistical invariance extraction problem applied concurrently at several levels of representation. Furthermore, active learning methodologies will be developed that rely on heterogeneous sources of information (demonstrations, feedback labels, properties), allowing the robot to form hypotheses about the skill invariants and to suggest demonstrations or queries; a minimal sketch of this idea is given below.
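To make the formulation concrete, the following Python sketch treats invariance extraction as measuring, at each time step and dimension, how much time-aligned demonstrations disagree, and derives an active-learning query from the most uncertain region. The function names, the variance threshold, and the synthetic data are all illustrative assumptions, not the project's actual algorithms.

```python
import numpy as np

def extract_invariants(demos, threshold=0.05):
    """Estimate candidate skill invariants from time-aligned demonstrations.

    demos: list of (T, D) arrays sharing the same T and D.
    Returns the per-cell variance across demonstrations and a boolean mask
    marking (time step, dimension) cells where the demos agree, i.e. where
    the variance falls below an (illustrative) threshold.
    """
    stacked = np.stack(demos)          # shape (N, T, D)
    variance = stacked.var(axis=0)     # disagreement across demos, (T, D)
    return variance, variance < threshold

def suggest_query(variance):
    """Toy active-learning step: request input about the (time step,
    dimension) cell where the current hypothesis is most uncertain."""
    t, d = np.unravel_index(np.argmax(variance), variance.shape)
    return {"time_step": int(t), "dimension": int(d)}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 50)
    # Synthetic demos: dimension 0 is reproduced identically (an invariant),
    # dimension 1 varies freely from one demonstration to the next.
    demos = [np.column_stack([np.sin(2 * np.pi * t), rng.normal(size=50)])
             for _ in range(5)]
    variance, invariant = extract_invariants(demos)
    print("fraction of invariant cells:", invariant.mean())
    print("next query:", suggest_query(variance))
```

In the project itself, such hypotheses would be extracted concurrently at several levels of representation rather than on raw trajectories alone.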
Secondly, to allow natural interactions, we will design perception algorithms that provide a higher-level understanding of people's behaviors and intentions, including gaze information and multimodal action recognition and segmentation. The different mechanisms will be integrated within the definition of the pragmatic frames and will require coordinating (selecting and timing) different components: (i) real-time interpretation of the different multimodal inputs; (ii) synthesis of different demonstrations (primitives, partial or full instance gestures), as well as of queries and social signals expressed through verbal behaviors (questions, grounding) and non-verbal behaviors (audio backchannels, head gestures and nodding, gaze behaviors); (iii) selection of the interaction units to build scaffolded interactions, exploiting hypotheses about the skill and allowing the system to combine different learning strategies; a minimal sketch of such a selection loop is given below. We target applications of robots in both manufacturing and home/office environments, both of which require robots to be re-programmed efficiently and in a personalized manner.
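The sketch below illustrates, under assumed frame names and entirely illustrative thresholds, how the selection of interaction units could be organized: the state of the learning session determines which pragmatic frame comes next, interleaving demonstration requests, verbal queries, robot reproductions, and feedback requests. It is a caricature of the coordination problem, not the project's actual policy.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Frame(Enum):
    """Assumed pragmatic frames; the real repertoire is negotiated over time."""
    REQUEST_DEMONSTRATION = auto()  # ask the human for a (partial) demo
    ASK_PROPERTY = auto()           # verbal query about a skill property
    ROBOT_SHOWS_SKILL = auto()      # robot reproduces what it has learned
    REQUEST_FEEDBACK = auto()       # ask for a label on the last attempt

@dataclass
class SessionState:
    n_demos: int = 0
    uncertainty: float = 1.0        # self-assessed doubt about the skill
    awaiting_feedback: bool = False

def select_next_frame(state: SessionState) -> Frame:
    """Toy scaffolding policy: gather demonstrations first, ask targeted
    questions while uncertainty stays high, then show the learned skill
    and request feedback. All thresholds are illustrative."""
    if state.n_demos < 2:
        return Frame.REQUEST_DEMONSTRATION
    if state.awaiting_feedback:
        return Frame.REQUEST_FEEDBACK
    if state.uncertainty > 0.5:
        return Frame.ASK_PROPERTY
    return Frame.ROBOT_SHOWS_SKILL

if __name__ == "__main__":
    state = SessionState()
    for _ in range(6):
        frame = select_next_frame(state)
        print(frame.name)
        # Crudely simulate the effect of each frame on the session state.
        if frame is Frame.REQUEST_DEMONSTRATION:
            state.n_demos += 1
            state.uncertainty *= 0.7
        elif frame is Frame.ASK_PROPERTY:
            state.uncertainty *= 0.6
        elif frame is Frame.ROBOT_SHOWS_SKILL:
            state.awaiting_feedback = True
        else:  # REQUEST_FEEDBACK
            state.awaiting_feedback = False
            state.uncertainty *= 0.8
```

In the envisioned system, this selection step would additionally be driven by the real-time multimodal interpretation of the user's behavior rather than by scalar state variables.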