Recent collaborations between Idiap and industry have revealed strong interest in, and great potential for, autonomous smart sound devices deployed for security, surveillance, or emergency applications. CSEM's developments in building-occupancy detection and monitoring using embedded vision have led to the creation of a successful start-up (SNAP Sensor), now owned by Analog Devices (ADI). Industry has expressed great interest in covering the elder-care segment, in order to detect and manage critical situations (falls, distress, …).
However, visual detection and localization of people in buildings is limited in many ambiguous situations, where the absence of motion or the presence of artefacts fools the detection. Furthermore, the cues needed to evaluate the criticality of a situation (e.g. in care applications) are often absent from the visual signal. The goal of the present project is to bring in complementary information from the analysis of ambient sound and to combine it with the outcomes of the visual analysis. This will radically improve the robustness of the system (going multi-modal) and provide the features (speech analysis, sound localisation, and automatic speaker and speech recognition) needed to address the targeted market segments.
The combination of visual and sound information will take place on an embedded platform providing industrial-grade vision sensing together with a front-end comprising up to 4 microphones, whose baseline will be compatible with the system dimensions (max. 15 cm). A dedicated computing resource will be allocated to sound processing (resource sharing is not wise to address in a first project); it will be selected during the course of the project, depending on the results of algorithm optimization. CSEM will prepare an adequate platform to support the demonstration and will provide expertise in visual analysis and data fusion.
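As a rough illustration of what sound localisation with such an array involves: with a 15 cm baseline, the inter-microphone delay never exceeds about 0.15 m / 343 m/s ≈ 440 µs (roughly 7 samples at 16 kHz), which bounds the search range of a classical GCC-PHAT time-difference-of-arrival estimator. The sketch below is illustrative only; it assumes a two-microphone far-field model and a 16 kHz sampling rate (neither is specified above) and is not the project's actual algorithm.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, at roughly room temperature
BASELINE = 0.15         # m, max. array dimension from the project spec
FS = 16000              # Hz, assumed sampling rate (not given in the text)

def gcc_phat(sig, ref, fs=FS, max_tau=BASELINE / SPEED_OF_SOUND):
    """Estimate the time difference of arrival (TDOA, seconds) between
    two microphone signals using GCC-PHAT cross-correlation."""
    n = sig.shape[0] + ref.shape[0]
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    # PHAT weighting: keep only the phase of the cross-spectrum
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)
    # The geometry bounds the physically possible lag to max_tau
    max_shift = int(fs * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / fs

def doa_degrees(tau, d=BASELINE):
    """Far-field angle of arrival for a two-microphone pair."""
    return np.degrees(np.arcsin(np.clip(tau * SPEED_OF_SOUND / d, -1.0, 1.0)))

# Synthetic check: delay a noise burst by 3 samples between the two mics
rng = np.random.default_rng(0)
x = rng.standard_normal(FS)
delay = 3
mic1 = x
mic2 = np.concatenate((np.zeros(delay), x[:-delay]))
tau = gcc_phat(mic2, mic1)
print(f"TDOA: {tau * 1e6:.0f} us, DOA: {doa_degrees(tau):.1f} deg")
```

With 4 microphones, the same pairwise estimates can be combined over multiple baselines to obtain a 2-D direction of arrival; the 440 µs bound also shows why the on-board audio processing can work with short correlation windows.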