Selected works
The recent selected works presented below illustrate the range of Idiap's know-how and excellence across its research activities.
A Variational AutoEncoder for Transformers with Nonparametric Variational Information Bottleneck
J. Henderson and F. Fehr, in Proc. of International Conference on Learning Representations (ICLR), Kigali, Rwanda, 2023
A novel use of Bayesian nonparametrics to model inference of attention-based representations. Variational AutoEncoders and their Variational Information Bottleneck have been limited to a single-vector latent space. Recent advances in deep learning have been dominated by Transformer architectures, whose latent space is an unboundedly large set of vectors. The authors combine the advantages of the former variational Bayesian models with those of the latter attention-based representations by modelling the latent space as a nonparametric space of mixture distributions, and using Dirichlet Processes to model distributions over these distributions. This idea has profound implications for our understanding of deep learning models, such as Large Language Models, and of the cognitive tasks they excel at. This work was done in the SNSF NCCR Evolving Language.
Measuring Linkability of Protected Biometric Templates Using Maximal Leakage
H. O. Shahreza, Y. Y. Shkel and S. Marcel, IEEE Transactions on Information Forensics and Security, 18(7): 2262-2275, 2023
A new metric to measure the unlinkability of biometric templates. The proposed method is based on maximal leakage, a well-studied measure in information-theoretic approaches. The resulting linkability measure has a number of important theoretical properties and an operational interpretation used to evaluate the linkability of protected biometric templates from different biometric characteristics (face, voice, and finger vein). This work is a joint Idiap-EPFL collaboration and is the outcome of the H2020 MSCA-ITN-ETN TRaining in Secure and PrivAcy-preserving biometricS (TReSPasS) project.
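For readers unfamiliar with the underlying quantity, the sketch below computes maximal leakage for a small discrete channel using the standard definition, L(X→Y) = log2 Σ_y max_x P(y|x). It illustrates the information-theoretic measure only and is not the linkability measure proposed in the paper; the function name and the toy channels are assumptions made for illustration.

```python
import math

def maximal_leakage(channel):
    """Maximal leakage, in bits, of a discrete channel.

    channel[x][y] = P(Y = y | X = x). Each row is assumed to sum to 1,
    and every input x is assumed to have non-zero probability.
    """
    num_outputs = len(channel[0])
    # For each output y, take the largest conditional probability over inputs x.
    total = sum(max(row[y] for row in channel) for y in range(num_outputs))
    return math.log2(total)

# Hypothetical toy channels (not data from the paper).
noiseless = [[1.0, 0.0], [0.0, 1.0]]  # output reveals the input exactly
useless = [[0.5, 0.5], [0.5, 0.5]]    # output is independent of the input

print(maximal_leakage(noiseless))  # 1.0
print(maximal_leakage(useless))    # 0.0
```

A channel that reveals a binary input exactly leaks one bit, while a channel whose output is independent of the input leaks nothing.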
Hybrid Autoregressive Solver for Scalable Abductive Natural Language Inference
M. Valentino, M. Thayaparan, D. Ferreira, A. Freitas, In Proc. of 36th Conf. on Artificial Intelligence (AAAI), 2022.
A novel approach that enables expert systems to perform flexible, natural-language-based explanatory reasoning over arbitrarily large knowledge bases of facts. It bridges the gap between natural language inference and expert systems, which are traditionally semantically brittle, delivering models that can operate over natural language statements. The model is scalable and applicable to domains that require close integration with distributed evidence, such as evidence-based medicine, legal-case reasoning, evidence-based policy making and fact-checking.
Geometric Algebra for Optimal Control with Applications in Manipulation Tasks
T. Löw and S. Calinon, IEEE Transactions on Robotics, DOI: 10.1109/TRO.2023.3277282, 2023
Geometric algebra unifies various frameworks such as screw theory, Lie algebra and dual quaternions, while offering greater prospects for generalization and extension. In robotics, it provides a single algebra for geometric reasoning, removing the need to use multiple algebras to express geometric relations. Geometric algebra operations typically allow translations and rotations to be treated in the same way, without switching between different algebras. Geometric operations can be computed very quickly and with compact code, which is demonstrated in learning and control for robot manipulation. This work was supported by SERI for the participation in INTELLIMAN and SESTOSENSO projects (Horizon Europe).
Predicting is not Understanding: Recognizing and Addressing Underspecification in Machine Learning
D. Teney, M. Peyrard, and E. Abbasnejad, In Proc. of European Conf. on Computer Vision (ECCV), pp. 458–476, 2022.
An Idiap-EPFL collaboration on the fundamental limitations of statistical learning, which showed that the ubiquitous objective of maximizing predictive performance is often at odds with desirable properties such as fairness, interpretability, and robustness to changing conditions. The authors proposed new algorithms to identify and quantify situations where the data is insufficient to constrain the solution space to ensure such properties. This work is relevant to virtually all prevailing approaches to machine learning. By acknowledging and identifying the limitations of current models, these results can help practitioners ensure accountability in AI systems, fostering trust and responsible deployment of machine learning technologies across various domains.
Diff-Explainer: Differentiable Convex Optimization for Explainable Multi-hop Inference
M. Thayaparan, M. Valentino, D. Ferreira, J. Rozanova, A. Freitas, Transactions of the Association for Computational Linguistics, 10: 1103–1119, 2022
A controllable abductive neuro-symbolic solver that integrates the semantic flexibility of language models. The solver encodes complex explanatory inference patterns, helping the model deliver better generalisation and inference control. It provides a declarative programming framework that can control language models to deliver explanatory and argumentative inferences as multi-step reasoning over large knowledge bases of facts, by abstracting and unifying facts, in areas such as biomedical, investment and legal reasoning.
A surrogate gradient spiking baseline for speech command recognition
A. Bittar and P. N. Garner, Frontiers in Neuroscience, 16, Aug. 2022
A study of biologically plausible spiking neural networks with two contributions. The first is that spiking neurons can be trained using surrogate gradients in the same framework as more conventional artificial neurons. This means that the two paradigms can be freely mixed, easing the transition to spiking for cases where it is attractive. The second is showing that such hybrid networks can reach performance similar to the state of the art on a modest speech processing task, suggesting that more involved tasks are within reach. One corollary is an indication of how much power could be saved in hardware implementations of such networks. This work is an outcome of the SNSF project Neural Architectures for Speech Technology (NAST).
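The surrogate-gradient idea can be sketched in a few lines: the forward pass uses a hard threshold, while the backward pass substitutes a smooth derivative so that gradients can flow through the spike. The example below is a minimal illustration, not the paper's implementation; the fast-sigmoid surrogate, the leak factor `beta`, the `slope` value and the soft reset are all assumptions chosen for simplicity.

```python
def spike(v, threshold=1.0):
    """Forward pass: Heaviside step, 1.0 when the membrane potential reaches threshold."""
    return 1.0 if v >= threshold else 0.0

def surrogate_grad(v, threshold=1.0, slope=5.0):
    """Backward pass: derivative of a fast sigmoid, used in place of the
    step function's true derivative (which is zero almost everywhere)."""
    return 1.0 / (1.0 + slope * abs(v - threshold)) ** 2

def lif_run(inputs, beta=0.9, threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron (assumed dynamics).

    Returns the spike train and the surrogate gradient at each time step.
    """
    v = 0.0
    spikes, grads = [], []
    for x in inputs:
        v = beta * v + x              # leaky integration of the input current
        s = spike(v, threshold)
        spikes.append(s)
        grads.append(surrogate_grad(v, threshold))
        v -= s * threshold            # soft reset after a spike
    return spikes, grads

spikes, grads = lif_run([0.6, 0.6, 0.6])
print(spikes)  # [0.0, 1.0, 0.0]
```

Because the surrogate derivative is non-zero everywhere, such a neuron can sit in the same computational graph as conventional units and be trained with ordinary backpropagation.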
Image-based Deep Learning Reveals the Responses of Human Motor Neurons to Stress and VCP-related ALS
C. Verzat, J. Harley, R. Patani, R. Luisier, Neuropathology and Applied Neurobiology, 48(2): e12770, 2022
Amyotrophic lateral sclerosis (ALS) is a rapidly progressive and incurable neurodegenerative disease, for which the early cellular and molecular events remain poorly understood. Morphological attributes of cells and their substructures have been relatively understudied in ALS research. Transfer learning was used to leverage the power of imaging fluorescent data to enable unbiased, robust and efficient testing of biological hypotheses, and to resolve the extent of aberrant cellular morphological indices during earlier phases of ALS pathogenesis. This work, performed in collaboration with the Francis Crick Institute, opens new perspectives on the information that can be gleaned from image analysis.
Generalization and Personalization of Mobile Sensing-Based Mood Inference Models: An Analysis of College Students in Eight Countries
L. Meegahapola, [...], D. Gatica-Perez, PACM on Interactive, Mobile, Wearable, and Ubiquitous Technologies (IMWUT), Vol. 6, No. 4, Art. 176, pp. 1-32, Dec. 2022
A multidisciplinary approach with a large international team to quantify the performance of ML-based mood inference models from real-life smartphone sensor data, investigating country-specific, country-agnostic, and multi-country approaches. This work is a unique example of progress towards the development of diversity-aware algorithms for people in multiple world regions, involving hundreds of university student participants in eight countries in Europe, Latin America, and Asia (specifically China, Denmark, India, Italy, Mexico, Mongolia, Paraguay, and the UK). This work was an outcome of the H2020 project WeNet.
Adjustable Deterministic Pseudonymization of Speech
S. Pavankumar Dubagunta, R. J. J. H. van Son and M. Magimai-Doss, Computer Speech & Language, 2022
A novel speech pseudonymization (reversible anonymization) method in which fundamental frequency (voice source information), formant frequency (vocal tract information) and rate-of-speech are altered to obfuscate speaker identity. The approach was validated through an ABX listening experiment and by participating in the first Voice Privacy challenge, organized in 2020. The approach preserves paralinguistic information related to dysarthria. This work is an outcome of the HASLER Foundation project FLOSS and Innosuisse project CMM.
Active Learning by Feature Mixing
A. Parvaneh, E. Abbasnejad, D. Teney, G. R. Haffari, A. Van Den Hengel, and J. Q. Shi, In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 12227-12236, 2022.
A method to train deep learning models with humans in the loop. Current approaches to machine learning depend on large amounts of data that are costly or difficult to acquire. This paper presents an active learning approach where human experts interact with the learning algorithm to iteratively refine and resolve inconsistencies in a model by labeling a small set of training examples. This approach contributes to widening the accessibility of machine learning technologies to small organizations.
Image Retrieval on Real-life Images with Pre-trained Vision-and-language Models
Z. Liu, C. Rodriguez-Opazo, D. Teney, and S. Gould, In Proc. of Int. Conf. on Computer Vision (ICCV), pp. 2105-2114, 2021.
New methods and benchmarks for language-based interfaces to access large datasets of images. The availability of ever-larger amounts of visual data in commercial, scientific, and medical domains makes it challenging to search for or efficiently retrieve relevant content. This work goes beyond traditional keyword searches, allowing for multimodal queries, e.g., "a dog similar to this one but running on grass". Such interactions expand the accessibility of modern technologies to non-experts and continue Idiap's long-standing expertise in multimodal interfaces.
A Differential Approach for Gaze Estimation
G. Liu, Y. Yu, K. Funes and J.-M. Odobez, IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(3): 1092-1099, 2021
A novel approach for non-invasive gaze estimation based on a differential neural network that predicts gaze differences between two images of the same eye of a subject. The approach strongly reduces the negative effects of illumination and shadows, specular reflections, and morphological differences across humans, using only one calibration sample. The method also allows for neural network fine-tuning from subject-specific calibration images, further enhancing prediction consistency and accuracy. This work is an outcome of the Innosuisse REGENN project and the idea was patented by Eyeware SA, an Idiap spinoff company which continues to collaborate with Idiap through the ongoing Eurostar ePartner project.
Neural Network Adaptation and Data Augmentation for Multi-Speaker Direction-of-Arrival Estimation
W. He, P. Motlicek and J.-M. Odobez, IEEE/ACM Transactions on Audio, Speech and Language Processing, 29:1303-1317, 2021
The first viable deep learning framework (task definition, network architecture, training paradigm) for solving fundamental auditory tasks such as sound source localisation, speaker identification and speech/non-speech classification. The framework is suitable for highly noisy environments and overcomes limitations of previous methods, which heavily relied on idealized sound and environment models and are inadequate for everyday situations with multiple sound sources, background noise, short utterances, and lack of prior knowledge of the number of sound sources. The method learns sound source localization models with limited training resources leveraging simulated and weakly-labelled real audio data.
Recursive Non-Autoregressive Graph-to-Graph Transformer for Dependency Parsing with Iterative Refinement
A. Mohammadshahi and J. Henderson, Transactions of the Association for Computational Linguistics, 9:120–138, 2021
A novel deep learning architecture for modeling graphs, which extends the Transformer architecture with the input and output of graphs through the attention mechanism, and iteratively refines the output graph. This work exploits the insight that attention-based models like transformers are induced-graph models, not just sequence models. This work is an outcome of SNSF Sinergia grant Automated Interpretation of Political and Economic Policy Documents: Machine Learning using Semantic and Syntactic Information.
Locally Private Graph Neural Networks
S. Sajadmanesh and D. Gatica-Perez, in Proc. ACM Conf. on Computer and Communications Security (CCS), Seoul, Nov. 2021
An original solution for privacy-preserving, architecture-agnostic learning on graph neural networks with formal privacy guarantees based on local differential privacy. This approach can be used for applications with human-related variables that involve sensitive or personal information. This work is an outcome of the SNSF Sinergia project Dusk2Dawn (Characterizing Youth Nightlife Spaces, Activities, and Drinks). This research attracted attention from the ML & Privacy community, as shown by invited talks at Twitter (Jan 2021), U. Illinois (Oct. 2022) and Imperial-X (Mar. 2023).
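To give a concrete sense of what a local differential privacy guarantee means, the sketch below applies classic randomized response to a single private bit and then debiases the aggregate. This is a textbook mechanism shown for illustration only, not the mechanism used in the paper; the function names, the value of `epsilon` and the synthetic data are assumptions.

```python
import math
import random

def randomized_response(bit, epsilon, rng=random):
    """Report the true bit with probability e^eps / (e^eps + 1), else flip it.

    Each user perturbs their own bit before sharing it, which satisfies
    epsilon-local differential privacy for a binary value.
    """
    keep_prob = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < keep_prob else 1 - bit

def debiased_mean(reports, epsilon):
    """Unbiased estimate of the true fraction of ones from the noisy reports."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

# Hypothetical population: 30% of users hold the sensitive value 1.
rng = random.Random(0)
true_bits = [1] * 6000 + [0] * 14000
reports = [randomized_response(b, epsilon=1.0, rng=rng) for b in true_bits]
estimate = debiased_mean(reports, epsilon=1.0)
print(round(estimate, 2))
```

No individual report can be trusted, yet the aggregate estimate stays close to the true fraction, which is the property that makes learning from locally perturbed data feasible.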
A Bayesian approach to recurrence in neural networks
P.N. Garner and S. Tong, IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(8):2527-2537, August 2021
A principled way to handle recurrence in neural networks that substantially reduces the number of parameters. This rigorous approach starts from a Bayesian interpretation of the recurrent unit, and continues via a Bayesian interpretation of what the recurrence means. The formulation dictates a structure for the recurrence that has a physical interpretation and enables a backward pass in the same sense as that in a Kalman smoother; this in turn formalizes the concept of bidirectional recurrence. This work is an outcome of the SNSF project Neural Architectures for Speech Technology (NAST).
Deep learning architectures for estimating breathing signal and respiratory parameters from speech recordings
V. S. Nallanthighal, Z. Mostaani, A. Härma, H. Strik, and M. Magimai-Doss, Neural Networks, 2021
Speech and respiration are closely related, as breathing is a primary mechanism of speech production. In this paper, we conducted a comprehensive study to explore deep learning techniques for sensing the breathing signal and breathing parameters from speech, and addressed the challenges involved in establishing a practical purpose for this technology. We validated the developed approaches on two corpora through single- and cross-corpora studies. This paper is a result of a collaboration between Idiap and Philips Research Eindhoven partially funded by H2020 project TAPAS and SNSF project TIPS. This work has potential implications for the development of novel speech-based applications for healthcare, such as breathing monitoring in telehealth care applications, and assessment for early recognition of abnormal breathing syndromes.
Cross Modal Focal Loss for RGBD Face Anti-Spoofing
A. George and S. Marcel, IEEE Computer Vision and Pattern Recognition (CVPR), 2021
A new loss function for deep learning-based multi-spectral face presentation attack detection. The proposed method modulates the loss contribution of each channel as a function of the confidence of individual channels. This work is an outcome of the IARPA program ODIN “Biometrics Presentation Attack Detection” project “Biometric Authentication with Timeless Learner” (BATL) and gained visibility in the industry. Part of this work led to the grant Meta-Spoof funded by Meta to investigate ocular presentation attack detection in VR/AR devices.
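The idea of modulating one channel's loss by the confidence of another can be sketched as a variant of the standard focal loss, FL(p_t) = -(1 - p_t)^γ log(p_t). In the sketch below, the weighting function `w` (the other channel's confidence times the harmonic mean of both confidences) is an assumption made for illustration and should be checked against the paper before use; the function names are likewise hypothetical.

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Standard focal loss for one sample; p_t is the probability of the true class."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def cross_channel_focal_loss(p_t, q_t, gamma=2.0):
    """Focal-style loss for one channel, modulated by a second channel.

    p_t: this channel's probability for the true class (must be > 0).
    q_t: the other channel's probability for the true class.
    The weighting below is an assumed form chosen for illustration.
    """
    w = q_t * (2.0 * p_t * q_t) / (p_t + q_t)
    return -((1.0 - w) ** gamma) * math.log(p_t)

# When the other channel is confident, this channel's loss is down-weighted;
# when the other channel is unsure, the loss approaches plain cross-entropy.
print(cross_channel_focal_loss(0.6, 0.9) < cross_channel_focal_loss(0.6, 0.1))  # True
```

Under this weighting, a channel is pushed hardest on samples that the other channel cannot classify confidently, which encourages the channels to learn complementary information.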
Spatially-Variant CNN-Based Point Spread Function Estimation for Blind Deconvolution and Depth Estimation in Optical Microscopy
A. Shajkofci and M. Liebling, IEEE Transactions on Image Processing, 29, 5848–5861, 2020
Optical microscopy is a central tool for biomedical research and diagnostics, which calls for instrument performance to be continuously improved. We leverage a learning approach to locally determine image distortion parameters in the form of a parametric point spread function without requiring instrument- or object-specific calibration. This approach is robust to photon noise and is a key element for downstream applications, such as spatially variant deconvolution, depth localization, and flow estimation. This work is an outcome of the SNSF project Computational biomicroscopy: advanced image processing methods to quantify live biological systems (2018-2022).
Biometric Face Presentation Attack Detection with Multi-Channel Convolutional Neural Network
A. George, Z. Mostaani, D. Geissenbuhler, O. Nikisins, A. Anjos and S. Marcel, IEEE Transactions on Information Forensics and Security, 15:42-55, 2020
A deep learning-based multi-spectral fusion approach for visible spectrum, near infrared, short-wave infrared, thermal and depth data that enables robust face presentation attack detection using hardware and software components. This work is an outcome of the IARPA program ODIN “Biometrics Presentation Attack Detection” project “Biometric Authentication with Timeless Learner” (BATL). It enabled a follow-up InnoSuisse grant, “Contactless finger vein recognition and presentation attack detection on-the-fly”, and the expertise was transferred to the Swiss company Global ID. The University of Southern California re-used the proposed technologies for iris and fingerprint presentation attack detection and the work led to a joint patent application (USC0289PUSP1 patent pending).
Spectro-temporal Sparsity Characterization for Dysarthric Speech Detection
I. Kodrasi and H. Bourlard, IEEE/ACM Trans. on Audio, Speech, and Language Processing, 28, pp. 1210-1222, 2020.
Dysarthria, which can result from Parkinson's disease, stroke, or cerebral palsy, is a motor disorder that leads to imprecise articulation of speech sounds and words. This work demonstrates that the difference between healthy control speech and dysarthric speech can be characterized by estimation of spectro-temporal sparsity measures, and dysarthria can be effectively detected. The detection of dysarthria can help in clinical diagnosis and treatment.
End-to-End Acoustic Modeling using Convolutional Neural Networks for HMM-based Speech Recognition
D. Palaz, M. Magimai-Doss and R. Collobert, Speech Communication, 2019
This work challenged conventional wisdom in the speech community dating back to the speech coding era, such as extracting information by analysing 20-30 ms of speech (the window size, in signal processing terms). This work also provided a signal-theoretic interpretation of the information learned by the neural network between the first and second convolution layers (providing a link to sparse coding). This paper is the winner of the 2023 EURASIP Best Paper Award for the Speech Communication journal. This paper is an outcome of HASLER Foundation project DeepSTD.
Heterogeneous face recognition using domain specific units
T. de Freitas Pereira, A. Anjos, and S. Marcel, IEEE Transactions on Information Forensics and Security, 14(7): 1803-1816, July 2019
An approach for transfer learning across domains and spectra (colour, near infrared, drawing or sketches). This work hypothesised that low-level layers from a Deep Convolutional Neural Network are domain specific as opposed to high-level layers (deep layers) that are domain independent. We demonstrated that a pre-trained model for a given domain (e.g. colour) can be adapted to another domain (e.g. near infrared or thermal) by only fine-tuning a few parameters in the low-level layers, or domain specific units. This work is an outcome of the SNSF project Heterogeneous Face Recognition (HFACE). This work led to the InnoSuisse HARDENING project (and the subsequent technology transfer to a Swiss company) and to face recognition at a distance for a project funded by the IARPA BRIAR program.
HeadFusion: 360-degree Head Pose tracking combining 3D Morphable Model and 3D Reconstruction
Y. Yu, K. Funes and J.-M. Odobez, IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(11): 2653-2667, Nov. 2018
An accurate and robust view-independent head tracker that follows natural head dynamics with fast accelerations and uses symmetry regularization to handle faces seen predominantly from one side. The core idea is to combine the accuracy of semantic morphable models automatically fitted to individual faces (which are not used to model the full head, given the difficulty of building statistical models of heads) with the view-independent tracking robustness of a full head shape model automatically reconstructed during tracking, together with visual tracking to address natural head dynamics. This work is an outcome of the SNSF UBImpressed Sinergia project.