# Swiss Machine Learning Day 2019

The Swiss Machine Learning Day is a one-day workshop organized every year since 2012, which aims at bringing together Swiss researchers working on topics related to machine learning.

This 8th instance of the workshop took place at the SwissTech Convention Center, on Wednesday November 13th, 2019.

We awarded this year three prizes sponsored by Swisscom:

Organized by François Fleuret and Martin Jaggi.

# Abstracts

## Full-Gradient Representation for Neural Network Visualization

Suraj Srinivas (Idiap Research Institute), Francois Fleuret (Idiap Research Institute) [slides]

We introduce a new tool for interpreting neural net responses, namely full-gradients, which decomposes the neural net response into input sensitivity and per-neuron sensitivity components. This is the first proposed representation which satisfies two key properties: completeness and weak dependence, which provably cannot be satisfied by any saliency map-based interpretability method. For convolutional nets, we also propose an approximate saliency map representation, called FullGrad, obtained by aggregating the full-gradient components. We experimentally evaluate the usefulness of FullGrad in explaining model behaviour with two quantitative tests: pixel perturbation and remove-and-retrain. Our experiments reveal that our method explains model behaviour correctly, and more comprehensively than other methods in the literature. Visual inspection also reveals that our saliency maps are sharper and more tightly confined to object regions than other methods.

## Polynomial Escape-Time from Saddle Points in Distributed Non-Convex Optimization

Stefan Vlaski (EPFL), Ali H. Sayed (EPFL)

The diffusion strategy for distributed learning from streaming data employs local stochastic gradient updates along with exchange of iterates over neighborhoods. In this work we establish that agents cluster around a network centroid in the mean-fourth sense and proceeded to study the dynamics of this point. We establish expected descent in non-convex environments in the large-gradient regime and introduce a short-term model to examine the dynamics over finite-time horizons. Using this model, we establish that the diffusion strategy is able to escape from strict saddle-points in O(1/μ) iterations, where μ denotes the step-size; it is also able to return approximately second-order stationary points in a polynomial number of iterations. Relative to prior works on the polynomial escape from saddle-points, most of which focus on centralized perturbed or stochastic gradient descent, our approach requires less restrictive conditions on the gradient noise process.

## Subspace Networks for Few-shot Classification

Arnout Devos (EPFL), Matthias Grossglauser (EPFL)

We propose subspace networks for the problem of few-shot classification, where a classifier must generalize to new classes not seen in the training set, given only a small number of examples of each class. Subspace networks learn an embedding space in which classification can be performed by computing distances of embedded points to vector subspace representations of each class. The class subspaces are spanned by examples belonging to the same class, transformed by a learnable embedding function. Similarly to recent approaches for few-shot learning, subspace networks reflect a simple inductive bias that is beneficial in this limited-data regime and they achieve excellent results. In particular, our proposed method shows consistently better performance than other state-of-the-art few-shot distance-metric learning methods when the embedding function is deep or when training and testing domains are shifted.

## Towards Integration of Statistical Hypothesis Tests into Deep Neural Networks

Ahmad Aghaebrahimian (Zurich University of Applied Sciences), Mark Cieliebak (Zurich University of Applied Sciences)

We report our ongoing work about a new deep architecture working in tandem with a statistical test procedure for jointly training texts and their label descriptions for multi-label and multi-class classification tasks. A statistical hypothesis testing method is used to extract the most informative words for each given class. These words are used as a class description for more label-aware text classification. Intuition is to help the model to concentrate on more informative words rather than more frequent ones. The model leverages the use of label descriptions in addition to the input text to enhance text classification performance. Our method is entirely data-driven, has no dependency on other sources of information than the training data, and is adaptable to different classification problems by providing appropriate training data without major hyper-parameter tuning. We trained and tested our system on several publicly available datasets, where we managed to improve the state-of-the-art on one set with a high margin and to obtain competitive results on all other ones.

## On the Relationship between Self-Attention and Convolutional Layers

Jean-Baptiste Cordonnier (EPFL), Andreas Loukas (EPFL), Martin Jaggi (EPFL) [slides]

Recent trends of incorporating attention mechanisms in vision have led researchers to reconsider the supremacy of convolutional layers as a primary building block. Beyond helping CNNs to handle long-range dependencies, Ramachandran et al. (2019) showed that attention can completely replace convolution and achieve state-of-the-art performance on vision tasks. This raises the question: do learned attention layers operate similarly to convolutional layers? This work provides evidence that attention layers can perform convolution and, indeed, they often learn to do so in practice. Specifically, we prove that a multi-head self-attention layer with sufficient number of heads is at least as powerful as any convolutional layer. Our numerical experiments then show that the phenomenon also occurs in practice, corroborating our analysis.

## Multimodal Generative Learning Utilizing Jensen-Shannon-Divergence

Thomas Sutter (ETH Zurich), Imant Daunhawer (ETH Zurich), Julia Vogt (ETH Zurich)

Learning from different data types is a long standing goal in machine learning research, as multiple information sources co-occur when describing natural phenomena. Existing generative models that try to approximate a multimodal ELBO rely on difficult training schemes to handle the intermodality dependencies, as well as the approximation of the joint representation in case of missing data. In this work, we propose an ELBO for multimodal data which learns the unimodal and joint multimodal posterior approximation functions directly via a dynamic prior. We show that this ELBO is directly derived from a variational inference setting for multiple data types, resulting in a divergence term which is the Jensen-Shannon divergence for multiple distributions. We compare the proposed multimodal JS-divergence (mmJSD) model to state-of-the-art methods and show promising results using our model in unsupervised, generative learning using a multimodal VAE on two different datasets.

## Improving Multimodal Generative Models with Disentangled Latent Partitions

Imant Daunhawer (ETH Zurich), Thomas Sutter (ETH Zurich), Julia E. Vogt (ETH Zurich)

Multimodal generative models learn a joint distribution of data from different modalities---a task which arguably benefits from the disentanglement of modality-specific and modality-invariant information. We propose a factorized latent variable model that learns named disentanglement on multimodal data without additional supervision. We demonstrate the disentanglement capabilities on simulated data, and show that disentangled representations can improve the conditional generation of missing modalities without sacrificing unconditional generation.

## Large Scale Graph Learning From Smooth Signals

Vassilis Kalofolias (LeanBI (ex LTS2)), Nathanael Perraudin (Swiss Data Science Center) [slides]

Graphs are a prevalent tool in data science, as they model the inherent structure of the data. Typically they are constructed either by connecting nearest samples, or by learning them from data, solving an optimization problem. While graph learning does achieve a better quality, it also comes with a higher computational cost. In particular, the current state-of-the-art model cost is O(n^2) for n samples. In this paper, we show how to scale it, obtaining an approximation with leading cost of O(n log(n)), with quality that approaches the exact graph learning model. Our algorithm uses known approximate nearest neighbor techniques to reduce the number of variables, and automatically selects the correct parameters of the model, requiring a single intuitive input: the desired edge density.

## Team Policy Learning for Multi-Agent Reinforcement Learning

Lucas Cassano (EPFL), Sulaiman Alghunaim (UCLA), Ali H. Sayed (EPFL)

This work presents a fully distributed algorithm for learning the optimal policy in a multi-agent cooperative reinforcement learning scenario. We focus on games that can only be solved through coordinated team work. We consider situations in which K players interact simultaneously with an environment and with each other to attain a common goal. In the algorithm, agents only communicate with other agents in their immediate neighborhood and choose their actions independently of one another based only on local information. Learning is done off-policy, which results in high data efficiency. The proposed algorithm is of the stochastic primal-dual kind and can be shown to converge even when used in conjunction with a wide class of function approximators.

## Improving Few-Shot User-Specific Gaze Adaptation via Gaze Redirection Synthesis

Yu Yu (Idiap and EPFL), Gang Liu (Idiap), Jean-Marc Odobez (Idiap and EPFL)

As an indicator of human attention gaze is a subtle behavioral cue which can be exploited in many applications. However, inferring 3D gaze direction is challenging even for deep neural networks given the lack of large amount of data (groundtruthing gaze is expensive and existing datasets use different setups) and the inherent presence of gaze biases due to person-specific difference. In this work, we address the problem of person-specific gaze model adaptation from only a few reference training samples. The main and novel idea is to improve gaze adaptation by generating additional training samples through the synthesis of gaze-redirected eye images from existing reference samples. In doing so, our contributions are threefold: (i) we design our gaze redirection framework from synthetic data, allowing us to benefit from aligned training sample pairs to predict accurate inverse mapping fields; (ii) we proposed a self-supervised approach for domain adaptation; (iii) we exploit the gaze redirection to improve the performance of person-specific gaze estimation. Extensive experiments on two public datasets demonstrate the validity of our gaze retargeting and gaze estimation framework.

## Slanted Frames: Predicting Partisanship from Video Data

Elliott Ash (ETH Zurich), Dominik Borer (ETH ZUrich)

We use machine learning to predict partisanship based on image data. Using a dataset of 17,000 images from televised political ads, we train a convolutional neural network to predict the associated partisan label (whether the associated candidate is Democrat or Republican). In the best model we achieved a prediction accuracy of 77% (F1=.79) on the held-out test dataset. In an empirical application, we show that video frames from a cable news channel with a conservative reputation (Fox News) tend to be predicted as Republican, while those from the more liberal networks (CNN, MSNBC) tend to be predicted as Democrat.

## Event-based regression with spiking neural networks

Mathias Gehrig (Robotics and Perception Group, University of Zurich and ETH Zurich), Sumit Bam Shrestha (Temasek Laboratories, National University of Singapore), Daniel Mouritzen (Robotics and Perception Group, University of Zurich and ETH Zurich), Davide Scaramuzza (Robotics and Perception Group, University of Zurich and ETH Zurich)

Spiking Neural Networks (SNNs) are bio-inspired networks that process information conveyed as temporal spikes rather than numeric values. An example of a sensor providing such data is the event-camera. It only produces an event when a pixel reports a significant brightness change. Similarly, the spiking neuron of an SNN only produces a spike whenever a significant number of spikes occur within a short period of time. The vast amount of work concerned with supervised learning for spiking neural networks is concerned with classification tasks. However, spiking neural networks are dynamical systems that are naturally suited for prediction in the temporal domain. As a step in this direction, we propose the problem of regressing the angular velocity of an event camera at high speeds and provide first results that show that feedforward spiking neural networks can perform well in this challenging setting.

## $\rho$-VAE: Autoregressive parametrization of the VAE encoder

Sohrab Ferdowsi (University of Geneva), Maurits Diephuis (University of Geneva), Shideh Rezaeifar (University of Geneva), Slava Voloshynovskiy (University of Geneva)

We make a minimal, but very effective alteration to the VAE model. This is about a drop-in replacement for the (sample-dependent) approximate posterior to change it from the standard white Gaussian with diagonal covariance to the first-order autoregressive Gaussian. We argue that this is a more reasonable choice to adopt for natural signals like images, as it does not force the existing correlation in the data to disappear in the posterior. Moreover, it allows more freedom for the approximate posterior to match the true posterior. This allows for the repararametrization trick, as well as the KL-divergence term to still have closed-form expressions, obviating the need for its sample-based estimation. Although providing more freedom to adapt to correlated distributions, our parametrization has even less number of parameters than the diagonal covariance, as it requires only two scalars, ρ and s, to characterize correlation and scaling, respectively. As validated by the experiments, our proposition noticeably and consistently improves the quality of image generation in a plug-and-play manner, needing no further parameter tuning, and across all setups. The code to reproduce our experiments is available at [annonimized].

## Understanding Raw Waveform based CNN through Low-rank Spectro-Temporal Decoupling

Vinayak Abrol (Mathematical Institute, University of Oxford), S. Pavankumar Dubagunta (Idiap Research Institute and Ecole polytechnique federale de Lausanne (EPFL)), Mathew Magimai Doss (Idiap Research Institute)

Acoustic modeling using convolutional neural networks with raw waveform input has become popular for many speech processing tasks. In the series of convolutional layers, the first layer acts as a frequency selective filter-bank, and the subsequent ones involve 2D spectro-temporal filters. Since each filter focuses on a different spectro-temporal pattern, there is an inherent redundancy in the learned filters. In this work we exploit this redundancy and propose an alternative approach to train robust networks for speech applications. This is achieved by spectro-temporal decoupling, where each 2D filter is approximated by a linear combination of a low-rank set of spectral/temporal basis filters. This filtering operation can be seen as filtering temporal spectral energy trajectories followed by their combination across spectral bands or vice versa. With a case study on speech recognition, we show that the proposed scheme is robust, efficient and achieves comparable/better baseline accuracy.

## Efficient High Dimensional Bayesian Optimization with Additivity and Quadrature Fourier Features

Mojmir Mutny (ETH Zurich), Andreas Krause (ETH Zurich)

We develop an efficient and provably no-regret Bayesian optimization (BO) algo- rithm for optimization of black-box functions in high dimensions. We assume a generalized additive model with possibly overlapping variable groups. When the groups do not overlap, we are able to provide the first provably no-regret polyno- mial time (in the number of evaluations of the acquisition function) algorithm for solving high dimensional BO. To make the optimization efficient and feasible, we introduce a novel deterministic Fourier Features approximation based on numeri- cal integration with detailed analysis for the squared exponential kernel. The error of this approximation decreases exponentially with the number of features, and al- lows for a precise approximation of both posterior mean and variance. In addition, the kernel matrix inversion improves in its complexity from cubic to essentially linear in the number of data points measured in basic arithmetic operations.

## Topological Autoencoders

Michael Moor (ETH Zurich), Max Horn (ETH Zurich), Bastian Rieck (ETH Zurich), Karsten Borgwardt (ETH Zurich) [slides]

We propose a novel approach for preserving topological structures of the input space in latent representations of autoencoders. Using persistent homology, a technique from topological data analysis, we calculate topological signatures of both the input and latent space to derive a topological loss term. Under weak theoretical assumptions, we can construct this loss in a differentiable manner, such that the encoding learns to retain multi-scale connectivity information. We show that our approach is theoretically well-founded, while exhibiting favourable latent representations on synthetic manifold data sets. Moreover, on real-world data sets, introducing our topological loss leads to more meaningful latent representations while preserving low reconstruction errors.

## Aligning Multilingual Word Embeddings for Cross-Modal Retrieval Task

Alireza Mohammadshahi (IDIAP/EPFL), Rémi Lebret (EPFL), Karl Aberer (EPFL)

In this paper, we propose a new approach to learn multimodal multilingual embeddings for matching images and their relevant captions in two languages. We combine two existing objective functions to make images and captions close in a joint embedding space while adapting the alignment of word embeddings between existing languages in our model. We show that our approach enables better generalization, achieving state-of-the-art performance in text-to-image and image-to-text retrieval task, and caption-caption similarity task. Two multimodal multilingual datasets are used for evaluation: Multi30k with German and English captions and Microsoft-COCO with English and Japanese captions.

## A t-distribution based operator for enhancing out of distribution robustness of neural network classifiers

Niccolò Antonello (Idiap Research Institute), Philip N. Garner (Idiap Research Institute)

Neural Network (NN) classifiers can assign high probabilities to samples that have not appeared during training (out of distribution samples) resulting in erroneous and unreliable predictions. One of the causes for this unwanted behaviour lies in the use of the standard softmax operator which pushes the posterior probabilities to be either zero or unity hence failing to model uncertainty. The statistical derivation of the softmax operator relies on the assumption that the distributions of the latent variables for a given class are Gaussian. However, it is possible to use different assumptions in the same derivation and attain from other families of distributions as well. This allows derivation of novel operators with more favourable properties. Here, a novel operator is proposed that is derived using t-distributions which are capable of providing a better description of uncertainty. It is shown that classifiers that adopt this novel operator are more robust to out of distribution samples, outperforming NNs that use the standard softmax operator. The resulting operator improves the area under the ROC curve from 0.82 to 0.96 when using MNIST as out of distribution for a Fashion NIST classifier. Ongoing work in a speech recognition scenario will be reported. These enhancements can be reached with minimal changes to the NN architecture and to the training procedure.

## A mixture of Tensor-normal Distribution for Imitation Learning in Robotics

Suhan Shetty (Idiap research institute), João Silvério (Idiap research institute), Sylvain Calinon (Idiap research institute)

Robot learning from demonstration is a challenging field with the number of available demonstrations typically limited. This necessitates learning techniques to exploit the structure of the data to be sample efficient. Recently, tensor data representation and associated learning techniques have gained popularity due to its interpretability and capability to exploit the multi-dimensional array structure of data, which is often encountered in practice in science and engineering. In robotics, we have identified several areas in imitation learning such as bimanual coordination and task-parameterized movement learning in which data from demonstrations can be naturally represented as a 3-way array, allowing us to use the power of tensor methods. We developed a mixture of tensor-normal distributions for modeling and inference of robot behaviors, where we exploit the multilinear regression and the conditioning properties of tensor-variate normal distributions. The resulting technique involves very few parameters and it is easy to interpret. We showcase this technique with real-world data taken from robotic experiments.

## Improving Variational Autoencoders Using Conditional Prior

Frantzeska LAVDA (University of Geneva & Geneva School of Business Administration, HES-SO), Magda GREGOROVA (University of Geneva & Geneva School of Business Administration, HES-SO), Alexandros KALOUSIS (University of Geneva & Geneva School of Business Administration, HES-SO)

We propose to learn a conditional prior in the context of variational autoencoders (VAEs) to avoid the over-regularisation resulting by a standard Gaussian prior distribution. Our work is motivated by recent analyses of the VAE objective, which pointed out not only that a simple prior can lead to underfitting but also the importance of the prior in the density estimation. In the proposed model we also learn a categorical variable additionally to the continuous latent variable of VAEs. Instead of using/ learning a complex prior for the continuous latent variable, the existence of the categorical latent variable makes it feasible to learn a continuous prior over each category. Moreover, the categorical variable along with the input are used to infer the continuous one, introducing structural inductive bias to guide the learning.

## Learning anisotropic filters on product graphs

Clément Vignac (LTS4, EPFL), Pascal Frossard (LTS4, EPFL)

The extension of convolutional neural networks to irregular domains has paved the way to promising graph data analysis methods. It has however come at the expense of a reduced representation power, as most of these new network architectures can only learn isotropic filters and therefore often underfit the training data. In this work, we propose a method for building anisotropic filters when learning representations of signals on a cartesian product graph. Instead of learning directly on the product graph, we factorize it and learn different filters for each factor, which is beneficial both in terms of computational cost and expressivity of the filters. We show experimentally that anisotropic Laplacian polynomials indeed outperform their isotropic counterpart on image classification and matrix completion tasks.

## Provably Robust Boosted Decision Stumps and Trees against Adversarial Attacks

Maksym Andriushchenko (EPFL (work done at the University of Tübingen)), Matthias Hein (University of Tübingen) [slides]

The problem of adversarial samples has been studied extensively for neural networks. However, for boosting, in particular boosted decision trees and decision stumps there are almost no results, even though boosted decision trees, as e.g. XGBoost, are quite popular due to their interpretability and good prediction performance. We show in this paper that for boosted decision stumps the exact min-max robust loss and test error for an $l_\infty$-attack can be computed in $O(T\log T)$ time per data point, where $T$ is the number of decision stumps. We also show that the exact robust loss is convex and that we can efficiently minimize it. While not exact, we show how to calculate and optimize an upper bound on the robust loss for boosted trees. Our proposed provably robust boosted trees lead to certified test errors which are competitive to provably robust convolutional networks. To the best of our knowledge, these are the first algorithms directly optimizing provable robustness guarantees in the area of boosting. We make the code of all our experiments publicly available at \url{http://github.com/max-andr/provably-robust-boosting}.

## Learning Entailment-Based Sentence Embeddings from Natural Language Inference

Rabeeh Karimi Mahabadi (Idiap Research Institute), Florian Mai (Idiap Research Institute), James Henderson (Idiap Research Institute) [slides]

Large datasets on natural language inference are a potentially valuable resource for inducing semantic representations of natural language sentences. But in most such models the semantic embeddings computed by the sentence encoder go through a trained MLP before predicting the entailment and contradiction labels, so some of the information in the data is encoded the parameters of this MLP, and the interpretation of the sentence embeddings in terms of entailment and contradiction are obscured by the MLP. In this work we propose a simple classifier based on predefined entailment and contradiction scores applied directly to the sentence embeddings. This parameter-free interaction model achieves results on natural language inference competitive with MLP-based models, demonstrating that the trained sentence embeddings directly represent the information needed for textual entailment, and the inductive bias of this model leads to better generalisation to other related datasets.

## Simple but effective techniques to reduce dataset biases

Rabeeh Karimi Mahabadi (idiap, EPFL), James Henderson (Idiap) [slides]

There have been several studies recently showing that strong natural language understanding (NLU) models are prone to relying on unwanted dataset biases without learning the underlying task, resulting in models which fail to generalize to out-of-domain datasets, and are likely to perform poorly in real-world scenarios. We propose several learning strategies to train neural models which are more ro- bust to such biases and transfer better to out-of-domain datasets. We introduce an additional lightweight bias-only model which learns dataset biases and uses its prediction to adjust the loss of the base model to reduce the biases. In other words, our methods down-weight the importance of the biased examples, and focus training on hard examples, i.e. examples that cannot be correctly classified by only relying on biases. Our approaches are model agnostic and simple to implement. We experiment on large-scale natural language inference and fact verification datasets and their out-of-domain datasets and show that our debiased models significantly improve the robustness in all settings, including gaining 9.76 points on the FEVER symmetric evaluation dataset, 5.45 on the HANS dataset and 4.78 points on the SNLI hard set. These datasets are specifically designed to assess the robustness of models in the out-of-domain setting where typical biases in the training data do not exist in the evaluation set.

## Entity Linking via Low-rank Subspaces

Akhil Arora (EPFL), Alberto Garcia-Duran (EPFL), Robert West (EPFL) [slides]

The phenomenal growth of unstructured data on the web has contributed significantly to the development of automatic methods for natural language understanding and processing, consequently rendering entity linking to be a problem of paramount importance. While a plethora of entity linking techniques proposed over the last decade have definitely enriched the field, the majority of these methods have been designed for settings where annotated data is available. Unfortunately, high quality annotated data is usually unavailable for specialized domains, such as medical, scientific, legal, and enterprise-specific corpora. In this paper, we propose a scalable and fully unsupervised method for entity linking that relies only on the availability of mention surface strings and a referent knowledge base. Our method provides an effective representation of global coherence information in a document, thereby providing a principled approach to perform collective disambiguation. Specifically, the set of candidate entities (obtained using string matching) corresponding to all the mention surface strings in a document is represented by a low-rank subspace spanned by the candidate entity embeddings instead of a vector space, which is typically computed as the average of candidate entity embeddings. The similarity score between a mention-candidate pair is then computed using a vector-subspace similarity function. Experiments on the CoNLL and Wikipedia web-tables benchmark datasets showcase the superiority of our approach.

## Unbiased Normal Approximations for Euler-Discretized Diffusions

Hadi Daneshmand (Vector Institute and ETHZ), Murat A. Erdogdu (Vector Institute and University of Toronto), Krishna Balasubramanian (UC Davis)

Euler discretization of Ito diffusions with a decreasing step-size sequence is crucial for simulating a nearly unbiased estimate of the target invariant measure. It is known that average of the iterates of this algorithm is consistent, and satisfies a central limit theorem (CLT) asymptotically. In this paper, we establish a non-asymptotic convergence rate for this CLT in 1-Wasserstein distance. More specifically, we show that when the step-size sequence satisfies $\Theta(n^{-1/2})$, then the convergence rate to a Gaussian measure is given by $\bigo(\ln(n) n^{-1/4})$. Our results rely on a non-asymptotic Martingale CLT which we prove using Stein's method and Skorohod's embedding.