# Machine Learning

Facilitating research collaborations between divisions and between the department of Information technology and external parts.

# Machine Learning Seminar Series

### October 2017

**Speaker:** Phillip Henning (Max Planck Institute for Intelligent Systems, Tübingen, Germany)

**Date:** October 17th @ 13.15

**Room:** 2446

**Title:** Probabilistic Numerics — Uncertainty in Computation

**Abstract:**

The computational complexity of inference from data is dominated by the solution of non-analytic numerical problems (large-scale linear algebra, optimization, integration, the solution of differential equations). But a converse of sorts is also true — numerical algorithms for these tasks are inference engines! They estimate intractable, latent quantities by collecting the observable result of tractable computations. Because they also decide adaptively which computations to perform, these methods can be interpreted as autonomous inference agents. This observation lies at the heart of the emerging topic of Probabilistic Numerical Computation, which applies the concepts of probabilistic (Bayesian) inference to the design of algorithms, assigning a notion of probabilistic uncertainty to the result even of deterministic computations. I will outline how this viewpoint is connected to that of classic numerical analysis, and show that thinking about computation as inference affords novel, practical answers to the challenges of large-scale, big data, inference.

## Past Seminars

#### May 2017

**Speaker:** Simon Bartels (Max Planck Institute for Intelligent Systems, Tübingen, Germany)

**Date:** 4th May 13.15-14.00

**Room:** ITC 2344

**Title:** Probabilistic Approximate Least-Squares

**Abstract:** Algorithms for approximate least-squares tend not to return error bars on the elements of the inverse matrix itself (but only residuals on the linear problem being solved). Since least-squares problems are fundamental across virtually all areas of quantitative science, such uncertainty estimates have many interesting use cases, particularly if they can be provided at small or negligible cost overhead.

Leveraging recent results that cast elementary linear algebra operations as probabilistic inference, I will present a framework that uses varying prior regularity assumptions over the structure of the matrix to fit a Gaussian probability measure around the point estimates constructed by classic methods. The covariance of this measure then yields a notion of uncertainty, which can be used, for example, to control the computational effort invested in an approximation.

#### October 2016

Speaker: Daniel Gillblad (Swedish Institute of Computer Science)

Date: 8th October

Topic: Discussion on Learning Machines Center at SICS

Abstract: Machine Learning, especially on large data sets and in autonomous applications, is becoming a focal point of research at SICS, Swedish Institute of Computer Science. We will present Learning Machines, the center we are currently setting up to provide focused research within large-scale machine learning and its applications. We will show some recent results within e.g. graph clustering, anomaly detection, and distributed systems, and present the research directions and challenges that we believe are critical for promoting real-world applications of the techniques.

Speaker: Marco Loog (Delft University of Technology, The Netherlands)

Date: 21st October

Topic: A Proper Framework for Semi-Supervised Likelihood Estimation

Abstract: Semi-supervised learning is concerned with learning from both labeled and unlabeled data. I propose a generally applicable way to perform semi-supervised parameter estimation for likelihood-based classifiers. I will explain in what sense the approach provides a proper solution to this long-standing open problem. Among others, I will demonstrate that the estimates obtained are never worse than the supervised solution in terms of the log-likelihood on the training data. If time permits, I will take it one step further and sketch the proof for linear discriminant analysis [LDA], showing that semi-supervised learning is strictly better than its supervised counterpart. It demonstrates how and in what way the semi-supervised LDA proposed is essentially closer to the optimal solution that could be obtained would all data in the training set have been labeled.

#### May 2014

Speaker: Jose Pena

Date: 28th May

Topic: Beyond Bayesian networks

Abstract: Graphical models are a formalism to represent sets of independencies, also known as independence models, via missing edges in graphs whose nodes are the random variables of interest. The graphical models are divided into families depending on whether the edges are directed, undirected, and/or bidirected. Undoubtedly, the most popular family is that consisting of directed and acyclic graphs (DAGs), also known as Bayesian networks.

Chain graphs (CGs) are graphs with possibly directed and undirected edges, and no semidirected cycle. Then, CGs extend DAGs and, thus, they can represent at least as many independence models as DAGs. However, unlike DAGs whose interpretation is unique, there are three interpretations of CGs as independence models, and no interpretation subsumes any other.

In this talk, I will introduce CGs under the different interpretations. I will also compare their expressivity under the different interpretations, and with respect to DAGs. Specifically, I will show that CGs are much more expressive than DAGs but that this comes at a cost, namely they are more difficult to work with. This talk summarizes my work on making CGs usable in practice.

#### April 2014

Speaker: Alexandre Proutiere

Date: 23rd April

Topic: Community Detection via Random and Adaptive Sampling

Abstract: In this talk, we consider networks consisting of a finite number of non-overlapping communities. To extract these communities, the interaction between pairs of nodes may be sampled from a large available data set, which allows a given node pair to be sampled several times. When a node pair is sampled, the observed outcome is a binary random variable, equal to 1 if nodes interact and to 0 otherwise. The outcome is more likely to be positive if nodes belong to the same communities. For a given budget of node pair samples or observations, we wish to jointly design a sampling strategy (the sequence of sampled node pairs) and a clustering algorithm that recover the hidden communities with the highest possible accuracy. We consider both non-adaptive and adaptive sampling strategies, and for both classes of strategies, we derive fundamental performance limits satisfied by any sampling and clustering algorithm. In particular, we provide necessary conditions for the existence of algorithms recovering the communities accurately as the network size grows large. We also devise simple algorithms that accurately reconstruct the communities when this is at all possible, hence proving that the proposed necessary conditions for accurate community detection are also sufficient. The classical problem of community detection in the stochastic block model can be seen as a particular instance of the problems consider here. But our framework covers more general scenarios where the sequence of sampled node pairs can be designed in an adaptive manner. We provide new results for the stochastic block model, and extends the analysis to the case of adaptive sampling.

#### March 2014

Speaker: Joakim Nivre

Date: 19th March

Topic: Machine Learning for Natural Language Parsing

Abstract: One of the core tasks in natural language processing is syntactic parsing, that is, the process of analyzing the syntactic structure of sentences in order to figure out "who does what to whom". For example, an information extraction system analyzing the sentence "Microsoft buys Nokia" needs to figure out which company is buying and which company is being bought. Present-day parsing technology makes heavy use of machine learning applied to large databases of linguistically annotated sentences, so-called treebanks, in order to learn a mapping from sentences to their analyses. In this mapping, both the input space and the output space are infinite sets of structured objects, such as strings and trees, which means that we need models for structured prediction, as opposed to standard classification or regression. In this talk, I will present a particular approach to syntactic parsing known as transition-based dependency parsing, which factors the parsing problem by the structure of derivations in a transition system, and where the learning problem is to find the optimal derivation for a given sentence. In this setup, I will present three different learning schemes that provide different trade-offs between accuracy and efficiency: imitation learning with a static oracle, structured perceptron learning using beam search, and imitation learning using dynamic exploration.

#### Feburary 2014

Speaker: Simo Särkkä (Aalto University, Finland)

Date: 12th February

Topic: State-Space Representation of Gaussian Processes for Regression and Efficient Inference in Latent Force Models

Abstract: Gaussian process regression is a non-parametric Bayesian machine learning paradigm, where instead of estimating parameters of fixed-form functions, we model the whole unknown functions as Gaussian processes. Gaussian processes are also commonly used for representing uncertainties in models of dynamic systems in many applications such as tracking, navigation, and automatic control systems. The latter models are often formulated as state-space models, where the use of non-linear Kalman filter type of methods is common. The aim of this talk is to discuss connections of Kalman filtering methods and Gaussian process regression. In particular, I discuss representations of Gaussian processes as state-space models, which enable the use of computationally efficient Kalman-filter-based (or more general Bayesian-filter-based) solutions to Gaussian process regression problems. This also allows for computationally efficient inference in latent force models (LFM), which are models combining first-principles mechanical models with non-parametric Gaussian process regression models.

#### January 2014

Speaker: Jan Komorowski

Date: 15th

Topic: Machine Learning in Bioinformatics - From Data to Interpretations

Abstract: As a Computer Scientist, I am often reminded by my Biology colleagues how complex and intricate biological relations are. And they often wish that I come with an illumination of which features they measured were the most important ones, and best, in what combinations and values. Imagine how astonished I must be when the next request is

"give me the most significantly differentially expressed gene or protein". It is possible that our end, ie. bioinformaticians, should be in part blamed for this incongruence.

In this talk I shall show that we can satisfy the Biologist. To this end I shall briefly introduce the formalism of rough sets founded in Boolean reasoning that serves rule-based

model construction, and in case where NP-hardness stands in the way, Monte Carlo feature selection will be applied. Our methodology provides ranking of significant

features, and makes the structure of the model available to biological interpretation. Unexpectedly, even very poor classifiers may be of value.

The methodology will be illustrated with examples from proteomics, functional genomics and epigenetics. Biological knowledge necessary to grasp the essence of the problems will

be provided.

#### December 2013

Speaker: Josef Höök

Topic: Parametric inference for observed diffusions on the financial markets

Date: Wednesday 18th

Abstract: Parameterized diffusion models in continuous time can be described by stochastic differential equations with coefficients that depend on a set of unknown parameters and are used very frequently in mathematical finance. Portfolio optimization and decision support for making investment and save positions are two examples of topics in finance where diffusion models are used. For these models to be useful in practice it is necessary to tune the parameters as accurately as possible against observed data using some sort of parameter estimation method. In this talk we will go through two classes of parameter estimation techniques, the approximate maximum likelihood method and the martingale estimating functions method for general diffusion models and discuss the computational aspects of them. In the approximate maximum likelihood method, the likelihood can be obtained from computer intensive simulations by either solving a partial differential equation with numerical techniques like finite differences, radial basis function approximations or through Monte Carlo simulation of a corresponding stochastic differential equation. The martingale estimating functions method can be seen as a generalization of the likelihood method and where the idea is to formulate test conditions that become zero for the conditional probability densities that best describes the observed data. Traditionally this method has been applied to models where analytical expressions for these conditions are known. The talk will cover the applicability of the respective methods and we will also present new results that improve the computational efficiency for simulated martingale estimating functions.

#### November 2013

Speaker: Marc Deisenroth (Imperial College London, UK)

Topic: Data-Efficient Autonomous Learning in Robotics and Control

Date: Wednesday November 27th

Abstract: Autonomous learning has been a promising direction in control and robotics for more than a decade since learning models and controllers from data allows us to reduce the amount of engineering knowledge that is otherwise required. Due to their flexibility, autonomous reinforcement learning (RL) approaches typically require many interactions with the system to learn controllers. However, in real systems, such as robots, many interactions can be impractical and time consuming. To address this problem, current learning approaches

typically require task-specific knowledge in form of expert demonstrations, pre-shaped policies, or specific knowledge about the underlying dynamics. In the first part of the talk, we follow a different approach and speed up learning by efficiently extracting valuable information from sparse data. In particular, we learn a probabilistic, non-parametric Gaussian

process dynamics model. By explicitly incorporating model uncertainty into long-term planning and controller learning our approach reduces the effects of model errors, a key problem in model-based learning. Compared to state-of-the-art RL our model-based policy search method achieves an unprecedented speed of learning. We demonstrate its applicability to autonomous learning in real robot and control tasks. In the second part of the talk, we will discuss an alternative method for learning controllers when it is no longer possible to learn (reasonable) forward models: Bayesian optimization. Bayesian optimization is typically used to optimize expensive-to-evaluate functions. We successfully applied Bayesian optimization to learning controller parameters for a bipedal robot, where modeling the dynamics is very difficult due to ground contacts. Using Bayesian optimization, we sidestep this modeling issue and directly optimize the controller parameters without the need of modeling the robot's dynamics.

#### October 2013

Speaker: Hedvig Kjellström and Carl Henrik Ek (KTH)

Topic: Methods and applications of machine learning in robotics and computer vision

Date: Wednesday October 9th

Abstract: In many applications in Robotics and Computer Vision, a model is used to infer the values of underlying variables from data coming from many different modalities, or views. Learning models for multiple view data is a very challenging problem, especially from a generative perspective. The main culprit of this stems from the fact that all the variations in the data need to be represented within the model, while the two views actually often only contain different subsets of the variations. This means that in many scenarios, one view pollutes the representation of the other, leading to a worse representation of the data. One approach in such scenarios is to learn a factorized latent representation, which allows for different parts of the latent representation to be responsible for generating different parts of the data.

We will in this talk show two models that exploits the concept of a latent factorization; a continuous model based on Gaussian Processes and a discrete topic model as an extension of Latent Dirichlet Allocation. We will show how different aspects of the factorization can be exploited in order to perform classification, ambiguity modelling, and to synthesize novel data in an intuitive manner.

#### September 2013

Speaker: Thomas Schön (Uppsala/Linköping)

Topic: Nonlinear system identification enabled via sequential Monte Carlo

Date: Wednesday September 18th

Abstract: Sequential Monte Carlo (SMC) methods are computational methods primarily used to deal with the state inference problem in nonlinear state space models. The particle filters and the particle smoothers are the most popular SMC methods. These methods open up for nonlinear system identification (both maximum likelihood and Bayesian solutions) in a systematic way. As we will see it is not a matter of directly applying the SMC algorithms, but there are several ways in which they enter as a natural part of the solution. The use of SMC for nonlinear system identification is a relatively recent development and the aim here is to first provide a brief overview of how SMC can be used in solving challenging nonlinear system identification problems by sketching both maximum likelihood and Bayesian solutions. We will then introduce a recent powerful class of algorithms collectively referred to as Particle Markov Chain Monte Carlo (PMCMC) targeting the Bayesian problem. PMCMC provides a systematic way of combining SMC and MCMC, where SMC is used to construct the high-dimensional proposal density for the MCMC sampler. The first results emerged in 2010 and since then we have witnessed a steadily increasing activity within this area. We focus on our new PMCMC method "Particle Gibbs with ancestor sampling" and show its use in computing the posterior distribution for a general Wiener model (i.e. identifying a Bayesian Wiener model).