Probabilistic Machine Learning on Big Data with the AMIDST Toolbox
The probabilistic view of machine learning, based on Bayesian statistics, follows the scientific method. In a first step, the set of hypotheses is explicitly defined with the help of a graphical language that makes it easy to introduce assumptions about causal relationships (e.g. this virus causes this symptom) and also allows unobservable mechanisms to be modelled (e.g. the presence of a virus). Then, a prior probability is assigned to each hypothesis. Finally, the hypotheses are tested against the empirical evidence (i.e. the data) by computing their posterior probability given the data. Using this posterior probability we can determine which hypothesis best explains the data. This approach is already having a great impact in many scientific fields such as genomics, cancer research, ecology and finance.
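As a concrete illustration of this workflow, the short Java sketch below applies Bayes' rule to the virus/symptom example mentioned above, computing the posterior probability of the unobservable hypothesis "the virus is present" from a prior and two likelihoods. The numerical probability values are illustrative assumptions, not figures from the talk.

```java
// Minimal sketch of Bayesian hypothesis testing for the virus/symptom example.
// All probability values below are illustrative assumptions.
public class VirusPosterior {
    public static void main(String[] args) {
        double priorVirus = 0.01;           // prior P(virus): belief before seeing data
        double symptomGivenVirus = 0.90;    // likelihood P(symptom | virus)
        double symptomGivenNoVirus = 0.05;  // likelihood P(symptom | no virus)

        // Evidence term: total probability of observing the symptom.
        double evidence = symptomGivenVirus * priorVirus
                        + symptomGivenNoVirus * (1.0 - priorVirus);

        // Posterior P(virus | symptom): how well the hypothesis explains the data.
        double posteriorVirus = symptomGivenVirus * priorVirus / evidence;
        System.out.printf("P(virus | symptom) = %.4f%n", posteriorVirus);
    }
}
```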
In this talk, we will present the AMIDST Toolbox, a software package for defining general probabilistic machine learning models and applying them to small or big data sets by exploiting different hardware architectures, ranging from multi-core CPUs (by relying on Java 8) to clusters of computers with hundreds of nodes (by relying on Apache Flink, Apache Spark and Amazon Web Services). We will also illustrate this approach in a real use case from the financial domain, where the profiles of millions of customers are analyzed. Applications to autonomous driving will also be discussed.
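To give a flavour of the multi-core CPU setting, the sketch below scores a large collection of customer records in parallel using standard Java 8 parallel streams. It deliberately does not use the AMIDST API itself; the toy risk model and the simulated customer features are assumptions made purely for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Minimal sketch (not the AMIDST API): using Java 8 parallel streams to spread
// independent per-customer score computations across the cores of a CPU.
// The "defaultRisk" model and the random features are illustrative assumptions.
public class ParallelScoring {
    public static void main(String[] args) {
        // Simulated customer features, e.g. a normalized debt level per customer.
        List<Double> debtLevels = IntStream.range(0, 1_000_000)
                .mapToDouble(i -> Math.random())
                .boxed()
                .collect(Collectors.toList());

        // Score every customer in parallel; each computation is independent,
        // so the work splits cleanly across all available cores.
        double averageRisk = debtLevels.parallelStream()
                .mapToDouble(ParallelScoring::defaultRisk)
                .average()
                .orElse(0.0);

        System.out.printf("Average default risk: %.4f%n", averageRisk);
    }

    // Toy posterior-style score: a logistic function of the debt level.
    static double defaultRisk(double debt) {
        return 1.0 / (1.0 + Math.exp(-(4.0 * debt - 2.0)));
    }
}
```

Because each score depends only on its own record, the computation parallelizes trivially; the same kind of record-level independence is what distributed engines such as Flink and Spark exploit when scaling out to a cluster.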