Probabilistic Machine Learning on Big Data with the AMIDST Toolbox
The probabilistic view of machine learning, based on Bayesian statistics, follows the scientific method. In a first step, the set of hypotheses is explicitly defined with the help of a graphical language that makes it easy to introduce assumptions about causal relationships (e.g. this virus causes this symptom) and also allows unobservable mechanisms to be modelled (e.g. the presence of a virus). Then, a prior probability is assigned to each hypothesis. Finally, the hypotheses are tested against the empirical evidence (i.e. the data) by computing their posterior probability given the data. Using this posterior probability we can determine which hypothesis best explains the data. This approach is already having a great impact in many scientific fields such as genomics, cancer research, ecology and finance.
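As a concrete illustration of this workflow, the short Java sketch below applies Bayes' rule to the virus/symptom example mentioned above, computing the posterior probability of the unobservable hypothesis "the virus is present" from a prior and two likelihoods. The numerical probability values are illustrative assumptions, not figures from the talk.

```java
// Minimal sketch of Bayesian hypothesis testing for the virus/symptom example.
// All probability values below are illustrative assumptions.
public class VirusPosterior {
    public static void main(String[] args) {
        double priorVirus = 0.01;           // prior P(virus): belief before seeing data
        double symptomGivenVirus = 0.90;    // likelihood P(symptom | virus)
        double symptomGivenNoVirus = 0.05;  // likelihood P(symptom | no virus)

        // Evidence term: total probability of observing the symptom.
        double evidence = symptomGivenVirus * priorVirus
                        + symptomGivenNoVirus * (1.0 - priorVirus);

        // Posterior P(virus | symptom): how well the hypothesis explains the data.
        double posteriorVirus = symptomGivenVirus * priorVirus / evidence;
        System.out.printf("P(virus | symptom) = %.4f%n", posteriorVirus);
    }
}
```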
In this talk, we will present the AMIDST Toolbox, a software package for defining general probabilistic machine learning models and applying them to small or big data sets by exploiting different hardware architectures, ranging from multi-core CPUs (by relying on Java 8) to clusters of computers with hundreds of nodes (by relying on Apache Flink, Apache Spark and Amazon Web Services). We will also illustrate this approach in a real use case from the financial domain, where the profiles of millions of customers are analyzed. Applications to autonomous driving will also be discussed.
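To give a flavour of the multi-core CPU setting, the sketch below scores a large collection of customer records in parallel using standard Java 8 parallel streams. It deliberately does not use the AMIDST API itself; the toy risk model and the simulated customer features are assumptions made purely for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Minimal sketch (not the AMIDST API): using Java 8 parallel streams to spread
// independent per-customer score computations across the cores of a CPU.
// The "defaultRisk" model and the random features are illustrative assumptions.
public class ParallelScoring {
    public static void main(String[] args) {
        // Simulated customer features, e.g. a normalized debt level per customer.
        List<Double> debtLevels = IntStream.range(0, 1_000_000)
                .mapToDouble(i -> Math.random())
                .boxed()
                .collect(Collectors.toList());

        // Score every customer in parallel; each computation is independent,
        // so the work splits cleanly across all available cores.
        double averageRisk = debtLevels.parallelStream()
                .mapToDouble(ParallelScoring::defaultRisk)
                .average()
                .orElse(0.0);

        System.out.printf("Average default risk: %.4f%n", averageRisk);
    }

    // Toy posterior-style score: a logistic function of the debt level.
    static double defaultRisk(double debt) {
        return 1.0 / (1.0 + Math.exp(-(4.0 * debt - 2.0)));
    }
}
```

Because each score depends only on its own record, the computation parallelizes trivially; the same kind of record-level independence is what distributed engines such as Flink and Spark exploit when scaling out to a cluster.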