Statistical theory of unsupervised learning

Machine learning is often viewed as a statistical problem: given access to data generated from some underlying distribution, the goal of machine learning is to find a good prediction rule or an intrinsic structure in the data. Statistical learning theory adopts this point of view to provide a theoretical foundation for supervised machine learning, and also provides mathematical techniques to analyse the performance of supervised learning algorithms.
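In symbols, this viewpoint is the standard risk-minimisation setup (the usual textbook formulation, not tied to any particular algorithm):

    % Data (X, Y) are drawn i.i.d. from an unknown distribution P on X x Y.
    % A learning algorithm picks a predictor from a hypothesis class F by
    % minimising the empirical risk, as a proxy for the unknown true risk.
    \begin{align*}
      R(f) &= \mathbb{E}_{(X,Y) \sim P}\big[\ell(f(X), Y)\big]
          && \text{(true risk, unknown)} \\
      \widehat{R}_n(f) &= \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(X_i), Y_i\big)
          && \text{(empirical risk on $n$ samples)} \\
      \widehat{f}_n &\in \operatorname*{arg\,min}_{f \in \mathcal{F}} \widehat{R}_n(f)
          && \text{(empirical risk minimiser)}
    \end{align*}
    % Learning theory bounds the excess risk R(\widehat{f}_n) - \inf_{f \in \mathcal{F}} R(f)
    % in terms of the sample size n and the complexity of the class F.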

The picture is quite different in unsupervised learning, where the absence of labelled data makes it difficult to develop a theory that ensures an algorithm always performs well (see some philosophical thoughts in this direction). This makes the problem attractive to theoreticians, even beyond the machine learning community, and different techniques are employed to explain the performance of unsupervised learning algorithms. For instance, worst-case performance guarantees for clustering algorithms are often studied in theoretical computer science [KSS'04; Das'16], whereas performance in the limiting case of infinitely large datasets is studied in statistics [Pol'81; LBB'08]. Statistical physicists also study clustering, but in a setting where the data has a known hidden clustering, and one asks how accurately an algorithm can recover this hidden clustering [Moo'17]. Our group focusses on these techniques and how they can be used to explain the performance of a range of unsupervised learning algorithms, beyond clustering.
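As a small illustration of the planted-clustering setting, the following Python sketch (purely illustrative parameters, using numpy and scikit-learn) samples data with a hidden clustering, runs k-means, and measures how much of the hidden structure is recovered:

    # Sketch of the "planted clustering" setting: data with a hidden ground-truth
    # clustering, where one asks how well an algorithm recovers it.
    # All parameter choices below are illustrative.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    rng = np.random.default_rng(0)

    k, n_per_cluster, dim, separation = 3, 200, 10, 4.0
    centres = rng.normal(scale=separation, size=(k, dim))       # hidden cluster centres
    hidden_labels = np.repeat(np.arange(k), n_per_cluster)      # ground-truth assignment
    data = centres[hidden_labels] + rng.normal(size=(k * n_per_cluster, dim))

    predicted = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)

    # Adjusted Rand index compares the recovered clustering to the hidden one
    # (1.0 = perfect recovery, ~0.0 = no better than chance).
    print("recovery score:", adjusted_rand_score(hidden_labels, predicted))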

Theory of deep learning

The phenomenal performance of modern neural network architectures is widely known, even beyond computer science. What is discussed far less is the implication of this success for the theory community. While classical statistical learning theory can successfully explain the behaviour of many machine learning algorithms (such as SVMs and kernel machines), its predictions for deep NNs differ significantly from empirical findings [ZBH+'21]. A classic example of this deviation between theory and practice is the remarkable performance of over-parametrised NNs, where the number of network parameters exceeds the amount of training data. Classical learning theory predicts that such networks should overfit and fail to generalise to new data, and yet these networks generalise well in practice.
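As a toy illustration of this regime, the following PyTorch sketch (illustrative data and hyperparameters, not a recipe from any particular paper) trains a small network whose parameter count far exceeds the number of training points until it nearly interpolates them, and then reports both training and test error:

    # Minimal sketch of the over-parametrised regime: a fully connected network
    # with many more parameters than training points, trained to (near)
    # interpolation of the training data. Data and settings are illustrative.
    import torch

    torch.manual_seed(0)
    n_train, n_test, width = 50, 500, 512

    def target(x):                       # a simple 1-d regression target
        return torch.sin(3 * x)

    x_train = torch.rand(n_train, 1) * 2 - 1
    y_train = target(x_train) + 0.1 * torch.randn(n_train, 1)
    x_test = torch.rand(n_test, 1) * 2 - 1
    y_test = target(x_test)

    model = torch.nn.Sequential(
        torch.nn.Linear(1, width), torch.nn.ReLU(), torch.nn.Linear(width, 1)
    )
    n_params = sum(p.numel() for p in model.parameters())
    print(f"parameters: {n_params}, training points: {n_train}")   # heavily over-parametrised

    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step in range(5000):
        opt.zero_grad()
        loss = torch.mean((model(x_train) - y_train) ** 2)
        loss.backward()
        opt.step()

    with torch.no_grad():
        train_mse = torch.mean((model(x_train) - y_train) ** 2).item()
        test_mse = torch.mean((model(x_test) - y_test) ** 2).item()
    print(f"train MSE: {train_mse:.4f}, test MSE: {test_mse:.4f}")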

In recent years, the focus of theoretical analysis has shifted considerably towards understanding how optimisation works on NNs and what kind of solutions are learned during training. Our group particularly focusses on over-parametrised NNs, which (i) can be approximately analysed like kernel machines (via the neural tangent kernel) and (ii) exhibit an unconventional phenomenon known as double descent [NKB+'19]. We also study the behaviour of NNs beyond supervised learning.
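Double descent can be reproduced in a simple proxy model. The sketch below (illustrative settings only) fits minimum-norm regression on random ReLU features of varying width; the test error typically peaks near the interpolation threshold (number of features roughly equal to the number of samples) and decreases again in the heavily over-parametrised regime:

    # Sketch of double descent with random ReLU features, a common proxy for
    # wide two-layer networks. Settings are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    n_train, n_test, dim, noise = 100, 2000, 20, 0.1

    w_star = rng.normal(size=dim) / np.sqrt(dim)
    def sample(n):
        x = rng.normal(size=(n, dim))
        return x, x @ w_star + noise * rng.normal(size=n)

    x_train, y_train = sample(n_train)
    x_test, y_test = sample(n_test)

    def relu_features(x, weights):
        return np.maximum(x @ weights, 0.0)

    for n_features in [10, 50, 90, 100, 110, 200, 500, 2000]:
        weights = rng.normal(size=(dim, n_features)) / np.sqrt(dim)  # fixed random first layer
        phi_train = relu_features(x_train, weights)
        phi_test = relu_features(x_test, weights)
        # Minimum-norm (ridgeless) least squares; pinv handles both the under-
        # and over-parametrised regimes.
        coef = np.linalg.pinv(phi_train) @ y_train
        test_mse = np.mean((phi_test @ coef - y_test) ** 2)
        print(f"features = {n_features:5d}, test MSE = {test_mse:.3f}")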

Non-parametric methods for learning

Non-parametric approaches play a key role in learning complex distributions, structures or decision boundaries. Kernel methods are among the most popular non-parametric approaches in machine learning. Intuitively, kernel methods implicitly map the input data into a high-dimensional (possibly infinite-dimensional) space and apply standard ML algorithms in this space. While this intuition forms the basis for the theoretical analysis of kernel machines used in supervised learning, the understanding of kernel methods is limited in general learning problems beyond the supervised setting. Part of our focus is on providing rigorous mathematical explanations for when kernel methods work in practice, and on relating kernel methods to other non-parametric approaches, such as density-based methods [VBLG'21].
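The "implicit map" intuition is captured by the kernel trick. The following sketch of kernel ridge regression with a Gaussian kernel (illustrative data and hyperparameters) makes predictions using only pairwise kernel evaluations, without ever forming the feature map:

    # Sketch of the kernel trick: kernel ridge regression with a Gaussian (RBF)
    # kernel never forms the (infinite-dimensional) feature map explicitly; it
    # only needs pairwise kernel evaluations. Data and settings are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    def rbf_kernel(a, b, bandwidth=1.0):
        sq_dists = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
        return np.exp(-sq_dists / (2 * bandwidth**2))

    n, dim, ridge = 200, 2, 1e-3
    x = rng.uniform(-2, 2, size=(n, dim))
    y = np.sin(x[:, 0]) * np.cos(x[:, 1]) + 0.05 * rng.normal(size=n)

    K = rbf_kernel(x, x)                                   # n x n Gram matrix
    alpha = np.linalg.solve(K + ridge * np.eye(n), y)      # dual coefficients

    x_new = rng.uniform(-2, 2, size=(5, dim))
    predictions = rbf_kernel(x_new, x) @ alpha             # predictions via kernel evaluations only
    print(predictions)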

Random (Fourier) features were originally proposed as fast approximations for kernel methods, and have gained wide popularity in both theory and practice (blog by Benjamin Recht). Apart from the connection to kernel methods, random Fourier features also provide a theoretical route to understanding shallow NNs [BHZM'19; MM'19]. Hence, random Fourier features are a focus of our group. We are also working on a new (but natural) direction of analysis: the study of non-parametric approaches under non-parametric distributional assumptions. Such analysis provides insight into why non-parametric methods work well on a wide range of input data.
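The approximation idea behind random Fourier features fits in a few lines. The sketch below (illustrative dimensions and bandwidth) builds the random map of Rahimi and Recht and compares its inner product with the exact Gaussian kernel value:

    # Sketch of random Fourier features: a finite-dimensional random map z such
    # that z(x) . z(y) approximates the Gaussian kernel
    # exp(-||x - y||^2 / (2 * bandwidth^2)). Dimensions below are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    dim, n_features, bandwidth = 5, 2000, 1.0

    # Frequencies are sampled from the Fourier transform of the Gaussian kernel,
    # offsets uniformly from [0, 2*pi].
    freqs = rng.normal(scale=1.0 / bandwidth, size=(dim, n_features))
    offsets = rng.uniform(0, 2 * np.pi, size=n_features)

    def fourier_features(x):
        return np.sqrt(2.0 / n_features) * np.cos(x @ freqs + offsets)

    x, y = rng.normal(size=dim), rng.normal(size=dim)
    exact = np.exp(-np.sum((x - y) ** 2) / (2 * bandwidth**2))
    approx = fourier_features(x) @ fourier_features(y)
    print(f"exact kernel: {exact:.4f}, RFF approximation: {approx:.4f}")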

Machine learning on graphs

Graph-based learning problems arise in numerous applications, ranging from predicting the functionality of protein structures to finding implicit communities in social networks. It is widely known that standard ML models or NNs are not directly applicable to graph-structured data, and for similar reasons, classical learning theory is not well suited to explaining the performance of graph-based learning algorithms.

We focus on the algorithmic development and theoretical analysis of graph-based learning problems, including the study of graph neural networks, statistical hypothesis testing on graphs, finding communities in graphs, as well as clustering multiple graphs. For problems involving multiple graphs, it is worth noting that non-parametric approaches (like graph kernels) have shown success in practice. We are currently working on alternative non-parametric approaches based on graphons, which are non-parametric limits of infinitely large graphs. ML methods based on graphons have been successful in some problems, but the complexity of graphons makes them less attractive in practice, a situation we intend to change by constructing simple, yet provable, graphon-based methods.
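To give a flavour of what a graphon is, the sketch below (a toy two-block graphon, purely for illustration) samples a finite graph from a graphon by drawing a latent value per node and connecting node pairs independently:

    # Sketch of how a graphon generates graphs: a graphon is a symmetric function
    # W : [0,1]^2 -> [0,1]; an n-node graph is sampled by drawing a latent value
    # per node and connecting each pair independently with probability W(u_i, u_j).
    # The particular (two-block) graphon below is illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    def graphon(u, v):
        # higher edge probability when u and v fall on the same side of 1/2
        return 0.2 + 0.6 * ((u < 0.5) == (v < 0.5))

    n = 300
    latents = rng.uniform(size=n)
    probs = graphon(latents[:, None], latents[None, :])
    upper = np.triu(rng.uniform(size=(n, n)) < probs, k=1)    # sample each pair once
    adjacency = (upper | upper.T).astype(int)                 # symmetric, no self-loops
    print("nodes:", n, "edges:", adjacency.sum() // 2)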