Lecture: Mining Massive Datasets

Note: There will be no meeting on Oct. 18th. The first lecture will take place on Oct. 19th.

Overview

Data has supported research since the dawn of time, but recently there has been a paradigm shift in the way data is used. Today researchers and practitioners are mining data for patterns and trends that lead to new hypotheses. This shift is caused by the huge volumes of data available from, e.g., social media websites, web query logs, sensors, and medical devices. "Big Data" has been established as an umbrella term to cover such high-volume and complex data.

In this course, you will learn data mining and machine learning techniques to process large datasets and extract valuable knowledge from them. We will study modern computing frameworks for large-scale data analytics (e.g., Apache Hadoop/Spark) as well as models and algorithms for pattern detection in large data. In particular, we will discuss principles that are designed for today's complex data such as networks or temporal data. The practical relevance of these methods will be highlighted by multiple important applications such as fraud detection, recommendation, or community detection.

Optional hands-on projects: Besides the lectures and exercises, we will offer you the opportunity to work on a hands-on data science/machine learning project, so that you can apply the techniques you have learned to interesting real-world applications. Your performance in the project can be used to improve your final grade for the course.

The preliminary syllabus of the course is as follows:

  • Introduction
    • Data Mining and Knowledge Discovery Process, Machine Learning
    • Applications, Tasks
  • Hashing & Sketches
    • Similarity Search
    • Min-Hashing, Locality Sensitive Hashing
    • Bloom Filter
  • Dimensionality Reduction & Matrix Factorization
    • Feature Selection & Random Projections
    • PCA / SVD
    • Non-Negative Matrix Factorization and Extensions
  • (Distributed) Optimization
    • Unconstrained / Constrained Optimization
    • Convex Optimization
    • (Stochastic) Gradient Descent
  • Network Data
    • Laws/Patterns and Generators
    • PageRank and Extensions, HITS
    • Clustering/Community Detection, Spectral Clustering
    • Probabilistic Models: Inference, Distributed Learning, Models for Network Data
  • Temporal Data & Streaming
    • Sampling Techniques
    • Counting Distinct Elements
    • Estimating Moments
  • Systems & Tools
    • MapReduce and Extensions (e.g., Spark)
    • Big Learning Systems
    • Graph Processing Systems

Information

  • Lecture: Wednesdays, 2pm - 3:30/4pm, room Interims Hörsaal 2
  • Exercise: Tuesdays, 1:45pm - 3:15pm, room HS 1, Friedrich L. Bauer Hörsaal
  • For more details see TUMonline
  • All course material will be made available via Moodle

Literature