Lecture: Mining Massive Datasets

Information: The number of course participants is limited this year to ensure high-quality correction of the project tasks, given the limited staff capacity available. Participants will be selected after the registration period closes; that is, we will not follow a "first come, first served" principle.


Data has supported research since the dawn of time, but recently there has been a paradigm shift in the way data is used. Today researchers and practitioners are mining data for patterns and trends that lead to new hypotheses. This shift is caused by the huge volumes of data available from, e.g., social media websites, web query logs, sensors, and medical devices. "Big Data" has been established as an umbrella term to cover such high-volume and complex data.

In this course, you will learn advanced machine learning and data mining techniques to process such complex data. Besides introducing the fundamental concepts, we will showcase their use in analyzing (i) high-dimensional data, (ii) graph/network data, and (iii) temporal data. The practical relevance of these methods will be highlighted by several important applications such as fraud detection, recommendation, and community detection.

This course builds upon the knowledge you gained in the lecture Machine Learning (IN2064). It provides advanced learning principles and covers more complex data domains.

The preliminary syllabus of the course is as follows:

  • Introduction
    • Machine Learning, Data Mining and Knowledge Discovery Process
    • Applications, Tasks
  • High-Dimensional Data
    • Hashing & Sketches
      • Min-Hashing
      • Locality Sensitive Hashing
    • Dimensionality Reduction & Matrix Factorization
      • Feature Selection & Random Projections
      • Non-Negative Matrix Factorization and Extensions
  • Graphs / Networks
    • Laws, Patterns and Generators
    • Spectral Learning
      • Ranking (e.g., PageRank, HITS)
      • Community Detection
    • Probabilistic Models
      • Stochastic Blockmodel (SBM)
      • (Stochastic) Variational Inference
      • Belief Propagation
    • Representation Learning for Graphs
      • Deep Learning for Graph Data
      • (Unsupervised) Node Embeddings
  • Temporal Data & Streaming
    • Sampling & Sketches
      • Bloom Filter
      • Counting Distinct Elements
      • Estimating Moments
    • HMMs, Belief Propagation
    • Neural Networks: RNN, LSTM


  • Lecture/Exercise: Wednesdays, 2:30pm - 4:00pm, room Interims Hörsaal 1
  • Lecture/Exercise: Thursdays, 2:00pm - 4:00pm, room Interims Hörsaal 1
  • All course material will be made available via Piazza
  • Required knowledge: Content of our Machine Learning lecture