
Data Science

Data science (cf. the Wikipedia definition of data science) describes an approach to problems that draws on a set of capabilities not located in any single classic community. Instead, these capabilities cross-breed between disciplines such as physics, biology, the social sciences, and economics. Data science uses elaborate computer science paradigms and requires a background in statistics. It feeds the new as well as the classical economy, as well as the medical field.

Data scientists: IT's new rock stars

Lecturing Schedule

Student feedback in summer term 2014 showed that many students lack basic knowledge of probability theory that others acquired in high school or during their bachelor studies. To accommodate this feedback, Tuesday, 14.4., 16.15 hrs, and Thursday, 16.4., 12.15 hrs, will be dedicated to an introduction to and refresher on core concepts of probability theory. We answer questions like: What is science, and is there an invisible dwarf on my shoulder? How do I find out if my conversation partner is a frequentist or a Bayesian, and who is more fun to hang out with? What is a null hypothesis and why do we need it? Be prepared.

Extra appointment: 22.4.2015 at 16.15 hrs in room B013. A guest talk motivating data science will be given by Dr. Christoph Tempich (Head of Consulting, Innovex). The content of this talk will be relevant for the exam.

Team registration: Please register for your exercise team under

Exam: 06.08., 12-14, E 113

Second exam: 22.10., 12-14, E 413

Date | Time slot | Topic | Room | Lecturer | Slides
14.4. | 16.15-17.45 | Introduction to probability theory - part 1 | G309 | Christoph Kling | slides
16.4. | 12.15-13.45 | Introduction to probability theory - part 2 | K208 | Christoph Kling | slides
21.4. | 16.15-17.45 | Lecture 1 | G309 | Steffen Staab | 1st+2nd lecture slides
22.4. | 16.15-17.45 | Extra Appointment | B013 | Dr. Christoph Tempich | Guest lecture
23.4. | 12.15-13.45 | Tutorial about probability theory | K208 | Christoph Kling | allbus slides exercise01
28.4. | 16.15-17.45 | Tutorial 1 | G309 | Christoph Kling | exercise02 slides
30.4. | 12.15-13.45 | Lecture 2 | B017 | Steffen Staab |
5.5. | 16.15-17.45 | Lecture 3 | G309 | Steffen Staab | 3rd lecture slides
7.5. | 12.15-13.45 | Tutorial 2 | B017 | Christoph Kling | slides exercise
12.5. | 16.15-17.45 | Lecture 4 | G309 | Steffen Staab |
14.5. | | Ascension Day | | |
19.5. | 16.15-17.45 | Tutorial 3 | G309 | Christoph Kling | slides
21.5. | 12.15-13.45 | Tutorial 4 | B017 | Christoph Kling | slides exercise
25.5.-29.5. | | Whitsun Break | | |
2.6. | 16.15-17.45 | Lecture 5 | G309 | Steffen Staab | 3rd lecture slides with minor modifications from slides 64 onwards
4.6. | | Corpus Christi | | |
9.6. | 16.15-17.45 | Lecture 6 | G309 | Christoph Kling | slides
11.6. | 12.15-13.45 | Tutorial 5 | B017 | Christoph Kling | tutorial MLE_LinReg
16.6. | 16.15-17.45 | Tutorial 6 | G309 | Christoph Kling | exercise tutorial
18.6. | 12.15-13.45 | Lecture 7 | B017 | Steffen Staab |
23.6. | 16.15-17.45 | Lecture 8 | G309 | Steffen Staab | slides
25.6. | 12.15-13.45 | Tutorial 7 | B017 | Christoph Kling | tutorial
30.6. | 16.15-17.45 | Tutorial 8 | G309 | Christoph Kling | code + exercise
2.7. | 12.15-13.45 | Lecture 9 | B017 | Steffen Staab | slides
7.7. | 16.15-17.45 | Tutorial 9 | G309 | Christoph Kling | tutorial exercise
9.7. | 12.15-13.45 | Lecture 10 | B017 | Steffen Staab | slides on Scalable Infrastructures, updated July 10
14.7. | 16.15-17.45 | Tutorial 10 | G309 | Christoph Kling | exercise Kling slides
16.7. | 12.15-13.45 | Lecture 11 | B017 | Steffen Staab | slides on Algebraic modelling
21.7. | 16.15-17.45 | Tutorial 11 | G309 | Christoph Kling |
23.7. | 12.15-13.45 | Q&A | B017 | Steffen Staab & Christoph Kling |


More information coming soon!


Lecture - Data Science

Course number: 04232

Lecturers: Christoph Kling, Steffen Staab
Dates: Tue 16.00-18.00
G 309, KO Building G


Exercise - Exercise for Data Science

Course number: 04232

Lecturers: Christoph Kling, Steffen Staab
  • Wed 12.00-14.00
    K 208, Building K


Lecturer: Prof. Steffen Staab 

Tutor: Christoph Kling

Format: Lecture + Practical Programming Exercises


  • Programming with Octave or R
  • Programming with Hadoop or Spark
    • e.g. Generate word cloud with Hadoop
  • Programming with GraphLab
  • Working with visualizations
    • Histograms, e.g. Zipf distribution of words
  • Working with TwitteR (not = Twitter!)
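
As a rough illustration of the word-count and Zipf exercises listed above, the following Python sketch imitates the map, shuffle, and reduce phases in memory and then ranks words by frequency. It is not Hadoop code and not taken from the course material; the function names (`map_phase`, `shuffle`, `reduce_phase`, `rank_frequency`) and the sample text are made up for this example.

```python
from collections import defaultdict
import re


def map_phase(line):
    """Emit a (word, 1) pair for every lowercase word in a line."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {word: sum(values) for word, values in groups.items()}


def rank_frequency(counts):
    """Return (rank, word, count) triples, most frequent word first."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]


# Hypothetical sample input; a real exercise would read a large corpus.
lines = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
table = rank_frequency(counts)

for rank, word, count in table:
    print(rank, word, count)
```

Under Zipf's law, the count falls roughly in inverse proportion to the rank, so plotting count against rank on log-log axes for a large corpus should give an approximately straight line. The tiny sample here is far too small to show that behaviour.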

Required background knowledge

  • Must have: Capability to program; several university courses in mathematics
  • Should have:
    • Basic statistics (or willingness to acquire it during the course)
    • C++ programming (or be eager to acquire it if you are already a Java expert)


Read [1] for a good introduction to what data science is about.

Exercises - General Rules

The exercises will be done in groups of two students. To be admitted to the exam, a group must submit solutions for all but two exercises. For this purpose, each group will get its own SVN repository.

Every group has to present one exercise, which can be chosen in the tutorial one week before. If no one volunteers, groups are chosen at random. All group members have to present a part of the exercise.

The chosen presentation appointment is mandatory. This means that a group member who is not present has to be excused (e.g., by a medical certificate). Otherwise, the missing person will not be allowed to participate in the exam.

Practical Programming Exercises

During the practical programming exercises, some Apache Hadoop programs will be written and executed. If you want to execute these programs on your local computer, you have to install and configure Hadoop locally. How this is done for Windows and Linux is described in the following tutorials:


References

  1. Vasant Dhar. Data Science and Prediction. In: Communications of the ACM, December 2013, Vol. 56, No. 12, pp. 64-73
  2. Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press (free download)
  3. Jeffrey Stanton, Introduction to Data Science (free download)
  4. John Hopcroft. Foundations of Data Science.
  6. Alon Halevy, Peter Norvig, Fernando Pereira. The unreasonable effectiveness of data. In: IEEE Intelligent Systems, March/April 2009.


Outdated List of Topics

  1. Data science: 
    history and background, change of paradigm from statistics to programming
  2. Problem scenarios:
    Our problem scenarios will mostly draw on open data, such as data found on the Web and open statistical data (e.g. provided by governments), for example:

    1. EU Open Data Portal
    2. eLisa
    3. Linked Open Data
    4. Medical data analysis
    5. Psycholinguistics (e.g. Beatles.pdf, SuicidalPoets.pdf)
  3. Background in statistics 
    (cf. Introduction to Web Science) Here, we will go into more detail on computing statistics and determining the quality of a probabilistic model. In particular, we will look at a whole set of distributions:

    • Uniform distribution
    • Normal distribution
    • Exponential distribution
    • Power law distribution
    • Poisson distribution
    • Log normal distribution

    And we will look at quality measures such as:

    • Student's t-test (assumes normally distributed data)
    • Chi-square test
    • ANOVA
    • Kullback-Leibler and Jensen-Shannon divergence
    • Kolmogorov-Smirnov test
  4. Hypothesis driven research
    1. Hypothesis testing
    2. Statistical fallacies (The theory of the stork)
    3. Applications, e.g. Web portal promotions, which do not work everywhere (Jure Leskovec, Bernardo Huberman, Lada Adamic. The dynamics of viral marketing, Proc. of ACM EC 2006)
  5. Programming paradigms
    1. Relational and NoSQL Database Management Systems
    2. Parallel task processing: Gridgain
    3. MapReduce (Hadoop/Spark)
    4. Graph Paradigms (GraphLab, neo4j, RDF Databases)
    5. Homomorphic machine learning
  6. Visualization
    1. Power of visualization: TED Talk by Hans Rosling 
  7. Simple machine learning on large scale data 
  8. Example application domain: text
    • bigrams
    • n-grams
    • generalized n-grams (gappy n-grams)
  9. Privacy 
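
Two of the quality measures named under topic 3, the Kullback-Leibler and Jensen-Shannon divergences, can be sketched in a few lines of Python for discrete distributions. This is a minimal illustration rather than course material: the function names and the example distributions are assumptions, and base-2 logarithms are chosen so that results are in bits.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits for discrete distributions.

    p and q are sequences of probabilities over the same outcomes.
    Terms with p[i] == 0 contribute nothing; q[i] == 0 where p[i] > 0
    makes the divergence infinite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and at most 1 bit with base-2 logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical example: a fair coin versus a biased coin.
fair = [0.5, 0.5]
biased = [0.9, 0.1]

print(kl_divergence(fair, biased))  # differs from the reverse direction
print(kl_divergence(biased, fair))
print(js_divergence(fair, biased))  # unchanged if the arguments are swapped
```

The Kullback-Leibler divergence is asymmetric and can be infinite when the second distribution assigns zero probability to an outcome the first one allows. The Jensen-Shannon divergence avoids both issues by comparing each distribution with their average.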

Prof. Dr. Steffen Staab


Dr. Christoph Kling