# Data Science

Data Science (cf. the Wikipedia definition of data science) describes an approach to problems that draws on a set of capabilities not located in any single classic community. Instead, these capabilities cross-breed between disciplines such as physics, biology, social sciences and economics. Data science uses elaborate computer science paradigms and requires a background in statistics. It feeds the new and the classical economy as well as the medical field.

Data scientists: IT's new rock stars

## Lecturing Schedule

Student feedback in summer term 2014 showed that many students lack basic knowledge of probability theory, which others acquired in high school or during their bachelor studies. To accommodate this feedback, Tuesday, 14.4., 16.15 hrs, and Thursday, 16.4., 12.15 hrs, will be dedicated to an introduction to and refresher of core concepts of probability theory. We answer questions like: What is science, and is there an invisible dwarf on my shoulder? How do I find out if my conversation partner is a frequentist or a Bayesian, and who is more fun to hang out with? What is a null hypothesis and why do we need it? Be prepared.

Extra appointment: 22.4.2015 at 16.15 hrs in room B013. A guest talk motivating data science will be given by Dr. Christoph Tempich (Head of Consulting, Innovex). The content of this talk will be relevant for the exam.

Exam: 06.08., 12-14, E 113

Second exam: 22.10., 12-14, E 413

| Date | Time | Session | Room | Lecturer | Materials |
|---|---|---|---|---|---|
| 14.4. | 16.15-17.45 | Introduction to probability theory - part 1 | G309 | Christoph Kling | slides |
| 16.4. | 12.15-13.45 | Introduction to probability theory - part 2 | K208 | Christoph Kling | slides |
| 21.4. | 16.15-17.45 | Lecture 1 | G309 | Steffen Staab | 1st+2nd lecture slides |
| 22.4. | 16.15-17.45 | Extra Appointment | B013 | Dr. Christoph Tempich | Guest lecture |
| 23.4. | 12.15-13.45 | Tutorial about probability theory | K208 | Christoph Kling | allbus slides exercise01 |
| 28.4. | 16.15-17.45 | Tutorial 1 | G309 | Christoph Kling | exercise02 slides |
| 30.4. | 12.15-13.45 | Lecture 2 | B017 | Steffen Staab | |
| 5.5. | 16.15-17.45 | Lecture 3 | G309 | Steffen Staab | 3rd lecture slides |
| 7.5. | 12.15-13.45 | Tutorial 2 | B017 | Christoph Kling | slides exercise |
| 12.5. | 16.15-17.45 | Lecture 4 | G309 | Steffen Staab | |
| 14.5. | | Ascension Day | | | |
| 19.5. | 16.15-17.45 | Tutorial 3 | G309 | Christoph Kling | slides |
| 21.5. | 12.15-13.45 | Tutorial 4 | B017 | Christoph Kling | slides exercise |
| 25.5.-29.5. | | Whitsun Break | | | |
| 2.6. | 16.15-17.45 | Lecture 5 | G309 | Steffen Staab | 3rd lecture slides with minor modifications from slides 64 onwards |
| 4.6. | | Corpus Christi | | | |
| 9.6. | 16.15-17.45 | Lecture 6 | G309 | Christoph Kling | slides |
| 11.6. | 12.15-13.45 | Tutorial 5 | B017 | Christoph Kling | tutorial MLE_LinReg |
| 16.6. | 16.15-17.45 | Tutorial 6 | G309 | Christoph Kling | exercise tutorial |
| 18.6. | 12.15-13.45 | Lecture 7 | B017 | Steffen Staab | |
| 23.6. | 16.15-17.45 | Lecture 8 | G309 | Steffen Staab | slides |
| 25.6. | 12.15-13.45 | Tutorial 7 | B017 | Christoph Kling | tutorial |
| 30.6. | 16.15-17.45 | Tutorial 8 | G309 | Christoph Kling | code + exercise |
| 2.7. | 12.15-13.45 | Lecture 9 | B017 | Steffen Staab | slides |
| 7.7. | 16.15-17.45 | Tutorial 9 | G309 | Christoph Kling | tutorial exercise |
| 9.7. | 12.15-13.45 | Lecture 10 | B017 | Steffen Staab | slides on Scalable Infrastructures, updated July 10 |
| 14.7. | 16.15-17.45 | Tutorial 10 | G309 | Christoph Kling | exercise Kling slides |
| 16.7. | 12.15-13.45 | Lecture 11 | B017 | Steffen Staab | slides on Algebraic modelling |
| 21.7. | 16.15-17.45 | Tutorial 11 | G309 | Christoph Kling | |
| 23.7. | 12.15-13.45 | Q&A | B017 | Steffen Staab & Christoph Kling | |

Lecture - Data Science

Course number: 04232

Lecturers: Christoph Kling, Steffen Staab
Dates: Tue 16.00-18.00, G 309, KO Building G

Tutorial - Tutorial for Data Science

Course number: 04232

Lecturers: Christoph Kling, Steffen Staab
Dates: Wed 12.00-14.00, K 208, Building K

Lecturer: Prof. Steffen Staab

Tutor: Christoph Kling

Format: Lecture + Practical Programming Exercises


Exercises:

• Programming with Octave or R
• Programming with Hadoop or Spark
  • e.g. Generate word cloud with Hadoop
• Programming with GraphLab
• Working with visualizations
  • Histograms, e.g. Zipf distribution of words
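
As a rough illustration of the histogram exercise mentioned above, the following Python sketch counts word frequencies and prints a rank-frequency table with text bars. Under Zipf's law, the count of the word at rank r is roughly proportional to 1/r. The function names and the sample sentence are hypothetical and not taken from the course material, which uses Octave, R or Hadoop.

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return (rank, word, count) tuples sorted by descending count."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by count descending, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def text_histogram(table, width=40):
    """Render the counts as bars of '#' characters, scaled to the top count."""
    if not table:
        return ""
    top = table[0][2]
    lines = []
    for rank, word, count in table:
        bar = "#" * max(1, round(width * count / top))
        lines.append(f"{rank:>3} {word:<10} {count:>4} {bar}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample text; a real exercise would read a large corpus.
    sample = "the cat and the dog and the bird saw the cat"
    print(text_histogram(rank_frequency(sample)))
```

On a large corpus, plotting count against rank on log-log axes should give an approximately straight line if the Zipf assumption holds.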

Required background knowledge

• Must have: Capability to program; several university courses in mathematics
• Should have:
  • Basic statistics (or willingness to acquire it during the course)
  • C++ programming (or be eager to acquire it if you are already a Java expert)

Read [1] for a good introduction to what data science is about.

Exercises - General Rules

The exercises will be done in groups of two students. To be admitted to the exam, a group has to submit solutions for all but two exercises. For this purpose, each group will get its own SVN repository.

Every group has to present one exercise, which can be chosen in the tutorial one week before. If no one volunteers, groups are chosen at random. All group members have to present a part of the exercise.

The chosen presentation appointment is mandatory. If a group member cannot attend, he/she has to be excused (e.g., by a medical certificate). Otherwise, the missing person will not be allowed to participate in the exam.

Practical Programming Exercises

During the practical programming exercises, some Apache Hadoop programs will be written and executed. If you want to execute these programs on your local computer, you have to install and configure Hadoop locally.
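
The course's actual Hadoop programs are not reproduced here. As a hedged sketch of the MapReduce pattern that Hadoop implements, the following plain-Python word count runs the map, shuffle and reduce phases in a single process. All function names are hypothetical, and it is an assumption that this mirrors the exercises; a real Hadoop job would distribute these phases across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce locally over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts)
                for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    # Made-up input lines standing in for a text file on HDFS.
    print(word_count(["to be or not to be", "to do"]))
```

The resulting word counts are the kind of data a word cloud would be drawn from.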

## Literature

1. Vasant Dhar. Data Science and Prediction. In: Communications of the ACM, December 2013, Vol. 56, No. 12, pp. 64-73
2. Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press (free download)
4. John Hopcroft. Foundations of Data Science.
5. Stephen Wolfram. A New Kind of Science. http://www.wolframscience.com/thebook.html
6. Alon Halevy, Peter Norvig, Fernando Pereira. The unreasonable effectiveness of data. In: IEEE Intelligent Systems, March/April 2009.

## Outdated List of Topics

1. Data science: history and background, change of paradigm from statistics to programming
2. Problem scenarios: Our problem scenarios will mostly involve open data, such as data found on the Web and open statistical data (e.g. provided by governments), for example:

   1. EU Open Data Portal
   2. eLisa
   4. Medical data analysis
   5. Psycholinguistics (e.g. Beatles.pdf, SuicidalPoets.pdf)
3. Background in statistics (cf. Introduction to Web Science). Here, we will go into more detail on computing statistics and determining the quality of a probabilistic model. In particular, we will look at a whole set of distributions:

• Uniform distribution
• Normal distribution
• Exponential distribution
• Power law distribution
• Poisson distribution
• Log normal distribution

And we will look at quality measures such as:

• Student's t-test (valid only for normal distributions)
• Chi-square
• ANOVA
• Kullback-Leibler and Jensen-Shannon
• Kolmogorov-Smirnov
4. Hypothesis driven research
   1. Hypothesis testing
   2. Statistics fallacies (The theory of the stork)
   3. Applications, e.g. Web portal promotions, does not work everywhere (Jure Leskovec, Bernardo Huberman, Lada Adamic. The dynamics of viral marketing, Proc. of ACM EC 2006)
1. Relational and NoSQL Database Management Systems
4. Graph Paradigms (GraphLab, neo4j, RDF Databases)
5. Homomorphic machine learning
6. Visualization
   1. Power of visualization: TED Talk by Hans Rosling
7. Simple machine learning on large scale data
8. Example application domain: text
   • b-grams
   • n-grams
   • generalized n-grams (gappy n-grams)
9. Privacy
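
The Kullback-Leibler and Jensen-Shannon measures listed under "Background in statistics" above compare two probability distributions. A minimal Python sketch for discrete distributions follows, assuming both inputs are equal-length probability lists that each sum to 1 and using base-2 logarithms. The function names are hypothetical and not taken from the course material.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits for discrete distributions.

    Terms with p[i] == 0 contribute 0. If q[i] == 0 while p[i] > 0,
    the divergence is infinite.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total += pi * math.log2(pi / qi)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, at most 1 bit with base-2 logs."""
    # m is the pointwise average of the two distributions.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.25, 0.75]
    print(kl_divergence(p, q), kl_divergence(q, p))  # not symmetric
    print(js_divergence(p, q), js_divergence(q, p))  # symmetric
```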

Contributors:

## Prof. Dr. Steffen Staab

staab@uni-koblenz.de