Data Science
Summer Term 2015
Data Science (cf. the Wikipedia definition of data science) describes an approach to problems that draws on a set of capabilities not located in any single classic community. These capabilities cross between disciplines such as physics, biology, the social sciences and economics. Data science uses elaborate computer science paradigms and requires a background in statistics. It serves the new economy, the classical economy and the medical field.
Data scientists: IT's new rock stars
Lecturing Schedule
Student feedback from summer term 2014 showed that many students lack the basic knowledge of probability theory that others acquired in high school or during their bachelor studies. To accommodate this feedback, Tuesday, 14.4., 16.15 hrs, and Thursday, 16.4., 12.15 hrs, will be dedicated to an introduction to and refresher on core concepts of probability theory. We answer questions like: What is science, and is there an invisible dwarf on my shoulder? How do I find out if my conversation partner is a frequentist or a Bayesian, and who is more fun to hang out with? What is a null hypothesis and why do we need it? Be prepared.
Extra appointment: 22.4.2015 at 16.15 hrs in room B013. A guest talk motivating data science will be given by Dr. Christoph Tempich (Head of Consulting, Innovex). Content of this talk will be relevant for the exam.
Team registration: Please register for your exercise team under https://ist.unikoblenz.de/teams/en/user/registration/tlpo0skqma
Exam: 06.08., 12-14, E 113
Second exam: 22.10., 12-14, E 413
| Date | Time slot | Topic | Room | Lecturer | Slides |
|---|---|---|---|---|---|
| 14.4. | 16.15-17.45 | Introduction to probability theory, part 1 | G309 | Christoph Kling | slides |
| 16.4. | 12.15-13.45 | Introduction to probability theory, part 2 | K208 | Christoph Kling | slides |
| 21.4. | 16.15-17.45 | Lecture 1 | G309 | Steffen Staab | 1st+2nd lecture slides |
| 22.4. | 16.15-17.45 | Extra Appointment | B013 | Dr. Christoph Tempich | Guest lecture |
| 23.4. | 12.15-13.45 | Tutorial about probability theory | K208 | Christoph Kling | allbus slides exercise01 |
| 28.4. | 16.15-17.45 | Tutorial 1 | G309 | Christoph Kling | exercise02 slides |
| 30.4. | 12.15-13.45 | Lecture 2 | B017 | Steffen Staab | |
| 5.5. | 16.15-17.45 | Lecture 3 | G309 | Steffen Staab | 3rd lecture slides |
| 7.5. | 12.15-13.45 | Tutorial 2 | B017 | Christoph Kling | slides exercise |
| 12.5. | 16.15-17.45 | Lecture 4 | G309 | Steffen Staab | |
| 14.5. | | Ascension Day | | | |
| 19.5. | 16.15-17.45 | Tutorial 3 | G309 | Christoph Kling | slides |
| 21.5. | 12.15-13.45 | Tutorial 4 | B017 | Christoph Kling | slides exercise |
| 25.5.-29.5. | | Whitsun Break | | | |
| 2.6. | 16.15-17.45 | Lecture 5 | G309 | Steffen Staab | 3rd lecture slides with minor modifications from slide 64 onwards |
| 4.6. | | Corpus Christi | | | |
| 9.6. | 16.15-17.45 | Lecture 6 | G309 | Christoph Kling | slides |
| 11.6. | 12.15-13.45 | Tutorial 5 | B017 | Christoph Kling | tutorial MLE_LinReg |
| 16.6. | 16.15-17.45 | Tutorial 6 | G309 | Christoph Kling | exercise tutorial |
| 18.6. | 12.15-13.45 | Lecture 7 | B017 | Steffen Staab | |
| 23.6. | 16.15-17.45 | Lecture 8 | G309 | Steffen Staab | slides |
| 25.6. | 12.15-13.45 | Tutorial 7 | B017 | Christoph Kling | tutorial |
| 30.6. | 16.15-17.45 | Tutorial 8 | G309 | Christoph Kling | code + exercise |
| 2.7. | 12.15-13.45 | Lecture 9 | B017 | Steffen Staab | slides |
| 7.7. | 16.15-17.45 | Tutorial 9 | G309 | Christoph Kling | tutorial exercise |
| 9.7. | 12.15-13.45 | Lecture 10 | B017 | Steffen Staab | slides on Scalable Infrastructures, updated July 10 |
| 14.7. | 16.15-17.45 | Tutorial 10 | G309 | Christoph Kling | exercise Kling slides |
| 16.7. | 12.15-13.45 | Lecture 11 | B017 | Steffen Staab | slides on Algebraic modelling |
| 21.7. | 16.15-17.45 | Tutorial 11 | G309 | Christoph Kling | |
| 23.7. | 12.15-13.45 | Q&A | B017 | Steffen Staab & Christoph Kling | |
More information coming soon!
Course number: 04232
Lecturer(s): Christoph Kling, Steffen Staab
Date(s): Tue 16.00-18.00, G 309, KO Building G

Lecturer: Prof. Steffen Staab
Tutor: Christoph Kling
Format: Lecture + Practical Programming Exercises
Exercises:
 Programming with Octave or R
 Programming with Hadoop or Spark
  e.g. generate a word cloud with Hadoop
 Programming with GraphLab
 Working with visualizations
  Histograms, e.g. Zipf distribution of words
 Working with TwitteR (not the same as Twitter!)
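To give an idea of the histogram exercise listed above, here is a minimal sketch of a rank-frequency table for words. It is written in Python purely for illustration (the course itself names Octave or R); the function names and the sample sentence are hypothetical. Under Zipf's law, a word's frequency is roughly inversely proportional to its rank, which only becomes visible on a much larger corpus than this toy input.

```python
from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

def text_histogram(table, width=40):
    """Render the counts as horizontal bars of '#' characters."""
    top = table[0][2]  # count of the most frequent word
    return "\n".join(
        f"{rank:>3} {word:<10} {'#' * max(1, count * width // top)} {count}"
        for rank, word, count in table
    )

# Hypothetical toy corpus; a real exercise would read a large text file.
sample = ("the cat and the dog and the bird saw the cat "
          "the dog saw a bird and a cat")
table = rank_frequency(sample)
print(text_histogram(table))
```

Plotting rank against count on log-log axes is the usual next step: an approximately straight line suggests a power law.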
Required background knowledge
 Must have: Capability to program; several university courses in mathematics
 Should have:
 Basic statistics (or willingness to acquire it during the course)
 C++ programming (or be eager to acquire it if you are already a Java expert)
Read [1] for a good introduction to what data science is about.
Exercises - General Rules
The exercises will be done in groups of two students. To be admitted to the exam, each group must submit solutions for all but two exercises. For this purpose, each group will get its own SVN repository.
Every group has to present one exercise, which can be chosen in the tutorial one week before. If no one volunteers, groups are chosen at random. All group members have to present a part of the exercise.
The chosen presentation appointment is mandatory. If a group member cannot attend, he/she must provide an excuse (e.g., a medical certificate). Otherwise, the missing person will not be allowed to participate in the exam.
Practical Programming Exercises
During the practical programming exercises you will write and run several Apache Hadoop programs. If you want to run these programs on your own computer, you have to install and configure Hadoop locally.
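Before installing Hadoop, it can help to see the MapReduce pattern in isolation. The following is a minimal single-process sketch in Python, not Hadoop code: it does not use any Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical names chosen for this illustration. It shows the three steps a word count goes through.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, ones):
    """Reducer: sum the counts collected for one word."""
    return word, sum(ones)

def word_count(lines):
    """Run map, shuffle and reduce in sequence on an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, ones) for word, ones in shuffle(pairs).items())

# Hypothetical input; on a cluster each line could be handled by a different mapper.
lines = ["big data big ideas", "data science needs data"]
counts = word_count(lines)
print(counts)
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data over the network; the logic of each step stays the same.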
Literature
 Vasant Dhar. Data Science and Prediction. In: Communications of the ACM, December 2013, Vol. 56, No. 12, pp. 64-73
 Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press (free download)
 Jeffrey Stanton, Introduction to Data Science (free download)
 John Hopcroft. Foundations of Data Science.
 * http://www.wolframscience.com/thebook.html
 Alon Halevy, Peter Norvig, Fernando Pereira. The Unreasonable Effectiveness of Data. In: IEEE Intelligent Systems, March/April 2009.
Outdated List of Topics
 Data science: history and background, change of paradigm from statistics to programming
 Problem scenarios: Our problem scenarios will mostly involve open data, such as data found on the Web and open statistical data (such as that provided by governments), e.g.:
  EU Open Data Portal
  eLisa
  Linked Open Data
  Medical data analysis
  Psycholinguistics (e.g. Beatles.pdf, SuicidalPoets.pdf)
 Background in statistics
(cf. Introduction to Web Science) Here, we will go into more detail on computing statistics and determining the quality of a probabilistic model. In particular, we will look at a whole set of distributions:
  Uniform distribution
  Normal distribution
  Exponential distribution
  Power law distribution
  Poisson distribution
  Log-normal distribution
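As a rough illustration of the distributions listed above, the sketch below draws samples from each one using Python's standard `random` module and compares the sample mean with the theoretical mean. The parameter values are arbitrary assumptions made for this example, the Pareto distribution stands in for "power law", and the Poisson sampler is a hand-written helper (Knuth's multiplication method) because the standard library does not provide one.

```python
import math
import random
from statistics import fmean

def poisson(rng, lam):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(42)  # fixed seed so repeated runs draw the same samples
n = 20_000

# name -> (sampler, theoretical mean); all parameters are illustrative choices
samplers = {
    "uniform(0, 1)":         (lambda: rng.uniform(0, 1), 0.5),
    "normal(0, 1)":          (lambda: rng.gauss(0, 1), 0.0),
    "exponential(rate=2)":   (lambda: rng.expovariate(2.0), 0.5),
    "Pareto(alpha=3)":       (lambda: rng.paretovariate(3.0), 1.5),
    "Poisson(4)":            (lambda: poisson(rng, 4.0), 4.0),
    "log-normal(0, 0.5)":    (lambda: rng.lognormvariate(0, 0.5), math.exp(0.125)),
}

results = {}
for name, (draw, expected) in samplers.items():
    mean = fmean(draw() for _ in range(n))
    results[name] = (mean, expected)
    print(f"{name:<22} sample mean {mean:7.3f}   theoretical {expected:7.3f}")
```

For heavy-tailed distributions the sample mean converges slowly; with a Pareto exponent of 2 or less the variance is infinite and the comparison above becomes unreliable.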
And we will look at quality measures such as:
  Student's t-test (valid only for normally distributed data)
  Chi-square
  ANOVA
  Kullback-Leibler and Jensen-Shannon divergence
  Kolmogorov-Smirnov
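Two of the measures above, the Kullback-Leibler and Jensen-Shannon divergences, are short enough to sketch directly. The Python code below is a minimal illustration for discrete distributions given as lists of probabilities; the function names and the example distributions are assumptions made here, not course material. Logarithms are taken to base 2, so results are in bits.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits; requires q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and always finite."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical distributions over three outcomes.
p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]

print("KL(p || q) =", kl_divergence(p, q))
print("JS(p, q)   =", js_divergence(p, q))
print("JS(q, p)   =", js_divergence(q, p))
```

Note that KL(q || p) is undefined for this pair because p assigns zero probability to an outcome that q considers possible; the Jensen-Shannon divergence avoids this problem by comparing both distributions with their mixture.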
 Hypothesis-driven research
  Hypothesis testing
  Statistical fallacies (the theory of the stork)
  Applications, e.g. Web portal promotions, which do not work everywhere (Jure Leskovec, Bernardo Huberman, Lada Adamic. The dynamics of viral marketing, Proc. of ACM EC 2006)
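One way to make hypothesis testing concrete is a permutation test, sketched below in Python. This is an illustrative choice, not necessarily the method used in the lectures. The null hypothesis is that both groups come from the same distribution; the test estimates how often a random relabelling of the pooled data yields a difference in means at least as large as the observed one. The function name, parameters and measurement values are all hypothetical.

```python
import random
from statistics import fmean

def permutation_test(a, b, n_resamples=5000, seed=0):
    """Two-sided p-value for the null hypothesis that a and b share one distribution."""
    rng = random.Random(seed)
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabelling of the pooled data
        diff = abs(fmean(pooled[:len(a)]) - fmean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical measurements for two groups.
control = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0]
treated = [6.0, 6.2, 5.9, 6.1, 6.3, 5.8, 6.0, 6.1]

p_value = permutation_test(control, treated)
print("p-value:", p_value)
```

A small p-value means the observed difference would be unusual if the null hypothesis were true; it does not by itself establish a causal explanation, which is the point of the stork example.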
 Programming paradigms
  Relational and NoSQL Database Management Systems
  Parallel task processing: GridGain
  MapReduce (Hadoop/Spark)
  Graph Paradigms (GraphLab, Neo4j, RDF Databases)
  Homomorphic machine learning
 Visualization
  Power of visualization: TED Talk by Hans Rosling
 Simple machine learning on large-scale data
  Example application domain: text
   bigrams
   n-grams
   generalized n-grams (gappy n-grams)
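The n-gram variants named above can be sketched in a few lines of Python. The definition of a gappy n-gram used here is an assumption for illustration: the tokens keep their order, and up to `max_gap` tokens in total may be skipped inside the n-gram. Function names and the example sentence are hypothetical.

```python
from itertools import combinations

def ngrams(tokens, n):
    """Contiguous n-grams; n=2 gives bigrams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def gappy_ngrams(tokens, n, max_gap):
    """n-grams that keep token order but may skip up to max_gap tokens in total."""
    result = []
    span = n + max_gap  # widest stretch of text one gappy n-gram may cover
    for start in range(len(tokens) - n + 1):
        window = range(start + 1, min(start + span, len(tokens)))
        # Fix the first token, then choose the remaining n-1 positions in order.
        for rest in combinations(window, n - 1):
            result.append((tokens[start],) + tuple(tokens[i] for i in rest))
    return result

tokens = "data science needs statistics".split()
bigrams = ngrams(tokens, 2)
gappy_bigrams = gappy_ngrams(tokens, 2, max_gap=1)
print(bigrams)
print(gappy_bigrams)
```

With `max_gap=0` the gappy variant reduces to ordinary contiguous n-grams, which is a convenient sanity check.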
 Privacy