Data Science
Summer Term 2015
Data Science (cf. the Wikipedia definition of data science) describes an approach to problems that draws on a set of capabilities not located in any single classic community. Instead, these capabilities cross-breed between disciplines such as physics, biology, social sciences and economics. Data science uses elaborate computer science paradigms and requires a background in statistics. It feeds the new economy, the classical economy, and the medical field.
Data scientists: IT's new rock stars
Lecturing Schedule
Student feedback in summer term 2014 showed that many students lack basic knowledge of probability theory, which others have acquired in high school or during their bachelor studies. To accommodate this feedback, Tuesday, 14.4., 16.15 hrs, and Thursday, 16.4., 12.15 hrs, will be dedicated to an introduction to and refresher of core concepts of probability theory. We answer questions like: What is science, and is there an invisible dwarf on my shoulder? How do I find out if my conversation partner is a frequentist or a Bayesian, and who is more fun to hang out with? What is a null hypothesis and why do we need it? Be prepared.
Extra appointment: 22.4.2015 at 16.15 hrs in room B013. A guest talk motivating data science will be given by Dr. Christoph Tempich (Head of Consulting, Innovex). Content of this talk will be relevant for the exam.
Team registration: Please register for your exercise team under https://ist.uni-koblenz.de/teams/en/user/registration/tlpo0skqma
Exam: 06.08., 12-14, E 113
Second exam: 22.10., 12-14, E 413
Date | Time slot | Topic | Room | Lecturer | Slides |
14.4. | 16.15-17.45 | Introduction to probability theory - part 1 | G309 | Christoph Kling | slides |
16.4. | 12.15-13.45 | Introduction to probability theory - part 2 | K208 | Christoph Kling | slides |
21.4. | 16.15-17.45 | Lecture 1 | G309 | Steffen Staab | 1st+2nd lecture slides |
22.4. | 16.15-17.45 | Extra Appointment | B013 | Dr. Christoph Tempich | Guest lecture |
23.4. | 12.15-13.45 | Tutorial about probability theory | K208 | Christoph Kling | allbus slides exercise01 |
28.4. | 16.15-17.45 | Tutorial 1 | G309 | Christoph Kling | exercise02 slides |
30.4. | 12.15-13.45 | Lecture 2 | B017 | Steffen Staab | |
5.5. | 16.15-17.45 | Lecture 3 | G309 | Steffen Staab | 3rd lecture slides |
7.5. | 12.15-13.45 | Tutorial 2 | B017 | Christoph Kling | slides exercise |
12.5. | 16.15-17.45 | Lecture 4 | G309 | Steffen Staab | |
14.5. | | Ascension Day | | | |
19.5. | 16.15-17.45 | Tutorial 3 | G309 | Christoph Kling | slides |
21.5. | 12.15-13.45 | Tutorial 4 | B017 | Christoph Kling | slides exercise |
25.5.-29.5. | | Whitsun Break | | | |
2.6. | 16.15-17.45 | Lecture 5 | G309 | Steffen Staab | 3rd lecture slides with minor modifications from slides 64 onwards |
4.6. | | Corpus Christi | | | |
9.6. | 16.15-17.45 | Lecture 6 | G309 | Christoph Kling | slides |
11.6. | 12.15-13.45 | Tutorial 5 | B017 | Christoph Kling | tutorial MLE_LinReg |
16.6. | 16.15-17.45 | Tutorial 6 | G309 | Christoph Kling | exercise tutorial |
18.6. | 12.15-13.45 | Lecture 7 | B017 | Steffen Staab | |
23.6. | 16.15-17.45 | Lecture 8 | G309 | Steffen Staab | slides |
25.6. | 12.15-13.45 | Tutorial 7 | B017 | Christoph Kling | tutorial |
30.6. | 16.15-17.45 | Tutorial 8 | G309 | Christoph Kling | code + exercise |
2.7. | 12.15-13.45 | Lecture 9 | B017 | Steffen Staab | slides |
7.7. | 16.15-17.45 | Tutorial 9 | G309 | Christoph Kling | tutorial exercise |
9.7. | 12.15-13.45 | Lecture 10 | B017 | Steffen Staab | slides on Scalable Infrastructures, updated July 10 |
14.7. | 16.15-17.45 | Tutorial 10 | G309 | Christoph Kling | exercise Kling slides |
16.7. | 12.15-13.45 | Lecture 11 | B017 | Steffen Staab | slides on Algebraic modelling |
21.7. | 16.15-17.45 | Tutorial 11 | G309 | Christoph Kling | |
23.7. | 12.15-13.45 | Q&A | B017 | Steffen Staab & Christoph Kling | |
More information coming soon!
Course number: 04232
Lecturer(s) | Christoph Kling, Steffen Staab |
Date(s) | Tue 16.00-18.00, G 309, KO Building G |
Lecturer: Prof. Steffen Staab
Tutor: Christoph Kling
Format: Lecture + Practical Programming Exercises
Exercises:
- Programming with Octave or R
- Programming with Hadoop or Spark
  - e.g. generate a word cloud with Hadoop
- Programming with GraphLab
- Working with visualizations
  - Histograms, e.g. Zipf distribution of words
- Working with TwitteR (not the same as Twitter!)
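The histogram exercise on the Zipf distribution of words can be sketched as follows. This is only a minimal illustration in Python (the exercises themselves use Octave or R); the function names `rank_frequency` and `text_histogram` and the sample sentence are assumptions made for this sketch and are not part of the course material. It counts words, ranks them by frequency, and prints one text bar per word.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def text_histogram(table, width=40):
    """Render one bar per word, scaled relative to the most frequent word."""
    if not table:
        return []
    top = table[0][2]
    lines = []
    for rank, word, count in table:
        bar = "#" * max(1, round(width * count / top))
        lines.append(f"{rank:>3} {word:<10} {count:>4} {bar}")
    return lines


if __name__ == "__main__":
    # Hypothetical sample text; a real exercise would read a larger corpus.
    sample = "the cat and the dog and the bird saw the cat"
    for line in text_histogram(rank_frequency(sample)):
        print(line)
```

On a large natural-language corpus, plotting count against rank on log-log axes would typically give a roughly straight line, which is the pattern Zipf's law describes.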
Required background knowledge
- Must have: Capability to program; several university courses in mathematics
- Should have:
  - Basic statistics (or willingness to acquire it during the course)
  - C++ programming (or be eager to acquire it if you are already a Java expert)
Read [1] (the first entry under Literature) for a good introduction to what data science is about.
Exercises - General Rules
The exercises will be done in groups of two students. To be admitted to the exam, solutions for all but two exercises have to be submitted. For this, each group will get its own SVN repository.
Every group has to present one exercise, which can be chosen in the tutorial one week before. If no one volunteers, groups are chosen at random. All group members have to present a part of the exercise.
The chosen presentation appointment is mandatory. If a group member is absent, he/she has to be excused (e.g., by a medical certificate). Otherwise, the missing person will not be allowed to participate in the exam.
Practical Programming Exercises
During the practical programming exercises, some Apache Hadoop programs have to be written and executed. If you want to execute these programs on your local computer, you have to install and configure Hadoop locally.
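Before installing Hadoop, the MapReduce idea behind these programs can be tried out in plain Python. The sketch below does not use Hadoop or any of its APIs; it only imitates the map, shuffle, and reduce phases in memory for a word count. All function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts)
                for word, counts in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data big ideas", "data science"]))
    # prints {'big': 2, 'data': 2, 'ideas': 1, 'science': 1}
```

In a real Hadoop job, the mapper and reducer run distributed over many machines and the framework performs the shuffle; here everything runs in one process for clarity.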
Literature
- Vasant Dhar. Data Science and Prediction. In: Communications of the ACM, December 2013, Vol. 56, No. 12, pp. 64-73
- Anand Rajaraman, Jeffrey Ullman, Jure Leskovec, Mining of Massive Datasets, Cambridge University Press (free download)
- Jeffrey Stanton, Introduction to Data Science (free download)
- John Hopcroft. Foundations of Data Science.
- Stephen Wolfram. A New Kind of Science. http://www.wolframscience.com/thebook.html
- Alon Halevy, Peter Norvig, Fernando Pereira. The unreasonable effectiveness of data. In: IEEE Intelligent Systems, March/April 2009.
Outdated List of Topics
- Data science: history and background, change of paradigm from statistics to programming
- Problem scenarios: Our problem scenarios will mostly involve open data, such as found on the Web and open statistical data (such as provided by governments etc.), e.g.:
- EU Open Data Portal
- eLisa
- Linked Open Data
- Medical data analysis
- Psycholinguistics (e.g. Beatles.pdf, SuicidalPoets.pdf)
- Background in statistics (cf. Introduction to Web Science): Here, we will go into more detail on computing statistics and determining the quality of a probabilistic model. In particular, we will look at a whole set of distributions:
- Uniform distribution
- Normal distribution
- Exponential distribution
- Power law distribution
- Poisson distribution
- Log normal distribution
And we will look at quality measures such as:
- Student's t-test (valid only for normal distributions)
- Chi square
- ANOVA
- Kullback-Leibler and Jensen-Shannon
- Kolmogorov-Smirnov
- Hypothesis driven research
- Hypothesis testing
- Statistics fallacies (The theory of the stork)
- Applications, e.g. Web portal promotions, which do not work everywhere (Jure Leskovec, Bernardo Huberman, Lada Adamic. The dynamics of viral marketing, Proc. of ACM EC 2006)
- Programming paradigms
- Relational and NoSQL Database Management Systems
- Parallel task processing: GridGain
- MapReduce (Hadoop/Spark)
- Graph Paradigms (GraphLab, Neo4j, RDF Databases)
- Homomorphic machine learning
- Visualization
- Power of visualization: TED Talk by Hans Rosling
- Simple machine learning on large scale data
- Example application domain: text
- bi-grams
- n-grams
- generalized n-grams (gappy n-grams)
- Privacy
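The n-gram variants listed under the text application domain can be illustrated with a short Python sketch. The definition of gappy n-grams used here is an assumption for illustration only: tokens keep their order, and at most `max_gap` tokens in total may be skipped between the first and last token. The function names `ngrams` and `gappy_ngrams` are hypothetical.

```python
from itertools import combinations


def ngrams(tokens, n):
    """Contiguous n-grams: every run of n neighbouring tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def gappy_ngrams(tokens, n, max_gap):
    """Ordered n-grams that may skip up to max_gap tokens in total.

    Assumed definition for this sketch: the first token is fixed at a
    start position, and the remaining n - 1 tokens are picked in order
    from the next n - 1 + max_gap positions.
    """
    result = []
    for start in range(len(tokens)):
        window_end = min(len(tokens), start + n + max_gap)
        for rest in combinations(range(start + 1, window_end), n - 1):
            result.append(tuple(tokens[i] for i in (start,) + rest))
    return result


if __name__ == "__main__":
    tokens = "we study data science".split()
    print(ngrams(tokens, 2))           # bi-grams
    print(gappy_ngrams(tokens, 2, 1))  # bi-grams with at most one skipped token
```

With `max_gap=0`, `gappy_ngrams` yields the same tuples as `ngrams`, so the contiguous case is a special case of the gappy one under this definition.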