Institute for Web Science and Technologies · Universität Koblenz - Landau

Data Science

Summer Term 2014

Data scientists: IT's new rock stars

Data science (cf. the Wikipedia definition of data science) describes an approach to problems that draws on a set of capabilities not located in any single classical community. Instead, these capabilities cross-breed between disciplines such as physics, biology, the social sciences and economics. Data science uses elaborate computer science paradigms and requires a background in statistics. It feeds the new economy, the classical economy and the medical field.


  1. Data science:
    history and background, change of paradigm from statistics to programming
  2. Problem scenarios:
    Our problem scenarios will mostly draw on open data, such as data found on the Web and open statistical data (e.g. as provided by governments), for example:
    1. EU Open Data Portal
    2. eLisa
    3. Linked Open Data
    4. Medical data analysis
    5. Psycholinguistics (e.g. Beatles.pdf, SuicidalPoets.pdf)
  3. Background in statistics 
    (cf. Introduction to Web Science) Here we will go into more detail on computing statistics and determining the quality of a probabilistic model. In particular, we will look at a range of distributions:
    • Uniform distribution
    • Normal distribution
    • Exponential distribution
    • Power law distribution
    • Poisson distribution
    • Log-normal distribution

    And we will look at quality measures such as:

    • Student's t-test (valid only for normally distributed data)
    • Chi-square
    • ANOVA
    • Kullback-Leibler and Jensen-Shannon divergence
    • Kolmogorov-Smirnov
  4. Hypothesis driven research
    1. Hypothesis testing
    2. Statistical fallacies (the theory of the stork)
    3. Applications, e.g. Web portal promotions, which do not work everywhere (Jure Leskovec, Bernardo Huberman, Lada Adamic. The dynamics of viral marketing. Proc. of ACM EC 2006)
  5. Programming paradigms
    1. Relational and NoSQL Database Management Systems
    2. Parallel task processing: GridGain
    3. MapReduce (Hadoop/Spark)
    4. Graph paradigms (GraphLab, Neo4j, RDF databases)
    5. Homomorphic machine learning
  6. Visualization
    1. Power of visualization: TED Talk by Hans Rosling 
  7. Simple machine learning on large scale data 
  8. Example application domain: text
    • bigrams
    • n-grams
    • generalized n-grams (gappy n-grams)
  9. Privacy
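
As a rough illustration of how topics 3 and 8 fit together, the following minimal Python sketch extracts n-grams from two short token sequences and compares their bigram distributions with the Jensen-Shannon divergence. It is not course material: the function names, the toy sentences and the reading of "gappy n-grams" as token pairs with up to max_gap skipped tokens in between are assumptions made for this example only.

```python
from collections import Counter
from math import log2


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def gappy_bigrams(tokens, max_gap):
    """Return token pairs with 0..max_gap skipped tokens between them.

    Assumption: this is one simple reading of "gappy n-grams" (n = 2).
    """
    pairs = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


def distribution(items):
    """Turn a list of items into relative frequencies."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    q must assign non-zero probability to every item in p.
    """
    return sum(pv * log2(pv / q[k]) for k, pv in p.items() if pv > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (symmetric, between 0 and 1)."""
    keys = set(p) | set(q)
    # Mixture distribution: covers the support of both p and q.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy data, invented for this example.
text_a = "the cat sat on the mat".split()
text_b = "the dog sat on the log".split()

bigrams_a = distribution(ngrams(text_a, 2))
bigrams_b = distribution(ngrams(text_b, 2))

print(ngrams(text_a, 2))
print(gappy_bigrams(text_a[:3], max_gap=1))
print(js_divergence(bigrams_a, bigrams_b))
```

The Jensen-Shannon divergence is used here instead of the plain Kullback-Leibler divergence because the two texts do not share all bigrams, and Kullback-Leibler is undefined when the second distribution assigns zero probability to an item of the first.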