Institute for Web Science and Technologies · Universität Koblenz

Research Lab / Project Lab (Forschungspraktikum/Projektpraktikum) "Machine Learning Application"

Summer Term 2020

In this research lab, you will build a complete machine learning system that follows the generic pipeline in order to solve a specific problem. For each phase of this pipeline, you will apply the methods and techniques taught in the Machine Learning and Data Mining course; completing that course is therefore mandatory. Other fundamental approaches [1] will also be used where necessary, including more sophisticated and modified state-of-the-art approaches.

Important Information

For whom?

Master and Bachelor students in:

  • Web Science
  • Computer Science
  • Computational Visualistics
  • Business Informatics

Kick-off / Introductory meeting

  • When: February 20 at 14:00 (your presence is mandatory)
  • Where: B 016
  • Slides

How to register?

  • Form a group of four people to work on a topic
  • Give a name to your group
  • Send (one) email to boukhers@uni-koblenz.de with the subject: MLA registration request (group) before the kick-off
  • In the email, list the topics in order of preference, starting with the most preferred one.
  • Attend the introductory meeting
  • After the topic is assigned, write a proposal (up to two pages), describing your potential solution.
  • Register for the exam

Important note: If you cannot form a group, you may still take part in the research lab. However, you will have to work with other people who could not form a group either. Please send an email to boukhers@uni-koblenz.de with the subject: MLA registration request (individual)

Exam

  • When:----
  • Where:----
  • Type: Presentation + Report + Software
  • Registration (Klips): Open from ---- to ---- (do not miss the deadline!)
  • Cancellation (Klips): Until ----

Topics

Topic 1

  • Title: Paragraph segmentation
  • Main advisor: Zeyd Boukhers
  • Description: In this topic, you will build a machine learning system that recognizes paragraphs in text lines extracted from PDF documents. The PDF-to-text extractor provides the content of each line independently. Each line is associated with some features (e.g. length, position, width, etc.). The available dataset is not labelled. Therefore, you will need to label some documents in order to build a supervised or semi-supervised model, or you can apply an unsupervised model instead. Moreover, it is necessary to remove noise artifacts, such as page numbers, from the documents. More details will be provided in the introductory meeting.
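
To make the task more concrete, here is a minimal rule-based sketch of paragraph grouping and page-number removal. It is only an illustration, not the method expected in the lab: the `Line` fields (`top`, `left`, `height`), the thresholds, and the page-number pattern are all assumptions, and the real extractor may expose different features.

```python
import re
from dataclasses import dataclass

@dataclass
class Line:
    text: str
    top: float     # vertical position of the line on the page (assumed feature)
    left: float    # horizontal start of the line (assumed feature)
    height: float  # height of the line (assumed feature)

# Assumed noise pattern: a line holding only a number, optionally preceded by "page".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)

def is_noise(line):
    """Return True for lines that look like bare page numbers."""
    return PAGE_NUMBER.match(line.text) is not None

def segment_paragraphs(lines, gap_factor=0.8, indent_tol=5.0):
    """Group consecutive lines into paragraphs.

    A new paragraph starts when the vertical gap to the previous line is
    large relative to the line height, or when the line is indented
    further to the right than the previous one.
    """
    paragraphs, current = [], []
    for line in lines:
        if is_noise(line):
            continue  # drop page numbers before grouping
        if current:
            prev = current[-1]
            gap = line.top - (prev.top + prev.height)
            large_gap = gap > gap_factor * prev.height
            indented = (line.left - prev.left) > indent_tol
            if large_gap or indented:
                paragraphs.append(current)
                current = []
        current.append(line)
    if current:
        paragraphs.append(current)
    return paragraphs

example_lines = [
    Line("First line of paragraph one", 100, 50, 10),
    Line("second line of paragraph one", 112, 50, 10),
    Line("3", 130, 300, 10),
    Line("Paragraph two starts here", 150, 50, 10),
    Line("and continues on this line", 162, 50, 10),
]
example_paragraphs = segment_paragraphs(example_lines)
```

A heuristic like this could serve as an unsupervised baseline or as a way to pre-label documents before manual correction.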

Topic 2

  • Title: Metadata extraction from German scientific papers
  • Main advisor: Zeyd Boukhers
  • Description: In this topic, you will build a model that extracts metadata, such as the title, author names, institute and abstract, from scientific papers in PDF format. You will create a labelled dataset, which is not expensive for this kind of task. The model has to handle different templates and different font types. The input of the model is a PDF document and its output is the metadata.

Topic 3

  • Title: Optimising online documents for fact-checking
  • Main advisor: Ipek Baris
  • Description: In this topic, you will implement a web application for optimising fact-checking. The application first checks whether a given URL has already been fact-checked by fact-checking organisations. If it has not, the text mining module of the system evaluates the full article behind the URL. Finally, the system ranks the URL against other URLs that have not yet been fact-checked. The baseline for the overall system is ClaimPortal [2], and the baseline for the text mining module is the online nutrition label extractor [3]. You are expected to implement your own novel method for at least three categories of your choice from the online nutrition label.

Topic 4

  • Title: False Article Detection with Weakly Supervised Learning
  • Main advisor: Ipek Baris
  • Description: This topic aims to predict whether a full-text article is fake or not. You will investigate weakly supervised learning methods, which are popular in computer vision, and apply them to a natural language processing task. You will use the datasets and methodology described in [4] as the baseline.

Topic 5

  • Title: Entity recognition and linkage for reference data
  • Main advisor: Zeyd Boukhers
  • Description: The goal of this topic is to link the attributes of a reference (i.e. author, journal, publication, etc.) to their entities in knowledge bases (e.g. Wikidata, ORCID, etc.). For this, you will create a synthetic dataset from DBLP and/or Crossref consisting of erroneous and incomplete reference strings. The challenge of this topic is that many entities might match a single attribute. Therefore, you will need to use the other attributes to retrieve only the correct entity. The input of the developed model will be a reference string. With the help of a parsing API, the output should be the identifiers (links to the entities) of every attribute.
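
The idea of using the other attributes to disambiguate can be sketched as follows. This is a toy illustration under stated assumptions, not the expected solution: the candidate records, their `id`, `name` and `venues` fields, and the scoring rule are hypothetical, and no real knowledge-base API is called. String similarity uses Python's standard `difflib`.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link_author(author, journal, candidates):
    """Return the id of the candidate that best fits the author name
    and the journal parsed from the same reference string.

    Each candidate is a dict with hypothetical keys:
    'id', 'name' and 'venues' (venues the entity is known to publish in).
    """
    def score(candidate):
        name_score = similarity(author, candidate["name"])
        venue_score = max(
            (similarity(journal, v) for v in candidate["venues"]),
            default=0.0,
        )
        return name_score + venue_score
    return max(candidates, key=score)["id"]

# Two made-up entities share the same name; only the journal tells them apart.
example_candidates = [
    {"id": "candidate-A", "name": "J. Smith",
     "venues": ["Journal of Machine Learning Research"]},
    {"id": "candidate-B", "name": "J. Smith",
     "venues": ["Annals of Botany"]},
]
linked_id = link_author("J. Smith", "Journal of Machine Learning Res.", example_candidates)
```

Here the abbreviated journal string stands in for an erroneous or incomplete reference; a learned model would replace the hand-written score.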

Topic 6

  • Title: PAN task on profiling fake news spreaders on Twitter
  • Main advisor: Ipek Baris
  • Description: In this topic, you will perform author profiling on Twitter users, distinguishing those who spread fake news from those who do not. You will investigate user characteristics (age, gender, writing style, timeline analysis, etc.) and train a model that automatically detects fake news spreaders before they spread fake news. This topic is part of the PAN shared task; for more information about the data, please see [5].

Templates

For your final reports, please use this template. Reports should be at most 8 pages long. You can put detailed information in the Appendix section. Please read the following tips.

References

[1] Sergios Theodoridis and Konstantinos Koutroumbas. 2008. Pattern Recognition, Fourth Edition (4th ed.). Academic Press, Inc., Orlando, FL, USA. (More than 10 copies are available in the library)

[2] Majithia et al. 2019. ClaimPortal. ACL.

[3] Fuhr et al. 2018. An Information Nutritional Label for Online Documents. ACM SIGIR.

[4] https://github.com/isspek/weakly_misinformation_learning/tree/master/documents

[5] https://pan.webis.de/clef20/pan20-web/author-profiling.html

Lecturers

  • contact@boukhers.com
  • Alumnus
  • B 104
  • +49 261 287-2765