Institute for Web Science and Technologies · Universität Koblenz - Landau

Predicting Foreign Users from English Conversations on Social Media

Alexander Winkens

Alexander Winkens will defend his bachelor's thesis, “Predicting Foreign Users from English Conversations on Social Media”. The talk is open to members of the university. Due to the current situation, everyone who wants to attend must register by e-mail to ibaris@uni-koblenz.de by 19th August, so that we know who will attend and how many people to expect. See the university's official statement for information on how to behave on campus in the current situation: https://www.uni-koblenz-landau.de/de/coronavirus

Social media platforms such as Twitter or Reddit give users almost unrestricted freedom to publish their opinions on recent events or to discuss trending topics. While the majority of users use these platforms in good faith, some groups are intent on spreading misinformation and on influencing or manipulating public opinion. These groups disguise themselves as native users from various countries in order to spread articles that are often fabricated and strongly polarizing political opinions, and they may become sources of hate speech or extreme political positions. This thesis aims to implement an AutoML pipeline for identifying second-language speakers from English social media texts. We investigate stylistic differences between texts on different topics and across the platforms Reddit and Twitter, and analyse linguistic features. We apply feature-based models to two datasets: a Reddit dataset consisting mostly of English conversations among European users, and a newly created Twitter dataset of English tweets collected from selected trending topics in different countries. For a given text input, the pipeline classifies the language family, the native language and the origin (native or non-native English speaker). We evaluate the resulting classifications by comparing the prediction accuracy, precision and F1 scores of our classification pipeline with those of traditional machine learning approaches. Lastly, we compare the results from each dataset and find differences in language use across topics and platforms. We obtained high prediction accuracy for all categories on the Twitter dataset and observed high variance in features such as average text length, especially for Balto-Slavic countries.
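For readers unfamiliar with feature-based text classification, the following is a minimal sketch of the two ingredients the abstract mentions: style features computed from a user's texts, and the evaluation scores. It is not the thesis pipeline. The function names, the feature selection beyond average text length, and the example labels are assumptions chosen for illustration only.

```python
from typing import Dict, List


def stylometric_features(texts: List[str]) -> Dict[str, float]:
    """Compute a few simple style features over one user's texts.

    Hypothetical feature set for illustration; only average text length
    is named in the abstract.
    """
    # Whitespace tokenization, lowercased; a real pipeline would use a proper tokenizer.
    tokens = [tok.lower() for text in texts for tok in text.split()]
    n_texts = len(texts)
    n_tokens = len(tokens)
    return {
        # Average number of tokens per text.
        "avg_text_length": n_tokens / n_texts if n_texts else 0.0,
        # Average number of characters per token.
        "avg_word_length": sum(len(t) for t in tokens) / n_tokens if n_tokens else 0.0,
        # Share of distinct tokens, a rough measure of vocabulary richness.
        "type_token_ratio": len(set(tokens)) / n_tokens if n_tokens else 0.0,
    }


def binary_scores(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class (label 1).

    Here label 1 is assumed to mean "non-native speaker" and 0 "native speaker".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    print(stylometric_features(["I am here", "You are there now"]))
    print(binary_scores([1, 1, 0, 0], [1, 0, 0, 1]))
```

In such a setup the feature dictionaries would be fed to a classifier, and the scores would be computed separately for each task (language family, native language, origin) and for each dataset.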


21.08.20 - 13:00
E 313