Transfer Learning from English to a Low-resourced Language for Detecting Hate Speech on the Web
Open Master Thesis - Contact a supervisor for more details!
Some democracies tend to decline. Post-authoritarian societies are not fully censored, but they are divided by controversy over their political leaders because of unresolved histories of oppression. This pattern is discussed mostly in democratization studies (in political science) and in research on political polarization (in political science and computational social science, CSS). However, the link between the two fields is overlooked, even though nearly all young democracies show interlinked patterns. For example, partisan citizens evaluate current affairs based on their sympathy or hatred toward the current president. Relatively anonymous online debates give insight into the argument structures that link political leaders with hate speech. A possible hypothesis for studying this link is: news articles about the President will attract hate speech about Syrian immigrants because partisans want to attack the state leader by attacking current affairs.
Eksisozluk is the largest Turkish online community and combines a collaborative dictionary with user networking. In Eksisozluk, a user initiates a thread, which might be about breaking news, and other users comment on it. Technically, the platform poses a challenge for low-resource-language and language-blind analysis. Turkish is a low-resource language compared to English, and the field's concentration on English data means that diverse types of democratic implications remain overlooked.
The aims of the thesis are as follows: a) Write a survey of the link between young democracies and political polarization, drawing on the fields of study above, and derive your hypothesis. b) Apply transfer learning from English to Turkish for various NLP tasks, including hate speech detection, thus providing a potential blueprint for similar tasks in the future.
First, the student collects false and true news items from a Turkish fact-checking website, together with the corresponding conversational threads from Eksisozluk. The student will then annotate this corpus using the Hatebase API.
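The annotation step could be sketched roughly as follows. This is only an illustration of lexicon-based weak labeling, not the procedure the thesis prescribes. The names `HATE_LEXICON`, `tokenize`, and `annotate` are hypothetical, the lexicon entries are neutral placeholders, and retrieving the actual terms from Hatebase is deliberately not shown.

```python
import re

# Hypothetical placeholder lexicon. In practice the terms would be retrieved
# from a resource such as Hatebase; that retrieval step is not shown here.
HATE_LEXICON = {"slur_a", "slur_b"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def annotate(entry, lexicon=HATE_LEXICON):
    """Attach a weak label: 1 if any lexicon term occurs in the entry, else 0."""
    hits = sorted(set(tokenize(entry)) & set(lexicon))
    return {"text": entry, "matched_terms": hits, "label": int(bool(hits))}


# Made-up example entries standing in for comments in a thread.
thread = ["Bu bir slur_a örneği", "Merhaba dünya"]
annotated = [annotate(entry) for entry in thread]
```

Labels obtained this way are noisy, since a lexicon match does not by itself establish hateful intent, so a manual check of a sample would probably be needed. Note also that Python's `str.lower()` does not apply Turkish casing rules for the dotted and dotless "i", so a Turkish-aware normalization step may be required.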
Second, the student will represent texts with cross-lingual word embeddings [Ruder, 2017] and feed them into a neural hate speech classifier, in order to transfer knowledge learned from a high-resource language (English) to a low-resource language (Turkish). To compute the cross-lingual embeddings, the student first computes monolingual embeddings for Turkish and English and then maps them into a shared space [Gouws, 2015]. The student evaluates the system on the HatEval 2019 and Eksisozluk corpora and compares it with baseline and state-of-the-art models on the hate speech detection task.
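To make the mapping step concrete, here is a minimal sketch of one common mapping-based approach, orthogonal Procrustes alignment. It is a swapped-in illustration and not necessarily the method the thesis would use; the cited BilBOWA model, for instance, works differently. The sketch assumes NumPy is available and uses synthetic vectors in place of real Turkish and English embeddings; `learn_orthogonal_map`, `nearest_neighbors`, and the seed dictionary size are all hypothetical choices.

```python
import numpy as np


def learn_orthogonal_map(source, target):
    """Orthogonal W minimizing ||source @ W - target||_F (orthogonal Procrustes)."""
    # SVD of the cross-covariance between the paired source and target rows.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def nearest_neighbors(queries, candidates):
    """For each query row, index of the most cosine-similar candidate row."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argmax(q @ c.T, axis=1)


# Synthetic stand-ins for monolingual embeddings (assumption: row i of each
# matrix represents the same concept in both languages).
rng = np.random.default_rng(0)
dim, n_words, n_seed = 20, 50, 30
en_vecs = rng.normal(size=(n_words, dim))
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
tr_vecs = en_vecs @ hidden_rotation + 0.01 * rng.normal(size=(n_words, dim))

# Learn the map from the first n_seed pairs only (the "seed dictionary").
W = learn_orthogonal_map(tr_vecs[:n_seed], en_vecs[:n_seed])

# Project all Turkish vectors into the English space, then check whether the
# held-out words land closest to their English counterparts.
mapped = tr_vecs @ W
predicted = nearest_neighbors(mapped[n_seed:], en_vecs)
accuracy = float(np.mean(predicted == np.arange(n_seed, n_words)))
```

Once both languages share one space, a classifier trained on English text representations can be applied to mapped Turkish representations without Turkish training labels. Real embeddings are far less cleanly related than this toy rotation, so retrieval accuracy on actual data would be expected to be lower.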
[Ruder, 2017] Ruder, Sebastian, Ivan Vulić, and Anders Søgaard. "A survey of cross-lingual word embedding models." arXiv preprint arXiv:1706.04902 (2017).
[Gouws, 2015] Gouws, Stephan, Yoshua Bengio, and Greg Corrado. "BilBOWA: Fast bilingual distributed representations without word alignments." (2015).
[Iyengar, 2015] Iyengar, Shanto, and Sean J. Westwood. "Fear and loathing across party lines: New evidence on group polarization." American Journal of Political Science 59.3 (2015): 690-707.