Unified Word Embedding for Code Switching Data
The impact of intercultural exposure can be observed in multilingual societies, where bilingual speakers often mix the grammar and lexicon of more than one language. This phenomenon is an inevitable outcome of language contact, and the switch can occur between sentences (inter-sentential), within a sentence (intra-sentential), or even at the word level.
Recent research in code-switching (CS) has turned towards neural networks and word embeddings for these low-resourced languages. The standard approach for a given CS corpus combines information from existing distributional representations pre-trained on the source languages with neural network models, with varying degrees of success.
This thesis examines the applicability of existing monolingual word representations to code-switching tasks. It hypothesizes that leveraging the inherent structure of code-mixed language benefits downstream tasks. Additionally, it exploits the unique morphology of CS data, which extends and merges the structure of the source languages, as training information. To this end, two models were developed: a joint cross-lingual model and a sub-word level embedding model, capturing information at different granularities. The results were compared against standard word2vec and fastText embeddings. The models had access to an unlabeled code-switching text corpus and used language modeling as the objective for learning the weights.
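To illustrate the sub-word granularity that fastText-style models build on, the sketch below decomposes a word into boundary-marked character n-grams; a sub-word embedding model would then represent a word as the sum of the vectors of these n-grams. The function name and the n-gram range are illustrative, not taken from the thesis itself.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, fastText-style.

    The word is wrapped in boundary symbols '<' and '>' so that
    prefixes and suffixes are distinguished from word-internal
    substrings (e.g. '<ch' vs. 'cha' for the word 'chai').
    """
    token = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.append(token[i:i + n])
    return grams


# For code-mixed text, even an out-of-vocabulary mixed form shares
# n-grams with known words of both source languages:
print(char_ngrams("chai", n_min=3, n_max=4))
```

Because any string can be decomposed this way, such a model can produce vectors for unseen mixed forms, which is one motivation for sub-word representations on CS data.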
The findings demonstrate an improvement in bilingual lexicon induction, word similarity (intrinsic), and opinion mining (extrinsic) evaluation tasks. Experiments show that the sub-word embeddings outperform the baseline and the joint model on domain-independent tasks, whereas the joint model captures cross-lingual information better. This revised view of code-switching languages warrants further investigation and opens a new avenue for potential research.