Coping with Limited Training Data in Verb Phrase Ellipsis Detection using Machine Learning
In both conversation and writing, grammar allows us to leave parts of a sentence unexpressed when they are overtly expressed in the preceding linguistic context. For instance, in the sentence /I wanted to play football but I couldn’t/, the phrase /play football/ can be dropped after /couldn’t/ because it can be understood from the context. In linguistics, this phenomenon is known as verb phrase (VP) ellipsis. Detecting and resolving ellipsis contributes to understanding text properly, which could help improve language understanding systems. Since the phenomenon is optional, the challenge was to find a way to systematically distinguish auxiliaries and modals that indicate VP ellipsis from those that do not. We modeled this as a binary classification problem and proposed a feature-efficient, machine-learning-based approach, which yielded an improvement of 4.06% in F1-score over state-of-the-art results on the VP ellipsis detection task. Although machine-learning-based models can be used to detect VP ellipsis, they require a significant amount of annotated training data. Since VP ellipsis is a rather rare phenomenon, it is hard to collect naturally occurring data on a sufficient scale. To cope with this problem, we presented an active-learning-based approach that combines a small amount of annotated data with a large amount of unlabeled data to produce a better classification model. Results obtained with the active-learning-based model showed that similar performance on the test data could be achieved while saving up to 80% of the annotation effort.
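The active learning idea described above can be illustrated with a minimal sketch of pool-based active learning. It is not the implementation used in this work: the query strategy (uncertainty sampling), the tiny logistic regression, the synthetic two-dimensional features, and all names (`train_logreg`, `predict_proba`, `active_learning`, `oracle`) are assumptions chosen for illustration. Each point in the pool stands in for the feature vector of one auxiliary or modal occurrence, and `oracle` stands in for a human annotator who decides whether that occurrence marks a VP ellipsis.

```python
import math
import random


def predict_proba(w, b, x):
    """Probability of the positive class under a logistic model."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.5):
    """Fit a tiny logistic regression with batch gradient descent."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def active_learning(pool, oracle, seed_idx, rounds, batch_size):
    """Pool-based active learning with uncertainty sampling (assumed strategy)."""
    # Start from a small annotated seed set.
    labeled = {i: oracle(pool[i]) for i in seed_idx}
    for _ in range(rounds):
        idx = list(labeled)
        w, b = train_logreg([pool[i] for i in idx], [labeled[i] for i in idx])
        # Rank unlabeled items by how close the model is to a coin flip.
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        unlabeled.sort(key=lambda i: abs(predict_proba(w, b, pool[i]) - 0.5))
        # Ask the annotator only about the most uncertain items.
        for i in unlabeled[:batch_size]:
            labeled[i] = oracle(pool[i])
    idx = list(labeled)
    model = train_logreg([pool[i] for i in idx], [labeled[i] for i in idx])
    return model, labeled


# Synthetic stand-in data: 200 unlabeled "candidates" with two features each.
rng = random.Random(0)
pool = [(rng.random(), rng.random()) for _ in range(200)]


def oracle(x):
    """Hypothetical annotator: label 1 ("ellipsis") if the features sum above 1."""
    return 1 if x[0] + x[1] > 1.0 else 0


model, labeled = active_learning(pool, oracle, seed_idx=[0, 1, 2, 3],
                                 rounds=5, batch_size=4)
print(f"annotated {len(labeled)} of {len(pool)} candidates")
```

In this sketch only the items the current model is least sure about are sent for annotation, which is the mechanism by which an active learner can reach comparable performance with far fewer labels than annotating the whole pool.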
23.08.18 - 10:15