Evaluation of Model and Hyperparameter Choices in word2vec
When natural language is processed on a computer, vector representations of words have many fields of application. They enable a computer to score the similarity between two words and to determine the missing word in an analogy. In 2013, [Mikolov et al., 2013a] published an algorithm called word2vec for creating such vector representations. The representations it produced far exceeded the performance of earlier methods. This thesis explains the algorithm’s popular skip-gram model from the perspective of neural networks. Furthermore, a literature study is performed that examines the literature qualitatively and quantitatively to determine how word embeddings can and should be evaluated and to show the difference between proposed and actually applied evaluation methods. The thesis thus provides insights that enable us to identify and suggest best practices for the evaluation of these word vector representations. We identify three task categories as the most important for evaluating word representations: the similarity task, the analogy task, and tasks based on a downstream machine learning algorithm that uses word vectors as its input representation. In addition, the thesis shows which data sets are used for the evaluation of the similarity and analogy tasks and further breaks down the downstream machine learning tasks used. The identified best practices are then applied in our own experiments, which evaluate the effects of some small model and hyperparameter changes to the word2vec algorithm. The experimental results reveal that word representations for very rare words are often of poor quality and suggest vocabulary sizes from 100,000 to 1,000,000 as reasonable choices. The evaluation also shows that embeddings created from a lowercase text corpus perform excellently on the similarity task but very poorly on the analogy task. Our experiments that modify word2vec’s neural network model with the Adam optimizer or with dropout on word2vec’s embedding layer also show weak performance. Finally, we show that with word2vec, rare words in particular benefit from a high learning rate.
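To make the similarity and analogy tasks mentioned above concrete, the following minimal sketch scores word similarity with cosine similarity and solves an analogy by vector arithmetic. It is an illustration only, not the evaluation code used in the thesis: the three-dimensional vectors in `toy_vectors` are invented for this example, and the helper names `cosine_similarity` and `solve_analogy` are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented purely for illustration.
# Real word2vec embeddings are learned from a corpus and are much larger.
toy_vectors = {
    "king":  [0.8, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.9],
    "man":   [0.2, 0.9, 0.1],
    "woman": [0.2, 0.1, 0.9],
    "apple": [-0.7, 0.3, 0.3],
}


def cosine_similarity(u, v):
    """Similarity task: cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors):
    """Analogy task: find d such that 'a is to b as c is to d'.

    The target vector is b - a + c; the answer is the remaining
    vocabulary word whose vector is closest to it by cosine similarity.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates,
               key=lambda w: cosine_similarity(vectors[w], target))


if __name__ == "__main__":
    print(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]))
    print(solve_analogy("man", "king", "woman", toy_vectors))
```

In an actual evaluation, the similarity scores would be compared against human ratings from a similarity data set, and the analogy answers would be checked against the expected words of an analogy data set.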
18.04.19 - 10:15