Hierarchical Reference Disambiguation Using Background Knowledge
Open Master Thesis - Contact a supervisor for more details!
In academic articles, authors cite the work of others using key identifiers such as author names, title and year of publication. Despite this care, the same reference often appears in different forms across documents. This can be due to 1) variant spellings or misspellings of some words, 2) the use of acronyms and initialisms, 3) a different ordering of the information, or 4) missing information. Consequently, many references that denote the same object (e.g. an article or a book) cannot be matched exactly. Increasing the matching tolerance recovers more of these matches, but references may then be matched with irrelevant counterparts as well as relevant ones. There are several reasons to identify the unique references present in a collection of articles, among them: a) linking references to their sources, b) measuring the impact of articles and venues, and c) building a coherent citation network.
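The effect of the matching tolerance can be illustrated with a minimal sketch. It assumes a simple token-set (Jaccard) similarity, which is only one of many possible measures and is not prescribed by this topic; the three reference strings are invented for illustration.

```python
import re


def tokens(reference):
    """Lowercase a reference string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", reference.lower()))


def jaccard(a, b):
    """Token-set overlap of two reference strings, between 0.0 and 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_match(a, b, threshold):
    """Treat two references as the same object if their overlap reaches the threshold."""
    return jaccard(a, b) >= threshold


# Invented examples: ref_a and ref_b denote the same work with a different
# ordering and punctuation; ref_c is a different work by an overlapping author.
ref_a = "J. Smith, A. Jones. Clustering citation strings. 2010."
ref_b = "Smith J., Jones A. (2010) Clustering citation strings"
ref_c = "J. Smith, B. Miller. Clustering web documents. 2010."

strict_same = is_match(ref_a, ref_b, threshold=0.8)       # same work, matched
strict_different = is_match(ref_a, ref_c, threshold=0.8)  # different work, rejected
loose_different = is_match(ref_a, ref_c, threshold=0.3)   # different work, wrongly matched
```

With the strict threshold only the reordered variant is matched; with the loose threshold the unrelated reference is accepted as well, which is the trade-off described above. A token-set measure ignores word order but does not handle misspellings or acronyms, so it covers only part of the variation listed.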
In Natural Language Processing (NLP), disambiguation is the task of identifying individual objects in the presence of homonymy (two objects share the same name) and synonymy (one object can be written in more than one way). To identify unique references, it is crucial first to remove token ambiguity (i.e. to decide whether a token is assigned to the right entity). Author and venue ambiguities are removed next, and individual references are identified last. Ambiguity can be eliminated with different approaches, depending on the problem and on the attributes available in the data. Supervised techniques are employed when the set of unique objects is predefined (e.g. word-sense disambiguation); otherwise, unsupervised techniques are suitable. Constraint-based approaches are also used when rules are known that can guide the clustering.
In this master thesis topic, rules will be used to disambiguate tokens, whereas author and venue disambiguation will exploit a database of author names and venues (e.g. DBLP). Since a predefined set of unique references is not available, unsupervised approaches can be applied to disambiguate the references themselves.
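The three levels can be sketched roughly as follows. This is an illustration under strong assumptions, not the method to be developed in the thesis: `KNOWN_AUTHORS` is a hypothetical in-memory stand-in for a lookup in a database such as DBLP, the records are invented and already split into fields, and a greedy threshold clustering on title similarity stands in for the unsupervised step.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stand-in for background knowledge (e.g. a DBLP lookup):
# maps normalised name variants to one canonical author name.
KNOWN_AUTHORS = {
    "j smith": "John Smith",
    "smith j": "John Smith",
    "john smith": "John Smith",
}


def normalise_name(name):
    """Rule-based token clean-up, then lookup of the canonical author name."""
    key = " ".join(re.findall(r"[a-z]+", name.lower()))
    return KNOWN_AUTHORS.get(key, name.strip())


def title_key(title):
    """Lowercase a title and drop punctuation so that only its tokens remain."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def same_reference(a, b, threshold=0.85):
    """Constraints on author and year, then a fuzzy comparison of the titles."""
    if normalise_name(a["author"]) != normalise_name(b["author"]):
        return False
    if a["year"] != b["year"]:
        return False
    ratio = SequenceMatcher(None, title_key(a["title"]), title_key(b["title"])).ratio()
    return ratio >= threshold


def cluster(records, threshold=0.85):
    """Greedy clustering: each record joins the first cluster whose representative matches."""
    clusters = []
    for record in records:
        for group in clusters:
            if same_reference(group[0], record, threshold):
                group.append(record)
                break
        else:
            clusters.append([record])
    return clusters


# Invented records: three variants of one work and one different work.
records = [
    {"author": "J. Smith", "title": "Clustering citation strings", "year": 2010},
    {"author": "Smith, J.", "title": "Clustering Citation Strings.", "year": 2010},
    {"author": "John Smith", "title": "Clustering citaton strings", "year": 2010},
    {"author": "John Smith", "title": "Parsing web documents", "year": 2010},
]
clusters = cluster(records)
cluster_sizes = [len(group) for group in clusters]
```

Here the author and year checks play the role of constraints that guide the clustering, and no predefined set of unique references is needed. The hard equality on year would fail when that field is missing or wrong, which is one of the cases a real approach would have to handle.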
Keywords: entity disambiguation, entity resolution, entity identification, object distinction, record linkage