A Generic Approach for Reference Extraction from PDF Documents

Extracting and parsing cited references from publications in PDF format is important for ensuring that sources of information are properly acknowledged. However, the way these sources are cited differs from one community to another and from one publication to another. This diversity lies mainly in the indexing style (e.g., one or several reference sections), the components present (e.g., editor, source, URL), and the type of reference (e.g., grey literature, academic literature). To accurately extract and segment different kinds of references, EXCITE proposes a generic approach that combines Random Forest and Conditional Random Fields (CRF) in a single coherent mechanism. Random Forest performs the initial classification of each line in the document, whereas CRF segments the potential reference lines into their basic components. Different combinations of lines are iteratively assessed with a probabilistic approach in order to find the proper combination.
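
The line-combination step can be illustrated with a minimal sketch. It is not EXCITE's implementation: it assumes the lines have already been flagged as reference lines, and it replaces the trained models with a hypothetical scorer, `toy_score`, that returns a rough probability that a candidate string is one complete reference. The function `best_grouping` and the parameter `max_lines` are likewise names introduced here for illustration. The sketch tries every way of joining consecutive lines and keeps the grouping with the highest joint probability.

```python
import math
import re
from typing import Callable, List


def toy_score(candidate: str) -> float:
    """Hypothetical stand-in for a trained model (an assumption, not EXCITE's scorer).

    Returns a rough probability that `candidate` is one complete reference,
    based on three simple cues.
    """
    has_one_year = len(re.findall(r"\b(?:19|20)\d{2}\b", candidate)) == 1
    starts_with_author = bool(re.match(r"[A-Z][a-z]+,", candidate))
    ends_with_period = candidate.rstrip().endswith(".")
    hits = sum([has_one_year, starts_with_author, ends_with_period])
    return {0: 0.05, 1: 0.2, 2: 0.5, 3: 0.95}[hits]


def best_grouping(
    lines: List[str],
    score: Callable[[str], float],
    max_lines: int = 4,
) -> List[str]:
    """Group consecutive lines into reference strings.

    Dynamic programming over all groupings: best[end] holds the highest
    log-probability for the first `end` lines and the start index of the
    last group in that grouping.
    """
    n = len(lines)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_lines), end):
            candidate = " ".join(lines[start:end])
            # Guard against log(0) if a scorer returns exactly zero.
            total = best[start][0] + math.log(max(score(candidate), 1e-12))
            if total > best[end][0]:
                best[end] = (total, start)

    # Walk back through the stored start indices to recover the groups.
    groups = []
    end = n
    while end > 0:
        start = best[end][1]
        groups.append(" ".join(lines[start:end]))
        end = start
    return groups[::-1]


# Illustrative input: three lines as they might come out of a PDF.
lines = [
    "Smith, J. (2010). A study of citation",
    "parsing in grey literature.",
    "Doe, A. (2015). Reference extraction.",
]
groups = best_grouping(lines, toy_score)
for reference in groups:
    print(reference)
```

With this toy scorer, the first two lines are merged into one reference and the third is kept on its own. Because the grouping score is a product of probabilities, a poorly calibrated scorer would favour fewer, longer groups; the toy scorer counters this by penalising candidates that contain more than one year.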

09.08.2018 - 10:15
