DFG Project: EXCITE - Extraction of Citations from PDF Documents

The shortage of citation data for the international, and especially the German, social sciences is well known to researchers in the field and has itself often been the subject of academic studies. Citation data is the basis of effective information retrieval, recommendation systems and knowledge discovery processes. The accessibility of information in the social sciences lags behind other fields (e.g. the natural sciences) where more citation data is available. The EXCITE project aims to close this gap by developing a tool chain of software components for reference extraction, which will be applied to existing scientific databases (especially full texts in the social sciences). The tools will be made available to other researchers. The project will develop a number of algorithms for extracting references and citations from PDF full texts. It will also improve the matching of reference strings against bibliographic databases. The extraction of citations will be implemented as a five-step process: 1) extraction of text from the source documents, 2) identification of reference sections in the text, 3) segmentation of individual references into fields such as author, title, etc., 4) matching of reference strings against bibliographic databases, 5) export of the matched references in usable formats and services. Special attention will be paid to optimizing the individual components of the citation extraction. This will be done with the help of machine learning methods that control the quality of the data extracted by each component. The extracted citation data will be integrated into the services maintained by the proposers (sowiport and RelatedWork.net) and published as linked open data under permissive licenses to enable reuse. The software resulting from this project will be published under open source licenses and made accessible via a web service API.
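The five steps described above can be pictured as a simple chain of functions. The following is only a minimal sketch under strong assumptions, not the EXCITE implementation: all function names, the `Reference` class, the toy "Author (Year). Title." pattern and the dictionary standing in for a bibliographic database are hypothetical, and step 1 assumes the text has already been pulled out of the PDF by some other tool.

```python
import json
import re
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Reference:
    raw: str                                     # reference string as found in the text
    fields: dict = field(default_factory=dict)   # segmented fields, e.g. author, year, title
    match_id: Optional[str] = None               # database ID if a match was found


def extract_text(document: str) -> str:
    """Step 1 (stand-in): assume the PDF text is already available; normalize line endings."""
    return document.replace("\r\n", "\n")


def find_reference_lines(text: str) -> list:
    """Step 2: return the non-empty lines after a 'References' or 'Literatur' heading."""
    lines = text.split("\n")
    for i, line in enumerate(lines):
        if line.strip().lower() in {"references", "literatur"}:
            return [l.strip() for l in lines[i + 1:] if l.strip()]
    return []


# Toy pattern for one citation style only (hypothetical, for illustration).
PATTERN = re.compile(r"^(?P<author>.+?) \((?P<year>\d{4})\)\. (?P<title>.+?)\.$")


def segment(raw: str) -> Reference:
    """Step 3: split a reference string into fields; leave fields empty if it does not parse."""
    m = PATTERN.match(raw)
    return Reference(raw, m.groupdict() if m else {})


def match(ref: Reference, database: dict) -> Reference:
    """Step 4: look up the lower-cased title in a dict standing in for a bibliographic database."""
    ref.match_id = database.get(ref.fields.get("title", "").lower())
    return ref


def export(refs: list) -> str:
    """Step 5: serialize the references as JSON."""
    return json.dumps([asdict(r) for r in refs], ensure_ascii=False)


def run_pipeline(document: str, database: dict) -> str:
    text = extract_text(document)
    refs = [segment(line) for line in find_reference_lines(text)]
    return export([match(r, database) for r in refs])
```

Keeping each step as a separate function mirrors the idea of optimizing and evaluating the components individually: any one stage can be swapped for a learned model without touching the others.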

Platforms

EXCITE integrates and develops methods and applies them to several platforms including:

Outcomes

Method for Distantly Supervised Author Extraction from Social Science Research Papers using Conditional Random Fields

To help create citation information for the German social sciences, we contribute an approach for extracting author names from reference sections. Instead of relying on small amounts of manually labeled data, we use a distantly supervised approach to automatically generate a partially labeled training data set. Generalized expectation criteria provide a suitable objective function for learning conditional random fields (CRFs) from such partially labeled data. The resulting model not only decides whether a word is part of an author name, but also separates the listed authors and distinguishes between their first and last names.
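To illustrate what a partially labeled training set can look like, the sketch below labels only those tokens that appear in lists of known names and leaves every other token unlabeled. This is an assumption made for illustration: the source does not say how the distant labels are generated, the name lists and the label names `FIRST` and `LAST` are hypothetical, and the CRF training with generalized expectation criteria is not shown.

```python
import re


def tokenize(reference: str) -> list:
    """Split a reference string into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", reference)


def partially_label(tokens: list, known_last_names: set, known_first_names: set) -> list:
    """Assign a label only where a token matches a known name; None means 'unlabeled'."""
    labels = []
    for tok in tokens:
        key = tok.lower()
        if key in known_last_names:
            labels.append("LAST")
        elif key in known_first_names:
            labels.append("FIRST")
        else:
            labels.append(None)  # left open for the learner instead of forcing a guess
    return labels
```

For example, with `{"doe"}` as known last names and `{"jane"}` as known first names, the string `"Doe, Jane (2016). A title."` yields labels for `Doe` and `Jane` only; all remaining tokens stay unlabeled.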

Results:

For the evaluation, 54 reference sections were extracted from PDF files as text and their authors were manually labeled. The CRF models for author extraction were trained on up to 16,470 reference sections using different configurations. For the classification of the 7,055 manually labeled author words, our best model achieves a recall of 95.5% with a precision of 92.5%. The results further suggest ways of influencing the trade-off between the precision and recall of the model through its configuration.
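For readers unfamiliar with the two measures, the following sketch shows how word-level precision and recall can be computed from gold and predicted labels. It assumes a simplified binary setting (a word either is or is not an author word); the function name and this simplification are illustrative and are not taken from the thesis.

```python
def precision_recall(gold: list, predicted: list) -> tuple:
    """Compute precision and recall for binary per-word labels (True = author word)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)        # correctly found author words
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)    # words wrongly marked as author
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)    # author words that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Precision is the share of predicted author words that are correct, while recall is the share of true author words that were found, which is why a model configuration can trade one against the other.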

Publications:

Martin Körner, Author Extraction from Social Science Research Papers Using Conditional Random Fields and Distant Supervision, Master's Thesis, University of Koblenz-Landau, 2016.

Community Workshop 2017 at GESIS in Cologne

General Information

Project duration:

  • September 2016 - August 2018

Financial backer:

  • DFG - Deutsche Forschungsgemeinschaft

Partner:

Participants:

Martin Körner

mkoerner@uni-koblenz.de

Prof. Dr. Steffen Staab

staab@uni-koblenz.de

Dr. Heinrich Hartmann

heinrich@heinrichhartmann.com

Azam Hosseini

azam.hosseini@gesis.org

Dr. Philipp Mayr

philipp.mayr@gesis.org

Behnam Ghavimi

behnam.ghavimi@gesis.org