Environmental sciences study, among other topics, the toxicity of man-made chemicals released into the environment and the resulting loss of biodiversity. To advance the science and inform policy, a global picture of toxic contamination is needed. While no single research team can collect samples all over the globe, many measurements of chemicals are reported in scientific publications. In meta-analyses of such publications, environmental researchers manually extract this information in order to study chemical pollution at national or even global scale.
Beyond its high cost in both time and money, manual extraction is likely to miss individual measurements within a relevant paper, or relevant papers altogether. An exhaustive yet systematic search strategy is vital for achieving a satisfactory extraction result within an acceptable amount of time. Unfortunately, for this particular problem, manual extraction cannot ensure high quality in a reasonable time.
An automatic system is therefore essential for this task: it can expand the list of manual search clues and adopt machine learning techniques. Regarding the known clues, the measurements are usually associated with related keywords such as toxicity, water, or toxicant concentration. Furthermore, the measurement values are mostly followed by their units, such as µg/L, or accompany toxicity endpoints such as LD50. From a methodological perspective, information extraction from text has long been supported by machine learning approaches, notably Conditional Random Fields (CRFs) and Hidden Markov Models (HMMs).
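As a minimal illustration of the unit clue described above, a rule-based baseline could pair a numeric value with a following concentration unit using a regular expression. The unit list and the example sentence here are illustrative assumptions, not an inventory derived from the actual corpus:

```python
import re

# Hypothetical unit vocabulary for concentration values; a real system would
# cover many more units and spelling variants (e.g. "ppm", "mol/L").
UNIT = r"(?:µg|ug|mg|ng)/(?:L|mL|kg)"
# A number (optionally with a decimal part) immediately followed by a unit.
VALUE = rf"(\d+(?:\.\d+)?)\s*({UNIT})"

def find_measurements(text: str) -> list[tuple[float, str]]:
    """Return (value, unit) pairs for numbers directly followed by a unit."""
    return [(float(v), u) for v, u in re.findall(VALUE, text)]

sentence = "The LC50 of the compound was 3.2 µg/L in zebrafish."
print(find_measurements(sentence))  # → [(3.2, 'µg/L')]
```

Such a pattern would serve only as a candidate generator; the sequence models mentioned above (CRFs, HMMs) could then decide, from the surrounding context, whether a candidate is in fact a toxicity measurement.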
The topic of this master's project is an automatic or semi-automatic system for extracting toxicity measurement values from research papers. This will be achieved by analysing and understanding the data in order to find further clues. For a robust system, machine learning techniques will be applied and evaluated.