An Akkadian Word Segmentation approach using Chinese Word Segmentation Methods
Word segmentation is a fundamental task in natural language processing. Tokenizing a given text is the first step towards analyzing it further, for example with POS tagging, semantic analysis or automated translation. While word segmentation in languages using the Latin alphabet is comparatively trivial (words can usually be clearly distinguished by stop characters such as whitespace), word segmentation in languages not using the Latin alphabet needs further analysis. Over the last 20 years, research on segmenting Chinese texts has been successful to the point that the problem is now considered solved. This presentation applies methods for segmenting Chinese texts to an ancient Semitic language: Akkadian. Akkadian was used from the 28th to the 4th century BC in today's Iraq and Syria and is written in cuneiform. Cuneiform does not contain any stop characters (dot, comma, whitespace), but because it is written on clay tablets, words usually end at the tablet's edges. To segment Akkadian texts, rule-based, dictionary-based, statistical and machine learning approaches will be presented, and the corresponding tests will be illustrated. The evaluation process, further developments and side projects emerging from specific needs are also part of the presentation. Finally, the segmentation results and possible future perspectives for research in this field will be outlined and discussed.
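To give an idea of what a dictionary-based approach can look like, the following is a minimal sketch of forward maximum matching, a classic greedy method from Chinese word segmentation. It is not necessarily the method used in the presentation. The function name, the `max_len` parameter and the toy lexicon of transliterated sign sequences are assumptions made purely for illustration.

```python
def forward_max_match(signs, lexicon, max_len=4):
    """Greedy left-to-right segmentation of a sign sequence.

    At each position the longest lexicon entry (up to max_len signs)
    is taken as a word; an unknown sign becomes a one-sign word.
    """
    words = []
    i = 0
    while i < len(signs):
        # Try the longest candidate first and shrink until a match is found.
        for length in range(min(max_len, len(signs) - i), 0, -1):
            candidate = tuple(signs[i:i + length])
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon (illustrative only, not taken from a real corpus).
LEXICON = {("a", "na"), ("szar", "ri"), ("be", "li", "ia")}

# Unsegmented input: a flat sequence of transliterated signs.
signs = ["a", "na", "szar", "ri", "be", "li", "ia"]
segmented = forward_max_match(signs, LEXICON)
print(segmented)
```

The greedy strategy is simple and fast, but it cannot recover from an early wrong match. Statistical and machine learning approaches are typically used to address this weakness.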
23.01.15 - 10:15