RULE-BASED ANNOTATION OF LITHUANIAN TEXT CORPORA

Jurgita Kapočiūtė; Gailius Raškinis

Authors

Jurgita Kapočiūtė, Vytautas Magnus University
Gailius Raškinis, Vytautas Magnus University

Abstract

In this paper we present an algorithm that automatically recognizes and annotates person and place names, contractions, acronyms, foreign-language phrases, dates, and sentence boundaries in Lithuanian texts. The algorithm is based on a set of manually developed template-matching rules and a few specialized lexicons. It annotates a text by making several passes over it and can operate in either automatic or semi-automatic mode. In semi-automatic mode, the user is allowed to intervene in cases where the automatic decision is uncertain, and the user's feedback is stored in the lexicons. The rules and lexicons were developed after a careful examination of a text corpus of 600 thousand words. The algorithm was evaluated on a separate corpus of 400 thousand words and achieved ~93% annotation accuracy.
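
The multi-pass, lexicon-backed design described above can be illustrated with a minimal sketch. It is not the authors' implementation: the two regular expressions, the labels, the lexicon entries, and the names `annotate` and `ask` are all assumptions chosen for illustration. Each pass applies one template rule; in semi-automatic mode an uncertain candidate is passed to a user callback, and a confirmed answer is stored in the lexicon so that later runs no longer need to ask.

```python
import re
from typing import Callable, List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

# Illustrative template rules only; the paper's actual rules are not given here.
DATE_RULE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")          # e.g. 2005-03-14
CONTRACTION_CANDIDATE = re.compile(r"\b[^\W\d_]{1,5}\.")  # short word + period


def annotate(
    text: str,
    lexicon: Set[str],
    ask: Optional[Callable[[str], bool]] = None,
) -> List[Span]:
    """Annotate text in several passes.

    lexicon holds known contractions in lower case (hypothetical format).
    ask is None in automatic mode; in semi-automatic mode it is called
    for every candidate that the lexicon cannot resolve.
    """
    spans: List[Span] = []

    # Pass 1: dates, matched by a template rule alone.
    for m in DATE_RULE.finditer(text):
        spans.append((m.start(), m.end(), "DATE"))

    # Pass 2: contractions, matched by a template rule plus the lexicon.
    for m in CONTRACTION_CANDIDATE.finditer(text):
        token = m.group().lower()
        if token in lexicon:
            spans.append((m.start(), m.end(), "CONTRACTION"))
        elif ask is not None and ask(m.group()):
            lexicon.add(token)  # remember the user's decision
            spans.append((m.start(), m.end(), "CONTRACTION"))

    return sorted(spans)


if __name__ == "__main__":
    lexicon = {"prof."}
    print(annotate("Prof. Jonas atvyko 2005-03-14.", lexicon))
    # Semi-automatic run: the callback stands in for an interactive prompt.
    print(annotate("Dr. Ona atvyko.", lexicon, ask=lambda tok: tok == "Dr."))
    print(sorted(lexicon))
```

Passing the lexicon in as a mutable set keeps the feedback loop explicit: whatever the user confirms in one run is available to every later automatic run over the same lexicon.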
