RULE-BASED ANNOTATION OF LITHUANIAN TEXT CORPORA
Abstract
In this paper we present an algorithm that automatically recognizes and annotates person and place names, contractions, acronyms, foreign-language phrases, dates, and sentence boundaries in Lithuanian texts. The algorithm is based on a set of manually developed template-matching rules and a few specialized lexicons, and it annotates a text by making several passes over it. It can operate in automatic and semi-automatic modes. In the semi-automatic mode, the user can intervene in cases where the automatic decision is uncertain, and the user's feedback is stored in the lexicons for later use. The rules and lexicons were developed after a careful examination of a text corpus of 600 thousand words. The algorithm was evaluated on a separate corpus of 400 thousand words and achieved ~93% annotation accuracy.
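To make the multi-pass idea concrete, the following is a minimal sketch of rule- and lexicon-based annotation with an optional semi-automatic mode. It is not the algorithm from the paper: the names `RULES`, `LEXICON`, `annotate`, and `ask`, the two regular-expression templates, and the lexicon entries are all illustrative assumptions, and the patterns cover only ASCII letters and ISO-style dates rather than real Lithuanian orthography and date formats.

```python
import re

# Hypothetical, tiny lexicon; the paper's actual lexicons are not reproduced here.
LEXICON = {"Vilnius": "place", "Jonas": "person"}

# Hypothetical template rules, applied one pass per rule: (label, pattern).
RULES = [
    ("date", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),  # ISO-style dates only
    ("acronym", re.compile(r"\b[A-Z]{2,}\b")),       # ASCII capitals only
]

# Candidate proper names: one capital followed by lowercase letters (ASCII only).
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")


def annotate(text, lexicon, ask=None):
    """Return sorted (start, end, label) spans found in `text`.

    With ask=None the function runs in automatic mode and skips capitalized
    words missing from the lexicon. In semi-automatic mode, `ask` is a
    callback that receives an unknown word and returns a label or None;
    any label it returns is stored in the lexicon.
    """
    spans = []
    taken = set()  # character offsets already covered by an earlier pass

    def add(match, label):
        offsets = range(match.start(), match.end())
        if taken.isdisjoint(offsets):  # earlier passes take precedence
            taken.update(offsets)
            spans.append((match.start(), match.end(), label))

    # Template-matching passes, one per rule.
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            add(match, label)

    # Lexicon pass over capitalized words.
    for match in CAPITALIZED.finditer(text):
        word = match.group()
        label = lexicon.get(word)
        if label is None and ask is not None:
            label = ask(word)          # uncertain case: defer to the user
            if label is not None:
                lexicon[word] = label  # remember the user's decision
        if label is not None:
            add(match, label)

    return sorted(spans)
```

In this sketch a word that the user labels once is resolved automatically afterwards, because the decision is written back into the lexicon that later calls consult.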
License
Copyright terms are set out in Articles 4-37 of the Republic of Lithuania Law on Copyright and Related Rights.