RULE-BASED ANNOTATION OF LITHUANIAN TEXT CORPORA

Authors

  • Jurgita Kapočiūtė, Vytautas Magnus University
  • Gailius Raškinis, Vytautas Magnus University

Abstract

In this paper we present an algorithm that automatically recognizes and annotates person and place names, contractions, acronyms, foreign-language phrases, dates, and sentence boundaries in Lithuanian texts. The algorithm is based on a set of manually developed template-matching rules and a few specialized lexicons. It annotates a text by making several passes over it, and it can operate in either automatic or semi-automatic mode. In semi-automatic mode, the user is allowed to intervene in cases where the automatic decision is uncertain, and the user's feedback is stored in the lexicons. The rules and lexicons were developed after a careful examination of a text corpus of 600 thousand words. The algorithm was evaluated on a separate corpus of 400 thousand words and achieved approximately 93% annotation accuracy.
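
To make the multi-pass, rule-plus-lexicon idea concrete, the following is a minimal sketch in Python and not the authors' implementation. The rules, the labels (DATE, ABBR, SENT_END), the lexicon contents, the uncertainty heuristic, and the names annotate and ask are all assumptions made for illustration. The first pass matches a date template, the second pass consults an abbreviation lexicon to decide whether a trailing period ends a sentence, and in semi-automatic mode an optional callback resolves uncertain cases and its answer is stored in the lexicon.

```python
import re
from typing import Callable, List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

# Illustrative templates only; the paper's actual rules are not given here.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
TOKEN_RE = re.compile(r"\S+")


def annotate(
    text: str,
    lexicon: Set[str],
    ask: Optional[Callable[[str], bool]] = None,
) -> List[Span]:
    """Annotate text in two passes; `ask` enables semi-automatic mode."""
    spans: List[Span] = []

    # Pass 1: template matching for ISO-style dates.
    for m in DATE_RE.finditer(text):
        spans.append((m.start(), m.end(), "DATE"))

    # Pass 2: decide what each token-final period means.
    for m in TOKEN_RE.finditer(text):
        token = m.group()
        if not token.endswith("."):
            continue
        if token.lower() in lexicon:
            spans.append((m.start(), m.end(), "ABBR"))
            continue
        # Assumed heuristic: a short alphabetic token followed by a period
        # might be an abbreviation that is missing from the lexicon.
        uncertain = len(token) <= 4 and token[:-1].isalpha()
        if uncertain and ask is not None and ask(token):
            lexicon.add(token.lower())  # remember the user's decision
            spans.append((m.start(), m.end(), "ABBR"))
            continue
        # Default: treat the period as a sentence boundary.
        spans.append((m.end() - 1, m.end(), "SENT_END"))

    return sorted(spans)
```

In automatic mode, call annotate(text, lexicon) without ask, and every uncertain period falls back to a sentence boundary. Passing a callback such as one that prompts the user switches to semi-automatic mode, and any abbreviation the user confirms is recognized on later runs because it has been added to the lexicon.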

Published

2005-09-10

Section

Articles