Part-of-Speech Tagging with Minimal Lexicalization
Abstract: This research explores an alternative to lexicalization in Part-of-Speech (PoS) tagging. Drawing on linguistic knowledge, the study proposes a minimalist tagger with a small but efficient feature set that maintains reasonable performance across corpora. The authors compare their approach with earlier methods, such as Brill's rule-based tagger and statistical approaches such as TnT and Hidden Markov Models (HMMs), and demonstrate that their method achieves state-of-the-art accuracy with a significantly smaller vocabulary.
Research Question: Can a minimalist tagger with a small feature set maintain reasonable performance in Part-of-Speech tagging while generalizing better across corpora than existing methods?
Methodology: The authors use a Dynamic Bayesian Network (DBN) to represent a variety of sublexical and contextual features relevant to PoS tagging. The DBN represents these features compactly, and the result is a flexible tagger (LegoT) with state-of-the-art performance. The authors then explore the effect of eliminating redundancy among features and of radically reducing the size of the feature vocabularies. Finally, they investigate a minimal lexicon limited to functional words and show that it is enough to ensure reasonable performance.
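To make the idea of sublexical and contextual features concrete, the following is a minimal sketch of a feature extractor, not the LegoT implementation. The function name `sublexical_features`, the suffix list, the function-word lexicon, and the Penn-Treebank-style tags are all hypothetical choices made for illustration; the summary does not list the actual feature inventory, and the DBN that would consume these features is not shown.

```python
from typing import Dict, Optional

# Hypothetical, illustrative inventories; NOT the lists used in the paper.
FUNCTION_WORDS: Dict[str, str] = {
    "the": "DT", "a": "DT", "of": "IN", "in": "IN", "and": "CC", "to": "TO",
}
SUFFIXES = ("tion", "ing", "ed", "ly", "s")


def longest_suffix(word: str) -> Optional[str]:
    """Return the longest listed suffix that is a proper suffix of `word`."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return suffix
    return None


def sublexical_features(token: str, prev_tag: str) -> Dict[str, object]:
    """Describe a token by a few small-vocabulary features.

    Only function words are looked up by identity; every other word is
    represented by its suffix and surface shape, plus the previous tag
    as a contextual feature.
    """
    lower = token.lower()
    return {
        "lexicon_tag": FUNCTION_WORDS.get(lower),   # None for open-class words
        "suffix": longest_suffix(lower),
        "capitalized": token[:1].isupper(),
        "has_hyphen": "-" in token,
        "has_digit": any(ch.isdigit() for ch in token),
        "prev_tag": prev_tag,                       # contextual feature
    }


if __name__ == "__main__":
    for tok, prev in [("The", "<s>"), ("nation", "DT"), ("Running", "IN")]:
        print(tok, sublexical_features(tok, prev))
```

Because each feature draws on a small closed vocabulary (a handful of suffixes, a few booleans, a short function-word list), the representation does not grow with the size of the training corpus, which is the property the minimal-lexicalization approach relies on.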
Results: The study finds that a small but linguistically motivated set of suffixes results in improved cross-corpus generalization. The LegoT tagger achieves an error rate of 3.6% on a benchmark corpus, outperforming previous methods like Brill's tagger and statistical taggers. The authors also demonstrate that a minimal lexicon limited to functional words is sufficient to ensure reasonable performance.
Implications: The research suggests that a minimalist approach to PoS tagging can achieve state-of-the-art accuracy with a significantly smaller vocabulary. This approach not only reduces the reliance on lexicalization but also improves generalization across corpora. The study's findings have implications for the development of future natural language processing systems, as they provide a more efficient and effective method for PoS tagging.
Link to Article: https://arxiv.org/abs/0312060v1
Authors:
arXiv ID: 0312060v1