POS Tagger and Chunker for Tamil Language
Publication Type:Conference Proceedings
Source:Proceedings of the 8th Tamil Internet Conference, Cologne, Germany (2009)
This paper presents the Part Of Speech tagger and Chunker for Tamil using Machine learning techniques. Part Of Speech tagging and chunking are the fundamental processing steps for any language processing task. Part of speech (POS) tagging is the process of labeling automatic annotation of syntactic categories for each word in a corpus. Chunking is the task of identifying and segmenting the text into syntactically correlated word groups. These are done by the machine learning techniques, where the linguistical knowledge is automatically extracted from the annotated corpus. We have developed our own tagset for annotating the corpus, which is used for training and testing the POS tagger generator and the chunker. The present tagset consists of thirty-two tags for POS and nine tags for chunking. A corpus size of two hundred and twenty five thousand words was used for training and testing the accuracy of the POS tagger and Chunker. We found that SVM based machine learning tool affords the most encouraging result for Tamil POS tagger (95.64%) and chunker (95.82%).