September 20, 2009
Center for Computational Engineering and Networking
The 2009 Tamil Internet Conference, organized by the Institute of Indology and Tamil Studies of the University of Cologne in Germany, will take place next month, bringing together scholars from around the world to discuss issues related to computing in the Tamil language.
Amrita will be represented as well; CEN research associates will present their research papers, as noted below.*
Topics to be addressed at the conference include Open Source Software and Localization, Machine Translation, OCR and Voice Recognition, E-Learning, Tools for Tamil Computing, Tamil Enabling in Mobile Phones, Digital Archiving of Tamil Heritage Materials, and Standards for Tamil Computing. “We will have the opportunity to demonstrate our linguistic tools to international experts,” stated Dr. K. P. Soman, CEN Director.
Natural Language Processing (NLP) is a major area of research focus for CEN. The Center has developed and made available for download several computational linguistic tools, some of which will be demoed at the conference. Recently the Center received funding from the Ministry of Information Technology to continue this work. This conference will provide a platform to showcase the work to international experts of Tamil Linguistics and Indology.
Dr. A. G. Menon, S. Saravanan, R. Loganathan, Dr. K. P. Soman, Amrita Morph Analyzer and Generator for Tamil: A Rule-Based Approach
Morphology deals primarily with the structure of words. Morphological analysis detects, identifies and describes the meaningful constituent morphs that function as the building blocks of a word.
Densely agglutinative Dravidian languages such as Tamil, Malayalam, Telugu and Kannada form words in a distinctive way: suffixes representing various senses or grammatical categories are added after the roots or stems. Senses such as person, number, gender and case are linked to a noun stem in an orderly sequence by these building blocks.
Verbal categories such as transitivity, causation, tense, person, number and gender are added to a verbal root or stem. The morphs representing these categories have their own slots after the roots or stems. This highly complicated nominal and verbal morphology does not stand alone: it regulates the direct syntactic agreement between the subject and the predicate.
Another important aspect of adding the building blocks, or morphs, in an orderly way to form a word is the change that often takes place at the boundary between these morphs. We may call this Sandhi change. A Morphological Analyser and Generator should take care of these changes while assigning a suitable morph to the correct position to generate a word. Each morph combines a sense with a form, and the rules by which a language produces an utterance of maximum mutual intelligibility can be identified. These two facts are the incentives to attempt to build an engine that can automatically analyse and generate the same processes taking place in the brain of a native speaker. Artificial Intelligence is an essential component of any Machine Translation (MT) or NLP system.
The Morph Analyser and Generator (MAG) is an important tool in NLP and MT. An accurate morphological analysis of a language, and the rules formulated on the basis of this analysis, form the backbone of this approach. Formulating rules enhances the degree of predictability of a correct translation. Such an approach is highly productive when the source language is analytic and the target language is agglutinative. In such a context it is necessary to link the senses in the source language with those of the target language. For example, an English predicate such as “came” contains only two senses: 1. coming and 2. past tense. A Tamil predicate such as “vantaan”, by contrast, reflects more senses: 1. coming, 2. past tense, 3. person, 4. number and 5. gender. In such cases the morphological analyser, together with the syntactic parser, will link the senses of the two languages and produce a correct translation.
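The contrast between “came” and “vantaan” can be pictured as two bundles of senses. The following is only a minimal sketch, not the Amrita implementation: the dictionary layout, the feature names and the helper `missing_senses` are assumptions made for illustration, and the feature values reflect the usual reading of “vantaan” as “he came”.

```python
# Hypothetical feature bundles for the article's example words.
# Keys and values are illustrative assumptions, not the Amrita tagset.
ENGLISH = {
    "came": {"lemma": "come", "tense": "past"},
}
TAMIL = {
    "vantaan": {
        "lemma": "vaa",          # 'come'
        "tense": "past",
        "person": "3",
        "number": "singular",
        "gender": "masculine",
    },
}


def missing_senses(source, target):
    """Return the senses the target word expresses but the source word does not."""
    return sorted(key for key in target if key not in source)


# Senses a translation system must recover from elsewhere in the English
# sentence (for instance the subject) before it can generate the Tamil verb.
print(missing_senses(ENGLISH["came"], TAMIL["vantaan"]))
# → ['gender', 'number', 'person']
```

In this picture, the syntactic parser would be the component that supplies the three missing senses, for example from the subject of the English sentence.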
Amrita has developed two rule-based MAGs: one for the Tamil noun and the other for the Tamil verb. The technology employed to build these tools is FST.
A Finite State Transducer (FST) is used for both the morphological analyzer and the generator. An FST maps between two sets of symbols: it accepts an input string if the string is in the language and generates another string as its output. The system is based on the lexicon and orthographic rules of a two-level morphological system. In the morphological generator, if the input string, which carries the root word and its morpheme information, is accepted by the automaton, the first level generates the corresponding string of root word and morpheme units. The output of the first level is taken as the input to the second level, where the orthographic rules are handled; if the string is accepted there, the second level generates the inflected word.
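The two levels of the generator can be illustrated with a small sketch. It is not the Amrita tool and not a compiled FST: a dictionary lookup stands in for the lexicon level and ordered regular-expression rewrites stand in for the orthographic level. The toy lexicon, the romanization, the `+` boundary symbol, the tag names and the single sandhi rule are all assumptions chosen for illustration.

```python
import re

# Level 1 (lexicon): toy entries in romanized Tamil, assumed for this sketch.
ROOTS = {"maram", "paiyan"}               # 'tree', 'boy'
CASE_SUFFIXES = {"ACC": "ai", "DAT": "ukku"}

# Level 2 (orthographic rules): ordered rewrites on the intermediate string,
# where '+' marks a morph boundary.
RULES = [
    # Nouns ending in -am take -att- before a vowel-initial case suffix.
    (re.compile(r"am\+(?=[aeiou])"), "att"),
    # Otherwise the boundary is simply erased.
    (re.compile(r"\+"), ""),
]


def lexical_level(root, case):
    """Accept root + case tag if both are known; emit 'root+suffix'."""
    if root not in ROOTS or case not in CASE_SUFFIXES:
        return None                        # string rejected by the lexicon
    return root + "+" + CASE_SUFFIXES[case]


def surface_level(intermediate):
    """Apply the orthographic rules in order to get the inflected word."""
    for pattern, replacement in RULES:
        intermediate = pattern.sub(replacement, intermediate)
    return intermediate


def generate(root, case):
    """Run both levels; return None when the first level rejects the input."""
    intermediate = lexical_level(root, case)
    return None if intermediate is None else surface_level(intermediate)


print(generate("maram", "ACC"))    # → marattai
print(generate("paiyan", "ACC"))   # → paiyanai
```

A real two-level system would compile the lexicon and the rules into transducers, so that the same machinery can run in reverse for analysis; the sketch only runs in the generation direction.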
Dhanalakshmi, Anand Kumar M., Dr. S. Rajendran, Dr. K. P. Soman, POS Tagger and Chunker for Tamil Language
This paper presents a Part-of-Speech Tagger and Chunker for Tamil built using machine learning techniques. Part-of-Speech (POS) tagging and chunking are fundamental processing steps for any language processing task. POS tagging is the automatic annotation of each word in a corpus with its syntactic category. Chunking is the task of identifying and segmenting the text into syntactically correlated word groups. Both are done by machine learning techniques, where the linguistic knowledge is automatically extracted from the annotated corpus.
We have developed our own tagset for annotating the corpus, which is used for training and testing the POS tagger generator and the chunker. The present tagset consists of thirty-two tags for POS and nine tags for chunking. A corpus of 225,000 words was used for training and testing the accuracy of the POS Tagger and Chunker. We found that an SVM-based machine learning tool gives highly encouraging results for the Tamil POS Tagger (95.64%) and Chunker (95.82%).
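To make the chunking task concrete, the sketch below groups words into chunks from per-word labels. It assumes the common B-/I-/O labeling convention and uses made-up chunk names (`NP`, `VP`); the abstract does not list the nine chunk tags or say which encoding was used, so none of this should be read as the authors' tagset. The function `group_chunks` is likewise a hypothetical helper.

```python
def group_chunks(tokens, labels):
    """Group tokens into (chunk_type, [tokens]) spans from B-/I-/O labels."""
    chunks = []
    current = None                          # the chunk currently being extended
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- label always opens a new chunk.
            current = (label[2:], [token])
            chunks.append(current)
        elif (label.startswith("I-") and current is not None
              and current[0] == label[2:]):
            # An I- label of the same type continues the open chunk.
            current[1].append(token)
        else:
            # 'O' or an inconsistent I- label closes the open chunk.
            current = None
    return chunks


# Romanized example sentence, roughly 'that boy came home'.
tokens = ["anta", "paiyan", "viiTTukku", "vantaan"]
labels = ["B-NP", "I-NP", "B-NP", "B-VP"]   # assumed labels, not the paper's
print(group_chunks(tokens, labels))
# → [('NP', ['anta', 'paiyan']), ('NP', ['viiTTukku']), ('VP', ['vantaan'])]
```

In a learned chunker, a classifier such as an SVM would predict the labels; the grouping step shown here only turns those predictions into word groups.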
Anand Kumar M., Dhanalakshmi, Dr. S. Rajendran, Dr. K. P. Soman, A Novel Approach to Morphological Analysis for Tamil Language
This paper presents a novel methodology for morphological analysis of the Tamil language using machine learning. Morphological analysis is concerned with retrieving the structure, syntactic rules, morphological properties and the meaning of a morphologically complex word.
The morphological structure of an agglutinative language is unique, and capturing its complexity in a format that a machine can analyze and generate is a challenging job. Generally, a rule-based approach is used in building a morphological analyzer. In a rule-based approach, however, what works in the forward direction may not work in the backward direction.
This novel morphological analyzer is based on sequence labeling and training by kernel methods. It captures the non-linear relationships and various morphological features of the Tamil language in a better and simpler way. The performance of our system is compared with that of existing morphological analyzers that are available online. Our system significantly outperforms them and achieves a very competitive accuracy of 95.65% for the Tamil language.
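One way to picture morphological analysis as sequence labeling is to give every character of a word a label that says which morph it belongs to and whether it starts that morph. The sketch below prepares such training pairs from a segmented word. The character-level encoding, the B-/I- prefixes and the tag names (`ROOT`, `PAST`, `3SM`) are assumptions for illustration; the abstract does not describe the actual label scheme or features.

```python
def to_char_labels(segments):
    """Turn [(morph, tag), ...] into one (character, label) pair per character."""
    pairs = []
    for morph, tag in segments:
        for index, char in enumerate(morph):
            prefix = "B-" if index == 0 else "I-"   # B- marks the start of a morph
            pairs.append((char, prefix + tag))
    return pairs


# 'vantaan' segmented as va + nt + aan (root, past tense, person-number-gender).
segments = [("va", "ROOT"), ("nt", "PAST"), ("aan", "3SM")]
for char, label in to_char_labels(segments):
    print(char, label)
```

A sequence labeler trained on pairs like these would predict one label per character for an unseen word, and the morph boundaries could then be read off wherever a `B-` label appears.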