Computational Linguistics and Natural Language Processing
Creation of Machine Translation Tools and Resources for English to Dravidian Languages
This project is an initiative of the Ministry of Human Resource Development under the National Mission on Education through ICT.
Natural Language Processing (NLP) is about making computers learn natural (human) languages. Practical applications of NLP are plenty -- automatic translation of text from one language to another, translating speech from one language to another (Speech to Speech Translation), reading out written text (Text to Speech), automatically producing a transcript for speech (Speech Recognition), answering questions posed by the user (Question Answering) -- to name a few.
The objective of this project is to develop Machine Translation (MT) systems for English to Indian languages (Tamil, Malayalam, Telugu and Kannada) and between Indian languages (Malayalam to Tamil), along with linguistic resources that would facilitate the creation of rich educational content in Indian languages. The research effort aims to base all the tools and translation systems on Machine Learning methodologies, so that computer science graduates and other non-linguists can immediately participate in the national mission on literacy by contributing additional tools for language translation.
The project is being implemented by a consortium of five universities: Amrita Vishwa Vidyapeetham, IIT Bombay, Dravidian University, University of Hyderabad and Tamil University.
Morphological Analyzer / Generator for Tamil
Any Natural Language Processing (NLP) application for any language starts with the development of a Morphological Analyzer, or Word Analyzer, which analyzes an inflected word and provides information such as its root word or stem and the constituent morphemes from which the word was constructed. Building morphological analyzers for highly inflectional languages such as the Indian languages is difficult but crucial for applications such as Machine Translation (MT) and dialog-based natural language understanding systems.
We began the morphological analyzer project with Tamil. Tamil is agglutinative, highly inflectional and rich in morphology. The major inflectional categories in Tamil are nouns and verbs. Noun morphology in Tamil is simple compared to verb morphology: a single Tamil verb can take at least 200 forms even without the auxiliary information, whereas a noun inflects for only 8 cases. Extremely simple paradigms were used to categorize the root words. The current implementation outputs all possible known legitimate splits. The lexicon includes 50,000 nouns and a few hundred verbs.
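The idea of returning every legitimate split can be illustrated with a minimal sketch. It assumes a plain root lexicon plus a suffix table and ignores sandhi changes and paradigm classes. The names `ROOTS`, `SUFFIXES`, `split_suffixes` and `analyze` are hypothetical, and the romanized strings and tags are made-up placeholders, not verified Tamil forms or the project's actual data.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: placeholder strings and tags, not verified Tamil forms.
ROOTS: Dict[str, str] = {"mara": "NOUN", "maram": "NOUN"}
SUFFIXES: Dict[str, str] = {"ai": "SUF_A", "mai": "SUF_B", "kku": "SUF_C"}

Morph = Tuple[str, str]  # (surface string, tag)


def split_suffixes(rest: str) -> List[List[Morph]]:
    """Return every way to segment `rest` into a sequence of known suffixes."""
    if not rest:
        return [[]]  # exactly one way to segment the empty string
    results: List[List[Morph]] = []
    for i in range(1, len(rest) + 1):
        head = rest[:i]
        if head in SUFFIXES:
            for tail in split_suffixes(rest[i:]):
                results.append([(head, SUFFIXES[head])] + tail)
    return results


def analyze(word: str) -> List[Tuple[Morph, List[Morph]]]:
    """Return all (root, suffix sequence) splits licensed by the toy lexicon."""
    analyses: List[Tuple[Morph, List[Morph]]] = []
    for i in range(1, len(word) + 1):
        root = word[:i]
        if root in ROOTS:
            for suffix_seq in split_suffixes(word[i:]):
                analyses.append(((root, ROOTS[root]), suffix_seq))
    return analyses


if __name__ == "__main__":
    for root, suffixes in analyze("maramai"):
        print(root, suffixes)
```

With this toy lexicon the placeholder word "maramai" has two competing splits ("mara" + "mai" and "maram" + "ai"), which is the kind of ambiguity a later disambiguation step would have to resolve.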
Machine Learning based Morphological Analyzer
We have developed a morphological analyzer for complex agglutinative natural languages using a machine learning approach. Morphological analysis is concerned with retrieving the structure, the syntactic and morphological properties, or the meaning of a morphologically complex word. The morphological structure of an agglutinative language is unique, and capturing its complexity in a format that machines can both analyze and generate is challenging. Morphological analyzers are generally built with rule-based approaches, but in such approaches what works in the forward direction may not work in the backward direction. Our machine learning approach, based on sequence labeling and training with kernel methods, captures the non-linear relationships among the different morphological features of natural languages in a better and simpler way. The overall accuracy obtained for the morphologically rich agglutinative languages (Tamil, Malayalam and Telugu) was encouraging.
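One common way to cast segmentation as sequence labeling is to give every character a label and let a classifier predict the label from a window of neighbouring characters. The sketch below shows only that framing, under the assumption of a character-level B/I scheme (B starts a morpheme, I continues it). The label scheme, the feature names and the functions `boundary_labels`, `window_features` and `decode` are illustrative assumptions, not the project's actual design, and the kernel-based training step itself is not shown.

```python
from typing import Dict, List


def boundary_labels(morphemes: List[str]) -> List[str]:
    """Turn a segmented word into one B/I label per character."""
    labels: List[str] = []
    for morpheme in morphemes:
        if morpheme:  # skip empty segments
            labels.extend(["B"] + ["I"] * (len(morpheme) - 1))
    return labels


def window_features(chars: str, i: int, size: int = 2) -> Dict[str, str]:
    """Describe position i by the characters within `size` steps of it."""
    feats: Dict[str, str] = {}
    for offset in range(-size, size + 1):
        j = i + offset
        feats[f"c[{offset:+d}]"] = chars[j] if 0 <= j < len(chars) else "<pad>"
    return feats


def decode(chars: str, labels: List[str]) -> List[str]:
    """Rebuild the morpheme list from characters and predicted labels."""
    morphemes: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B" or not morphemes:
            morphemes.append(ch)
        else:
            morphemes[-1] += ch
    return morphemes


if __name__ == "__main__":
    word = "maramai"  # placeholder string, not a verified Tamil form
    labels = boundary_labels(["maram", "ai"])
    rows = [(window_features(word, i), labels[i]) for i in range(len(word))]
    print(rows[0])
    print(decode(word, labels))
```

Each (features, label) row would be one training instance for a multi-class classifier. Because the same labels can be decoded back into morphemes, analysis reduces to predicting one label per character.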
Tamil POS Tagger
Part of speech (POS) tagging is the process of assigning a part of speech or other lexical class marker to each word in a sentence. It is similar to the process of tokenization for computer languages. POS tagging is a well-understood problem in NLP to which machine learning approaches are applied. Interest in annotated corpora is spreading, as there is increasing interest in using existing machine learning approaches for corpus processing.
We have prepared a POS-tagged corpus of two hundred and twenty-five thousand words, collected from the Dinamani newspaper, Yahoo Tamil News, online Tamil short stories and other sources. We designed a new tag set (the Amrita Tagset) and used it to create the annotated corpus. Many POS taggers are available for English and other foreign languages, but no existing tagger for Tamil gives good results. We used SVMTool to develop a POS tagger generator for Tamil. The overall tagging accuracy was 94.12%.
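A tagging accuracy figure of this kind can be computed as in the sketch below, assuming that accuracy means the fraction of tokens whose predicted tag equals the gold tag and that the corpus is stored one token per line as "word tag". The file layout, the tag names and the helpers `read_tagged` and `tagging_accuracy` are assumptions made for illustration. The Amrita Tagset itself is not reproduced here.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word, tag)


def read_tagged(text: str) -> List[TaggedToken]:
    """Parse 'word tag' lines; blank lines (sentence breaks) are skipped."""
    pairs: List[TaggedToken] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError(f"expected 'word tag', got: {line!r}")
        pairs.append((parts[0], parts[1]))
    return pairs


def tagging_accuracy(gold: List[TaggedToken], predicted: List[TaggedToken]) -> float:
    """Fraction of tokens whose predicted tag matches the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        return 0.0
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / len(gold)


if __name__ == "__main__":
    # Placeholder words and hypothetical tag names, for illustration only.
    gold = read_tagged("w1 NN\nw2 VB\n\nw3 NN\nw4 ADJ\n")
    pred = read_tagged("w1 NN\nw2 NN\n\nw3 NN\nw4 ADJ\n")
    print(f"{tagging_accuracy(gold, pred):.2%}")
```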
Malayalam POS Tagger
SVM Based English to Tamil Transliterator
Machine transliteration is the automatic process of transcribing a word or text written in one writing system into a phonetically equivalent word in another writing system. Machine translation and cross-language information retrieval always need efficient mechanisms for machine transliteration, especially when proper names and technical terms are involved.
We model the transliteration problem as a sequence labeling problem and solve it using SVM. Applying this technique to English to Tamil transliteration, we achieved exact Tamil transliterations for 84.16% of English names. The accuracy rises to 93.33% when the correct transliteration may be any of the five top-ranked candidates.
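The two figures correspond to what is often called accuracy at rank 1 and at rank 5. A minimal sketch of that measure follows, assuming each name has one reference transliteration and the system returns a ranked candidate list. The function name `accuracy_at_k` and the sample strings are hypothetical placeholders, not the project's evaluation script or data.

```python
from typing import List


def accuracy_at_k(references: List[str], ranked: List[List[str]], k: int) -> float:
    """Fraction of names whose reference appears among the top-k candidates."""
    if len(references) != len(ranked):
        raise ValueError("one ranked candidate list is needed per reference")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not references:
        return 0.0
    hits = sum(1 for ref, cands in zip(references, ranked) if ref in cands[:k])
    return hits / len(references)


if __name__ == "__main__":
    # Placeholder strings standing in for reference and candidate transliterations.
    refs = ["t1", "t2", "t3", "t4"]
    outputs = [["t1", "x"], ["y", "t2"], ["z", "q"], ["t4"]]
    print(accuracy_at_k(refs, outputs, k=1))  # exact match on the first candidate
    print(accuracy_at_k(refs, outputs, k=2))  # reference anywhere in the top two
```

Accuracy can only stay the same or rise as k grows, which is consistent with the top-five figure being higher than the exact-match figure.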
Rule Based English to Tamil Transliterator
We have also developed a Rule Based Transliterator. This transliterator was trained using the WEKA data mining tool. It achieved exact Tamil transliterations for 79.23% of English names. The accuracy rises to 91.45% when the correct transliteration may be any of the five top-ranked candidates.
Linguistic Tree Viewer in Java
NLP and linguistics researchers who work with syntax often want to visualize parse trees or create linguistic trees to analyze the structure of a language. The TreeViewer software provides an easy-to-use interface for visualizing or creating simple linguistic trees. The software is written entirely in Java.
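To make concrete what visualizing a linguistic tree involves, the sketch below parses a tree written in the widely used bracketed notation and prints it with indentation. It is an illustration in Python, not TreeViewer's own code or input format. The bracketed format and the functions `parse` and `render` are assumptions made for this example.

```python
from typing import List, Union

Tree = Union[str, list]  # a leaf word, or [label, child, child, ...]


def parse(text: str) -> Tree:
    """Parse a bracketed tree such as '(S (NP the dog) (VP barks))'."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Tree:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == ")":
            raise ValueError("unexpected ')'")
        if tok != "(":
            return tok  # a leaf word
        if pos >= len(tokens) or tokens[pos] in ("(", ")"):
            raise ValueError("expected a node label after '('")
        node: list = [tokens[pos]]
        pos += 1
        while pos < len(tokens) and tokens[pos] != ")":
            node.append(read())
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume the closing bracket
        return node

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing input after the tree")
    return tree


def render(tree: Tree, depth: int = 0) -> List[str]:
    """Return one indented line per node, children below their parent."""
    if isinstance(tree, str):
        return ["  " * depth + tree]
    lines = ["  " * depth + tree[0]]
    for child in tree[1:]:
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(render(parse("(S (NP the dog) (VP barks))"))))
```

A graphical viewer would replace `render` with a layout step that assigns screen coordinates to nodes, but the nested structure it draws is the same.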
A comprehensive Wordnet is currently being developed along with Tamil University, Thanjavur and Kuppam University, Kuppam, as part of the Sakshat Amrita project on Machine Translation for the Ministry of Human Resource Development. This project will be linked with the Indo WordNet project headed by Dr. Pushpak Bhattacharyya at IIT Bombay.
Wordnet is an online lexical reference system. Its design is inspired by current psycholinguistic and computational theories of human lexical memory. It is a very important tool in the process of Machine Translation. It is also a very handy tool for linguists and language enthusiasts. Currently the core synsets of the Dravidian languages are available for download from the links given below.
This data can be used along with the Wordnet Tool developed by IIT Bombay.
You can download the tools here:
- Malayalam Wordnet - Developed by team under Dr. K.P. Soman, Amrita University, Coimbatore
- Tamil Wordnet - Developed by team under Dr. Rajendran S, Tamil University, Thanjavur
- Telugu Wordnet - Developed by team under Dr. Arul Mozhi, Dravidian University, Kuppam
- Kannada Wordnet - Developed by team under Dr. Kavi Narayanamurthi, University of Hyderabad