Qualification: Ph.D.
m_anandkumar@cb.amrita.edu

Dr. M. Anand Kumar currently serves as Assistant Professor at Amrita Center for Computational Engineering and Networking (CEN), Coimbatore Campus.

Invited Talk

  • “Machine Translation Tools for Grammar Teaching”, Presented at Tamil Internet Conference 2010, June 2010, Cemmozhi Maanaadu, Coimbatore, India.
  • POS tagging and Morphological analyzer, National Workshop on “Computational Linguistics and Machine Translation from English to Indian Languages” at AMRITA VISHWA VIDYAPEETHAM on 07-10-2012.
  • Factored SMT for English-Tamil, National Workshop on “Computational Linguistics and Machine Translation from English to Indian Languages” at AMRITA VISHWA VIDYAPEETHAM on 07-10-2012.
  • Machine learning approach for Tamil POS tagging, Workshop on Tamil – POS Tagging at Madurai Kamarajar University on 07-03-2013.
  • Basics in Machine Translation, Course on Introduction to Translation at CIIL, Mysore on 16-12-2013.
  • Basics in Machine Translation, Course on Introduction to Translation at CIIL, Mysore on 04-11-2013.
  • Hybrid Machine Translation System, Lecture on Computer Science Topics at Vidya Academy of Science and Technology on 06-11-2013.
  • “Tamil NLP Tools and resources”, A National Level Workshop on “Complete Understanding of NLP in Tamil Language” at CIET, Coimbatore on 18th July 2014.
  • Machine learning approach for morphological analyzer, FDP on Language Technology, November 17-21, 2014, Govt. Engineering College, Sreekrishnapuram, Palakkad
  • “Machine Translation and Linguistic Tools”, "ulagatamilsangam" (World Tamil Association), 2nd February 2015 at Tamil Development Directorate, Chennai
  • “Linguistic Tools and Text Analytics for Tamil”, Tamil Virtual Academy, Anna University, Chennai, 30th October 2015

Publications

Publication Type: Journal Article

Year of Publication Publication Type Title

2018

Journal Article

P. V. Veena, Dr. M. Anand Kumar, and Dr. Soman K. P., “Character embedding for language identification in Hindi-English code-mixed social media text”, Computacion y Sistemas, vol. 22, pp. 65-74, 2018.


Social media platforms are now widely used by people to express their opinions or interests. The language used on social media was earlier purely English, but code-mixed text, i.e., the mixing of two or more languages, is now common. In code-mixed data, one language is often written in the script of another. To process such text, identifying the language of each word is important. The main objective of this work is to propose a technique for identifying the language of Hindi-English code-mixed data from three social media platforms: Facebook, Twitter, and WhatsApp. Each word of the Hindi-English code-mixed data was classified with one of the tags Hindi, English, Named Entity, Acronym, Universal, Mixed (Hindi along with English), and Undefined. Popular word embedding features were used to represent each word. Two kinds of embedding features were considered: word-based embedding features and character-based context features. The proposed method adds context information to the embedding features. A well-known machine learning classifier, the Support Vector Machine, was used to train and test the system. Language identification in code-mixed text using character-based embedding is a novel approach and shows promising results. © 2018 Instituto Politecnico Nacional. All rights reserved.

2017

Journal Article

K. S. Gokul Krishnan, Pooja, A., Dr. M. Anand Kumar, and Dr. Soman K. P., “Character based bidirectional LSTM for disambiguating Tamil part of speech categories”, International Journal of Control Theory and Applications, vol. 10, pp. 229-235, 2017.


Part-of-speech (POS) tagging is the process of assigning a part-of-speech tag to every word in a corpus. In this paper, POS tagging for the Tamil language is performed using a Bidirectional Long Short-Term Memory (BLSTM) network. A C2W (character-to-word) model is presented for obtaining word embeddings using a BLSTM instead of a traditional word lookup table. The C2W model uses characters to form a vector representation of a word. The word embeddings from the C2W model are used by the BLSTM to tag the words in the corpus. When tested on 3723 words, this method produced a highest accuracy of 86.45%. © International Science Press.

2016

Journal Article

S. S. Kumar, Dr. M. Anand Kumar, and Dr. Soman K. P., “Experimental analysis of Malayalam POS tagger using EPIC framework in Scala”, ARPN Journal of Engineering and Applied Sciences, vol. 11, pp. 8017-8023, 2016.


In Natural Language Processing (NLP), one of the well-studied problems under constant exploration is part-of-speech tagging, also called POS tagging or grammatical tagging. The task is to assign labels or syntactic categories such as noun, verb, adjective, adverb, and preposition to the words in a sentence or in an unannotated corpus. This paper presents a simple machine-learning-based experimental study of POS tagging using a new structured prediction framework known as EPIC, developed in the Scala programming language. This paper is the first of its kind to perform POS tagging for an Indian language using the EPIC framework. The corpus contains labelled Malayalam sentences from domains such as health, tourism and general (news, stories). The EPIC framework uses conditional random fields (CRF) to build tagging models. The framework provides several parameters that can be adjusted to improve accuracy and thereby obtain a better POS tagger model. The overall accuracy was calculated separately for each domain, with maximum accuracies of 85.48%, 85.39%, and 87.35% obtained on small tagged data in the health, tourism and general domains, respectively.

2016

Journal Article

S. Seshadri, Madasamy, A. K., Padannayil, S. K., and Dr. M. Anand Kumar, “Analyzing sentiment in Indian languages micro text using recurrent neural network”, IIOAB Journal, vol. 7, pp. 313-318, 2016.


This paper aims at improving the system submitted to the shared task on Sentiment Analysis in Indian Languages (SAIL 2015) at MIKE 2015. In this work, tweets are classified into three polarity categories: positive, negative and neutral. Twitter data in three languages, Tamil, Hindi and Bengali, was provided by the SAIL 2015 task organizers to participants in the contest. A recurrent neural network is used to analyze the sentiment in the tweets. The recurrent neural network performs better than the system submitted to the shared task, with increased accuracy. This is because the recurrent neural network concentrates more on language-specific features. In training, the recurrent neural network learns from the errors generated as intermediate output. In this way the network pursues sentiment-oriented features, which improves sentiment analysis of the tweets. The proposed system achieved accuracies of 88%, 72.01% and 65.16% for Tamil, Hindi and Bengali, respectively, on the SAIL 2015 dataset.

2016

Journal Article

S. Singh, Dr. M. Anand Kumar, and Soman, K. P., “CEN@ Amrita: Information Retrieval on CodeMixed Hindi-English Tweets Using Vector Space Models”, Working notes of FIRE, pp. 7–10, 2016.


One of the major challenges nowadays is information retrieval from social media platforms. Most of the information on these platforms is informal and noisy, which makes the retrieval task more challenging. The task is even more difficult for Twitter because of its per-tweet character limit, which forces users to express themselves in a condensed set of words. In the Indian context, the scenario is a little more complicated: users prefer to type in their mother tongue, but a lack of input tools forces them to use Roman script with English embeddings. This combination of multiple languages written in Roman script makes information retrieval even harder. Query processing for such code-mixed content is difficult because a query can be in either language and needs to be matched against documents written in any of the languages. In this work, we dealt with this problem using Vector Space Models, which gave significantly better results than the other participants. The Mean Average Precision (MAP) for our system was 0.0315, which was the second-best performance for the subtask.

2016

Journal Article

R. M. Kumar, Dr. M. Anand Kumar, Soman, K. P., and Venkatesh, R., “Cuisine Prediction based on Ingredients using Tree Boosting Algorithms”, Indian Journal of Science and Technology, vol. 9, 2016.


This paper aims at predicting the cuisine based on the ingredients using tree boosting algorithms. Methods/Analysis: Text mining is an important tool for data mining on e-commerce websites. E-commerce business is growing at a significant rate in both the Business-to-Business (B2B) and Business-to-Customer (B2C) categories. Machine-learning-based models and prediction methods are applied to real-world e-commerce data to increase revenue and study customer behavior. Many online cooking and recipe-sharing websites have taken a keen interest in developing recipe recommendation systems. In this paper, we describe a scalable end-to-end tree boosting system to predict cuisine based on ingredients, explore different data analyses, and explain the dataset types and their performance. Novelty/Improvement: An accuracy of about 80% is obtained for cuisine prediction using the XG-Boosting algorithm.

2016

Journal Article

B. Premjith, S Kumar, S., Shyam, R., Dr. M. Anand Kumar, and Soman, K. P., “A Fast and Efficient Framework for Creating Parallel Corpus”, Indian Journal of Science and Technology, vol. 9, 2016.


This paper presents a framework involving the ScanSnap SV600 scanner and Google Optical Character Recognition (OCR) for creating a parallel corpus, which is an essential component of Statistical Machine Translation (SMT). Methods and Analysis: Training a language model for an SMT system depends heavily on the availability of a parallel corpus. An efficient approach for collecting parallel sentences is the predominant step in an MT system. However, creating a parallel corpus requires extensive knowledge of both languages and is time consuming. Because of these limitations, digitizing documents becomes very difficult, which in turn affects the quality of machine translation systems. In this paper, we propose a faster and more efficient way of generating English to Indian language parallel corpora with less human involvement. With the help of a special type of scanner, the ScanSnap SV600, together with Google OCR and a little linguistic knowledge, a parallel corpus can be created for any language pair, provided that paper documents with parallel sentences exist. Findings: It was possible to generate 40 parallel sentences in one hour with this approach. Sophisticated morphological tools were used to change the morphology of the generated text and thereby increase the size of the corpus. An additional benefit is that ancient scriptures or other manuscripts can be made available in digital format, which coming generations can refer to in order to keep up the traditions of a nation or a society. Novelty: The time required for creating a parallel corpus is reduced by incorporating Google OCR and a book scanner.

2016

Journal Article

S. G. Ajay, Srikanth, M., Dr. M. Anand Kumar, and Soman, K. P., “Word Embedding Models for Finding Semantic Relationship between Words in Tamil Language”, Indian Journal of Science and Technology, vol. 9, 2016.


Word embedding models are predominantly used in many NLP tasks such as document classification, author identification, and story understanding. In this paper we compare two word embedding models for semantic similarity in the Tamil language. Each of the two models has its own way of predicting relationships between words in a corpus. Method/Analysis: In Natural Language Processing, word embedding is the representation of words as vectors. Word embedding is used as an unsupervised approach in place of traditional feature extraction. Word embedding models use neural networks to generate numerical representations of the given words. A morphologically rich language like Tamil is well suited to finding the model that best captures semantic relationships between words. Tamil is one of the oldest Dravidian languages and is known for its morphological richness; it is possible to construct 10,000 words from a single root word. Findings: We compare content-based and context-based word embedding models. We tried different feature vector sizes for the same word to assess the accuracy of the models for semantic similarity. Novelty/Improvement: Analysing word embedding models for a morphologically rich language like Tamil helps to classify words better based on their semantics.

2016

Journal Article

S. Se, Vinayakumar, R., Dr. M. Anand Kumar, and Soman, K. P., “Predicting the Sentimental Reviews in Tamil Movie using Machine Learning Algorithms”, Indian Journal of Science and Technology, vol. 9, 2016.


This paper aims at classifying Tamil movie reviews as positive or negative using supervised machine learning algorithms. Methods/Analysis: Novel machine learning approaches are needed for analyzing social media text, where data is increasing exponentially. In this work, machine learning algorithms such as SVM, Maxent classifier, Decision Tree and Naive Bayes are used to classify Tamil movie reviews as positive or negative. Features are also extracted from TamilSentiwordnet. Findings: The dataset for this work has been prepared. SVM performs well in classifying Tamil movie reviews compared with the other machine learning algorithms; both cross-validation and accuracy show this. After SVM, Decision Tree also performs well in classifying the Tamil reviews. Novelty/Improvement: SVM gives an accuracy of 75.9% for classifying Tamil movie reviews, which is a good milestone in Tamil language research.

2016

Journal Article

S. Sudhakaran, Jose, S., Dr. M. Anand Kumar, and Soman, K. P., “Knowledge based Approach for English-Malayalam Parallel Corpus Generation”, Indian Journal of Science and Technology, vol. 9, 2016.


This paper provides an overview of one part of Natural Language Generation: parallel sentence generation, which involves generating an English sentence together with its Malayalam translation. Methods/Analysis: A template-based sentence generation approach is followed. A system is proposed that takes input from a manually created bilingual dictionary and fills the slots in the templates to generate parallel sentences. Findings: Using the proposed method, we generated a total of 25,208 parallel sentences. These can be used in a bilingual Machine Translation dictionary. Application/Improvement: The proposed system uses only four templates, but by increasing the number of templates and updating the dictionary, the size of the parallel corpus that can be generated can be increased.

2016

Journal Article

S. P. Sanjay, Dr. M. Anand Kumar, and Soman, K. P., “AmritaCEN at SemEval-2016 Task 11: Complex Word Identification using Word Embedding”, Proceedings of SemEval, pp. 1022–1027, 2016.


The complex word identification task focuses on identifying words in English sentences that are difficult for non-native speakers, that is, those who do not have English as their native language. It is a subtask of lexical simplification. We experimented with word embedding features, orthographic word features, similarity features and POS tag features, which improve the performance of the classification. In addition to the SemEval 2016 results, we evaluated the training data by varying the vector dimension size and obtained the best size for producing better performance. The SVM learning algorithm attains constant and maximum accuracy with a linear classifier. We achieve G-scores of 0.43 and 0.54 in identifying complex words for the two systems.

2015

Journal Article

S. N. Vinithra, Dr. M. Anand Kumar, and Dr. Soman K. P., “Analysis of sentiment classification for Hindi movie reviews: A comparison of different classifiers”, International Journal of Applied Engineering Research, vol. 10, 2015.


To decide on anything in day-to-day life, it is important to have an opinion. Every opinion carries a sentiment, which makes decisions easier. There is a huge amount of data on the web that needs to be mined to find its sentiment. This paper aims at classifying labelled textual Hindi movie reviews with different classifiers. The dataset was segregated into positive and negative reviews before processing. The goal of this paper is to predict the sentiment of online movie reviews, which take the form of documents of varied size. 10-fold cross-validation is done to check the calibre of the classifier used. The test accuracy is checked using the F1 score, which considers both precision and recall. A detailed comparison of the accuracy of unigram and bigram features for all the mentioned models is done. The proposed model is evaluated with the following classifiers: Naïve Bayes, Logistic Regression and the Random Kitchen Sink algorithm. Each of these algorithms gave better accuracy when bigrams were used. Out of these four classifying algorithms, the Naive Bayes Multinomial model is observed to have the best accuracy, at 70.37%. Hence, this sentiment analysis model, a developing big data application, is suggested for industrial applications in which predicting sentiment is a vital component.

2015

Journal Article

Dr. M. Anand Kumar, Se, S., and Soman, K. P., “AMRITA_CEN@ FIRE 2015: Extracting Entities for Social Media Texts in Indian Languages”, 2015.

2015

Journal Article

S. P. Sanjay, Dr. M. Anand Kumar, and Soman, K. P., “AMRITA_CEN-NLP@ FIRE 2015: CRF BASED NAMED ENTITY EXTRACTION FOR TWITTER MICROPOSTS”, 2015.


The proposed method implements Named Entity Recognition (NER) for four languages: English, Tamil, Malayalam, and Hindi. The results obtained from this work were submitted to the research evaluation workshop Forum for Information Retrieval and Evaluation (FIRE 2015). The single-layered problem is divided into a multi-layered one in a step called pre-processing; it has three levels of named entity tags, represented in BIO format. A Conditional Random Field (CRF) is trained on this format to implement the NER system, and the results obtained are grouped back into a single label in a step referred to as format converting. In FIRE 2015, we developed English, Tamil, Malayalam, and Hindi NER systems using CRF. FIRE estimated the average precision for all four languages.

2015

Journal Article

R. Venkatesh Kumar, Dr. M. Anand Kumar, and Soman, K. P., “AmritaCEN_NLP@ FIRE 2015 Language Identification for Indian Languages in Social Media Text”, 2015.


The growth of social media content, such as Twitter and Facebook messages and blog posts, has created many new opportunities for language technology. User-generated content such as tweets and blogs in most languages is written in Roman script because of distinct social culture and technology. Some users write in their own language's script or in mixed script. The primary challenge in processing such short messages is identifying the languages. Language identification is therefore not restricted to a single language but extends to multiple languages. The task is to label the words with the following categories: L1, L2, Named Entities, Mixed, Punctuation and Others. This paper presents the AmritaCen_NLP team's participation in the FIRE 2015 Shared Task on Mixed Script Information Retrieval, Subtask 1: Query Word Labeling, on language identification of each word in text, along with Named Entities, Mixed, Punctuation and Others, using sequence-level query labelling with a Support Vector Machine.

2015

Journal Article

S. P. Sanjay, Ezhilarasan, N., Dr. M. Anand Kumar, and P, S. K., “AMRITA-CEN@ FIRE2015: Automated Story Illustration using Word Embedding”, 2015.


Story books are copiously filled with illustrations, which are essential to the enjoyment and understanding of the story. Often the pictures themselves turn out to be more important than the content. In such cases, the principal job is to locate the best pictures to show. Stories composed for children must be enriched with pictures to sustain a child's enthusiasm, for words usually cannot do a picture justice. This system was built as part of a shared task of the Forum for Information Retrieval and Evaluation (FIRE) 2015 workshop. In this system we provide a methodology for automatically illustrating a given children's story with appropriate images from the Wikipedia ImageCLEF 2010 dataset, for better learning and understanding.

2010

Journal Article

P. J. Antony, Dr. M. Anand Kumar, and Soman, K., “Paradigm based morphological analyzer for kannada language using machine learning approach”, International journal on-Advances in Computer Science and Technology (ACST), ISSN 0973-6107, vol. 3, pp. 457–481, 2010.

Publication Type: Conference Paper

Year of Publication Publication Type Title

2016

Conference Paper

R. G. Devi, Veena, P. V., Dr. M. Anand Kumar, and Dr. Soman K. P., “AMRITA-CEN@FIRE 2016: Code-mix entity extraction for Hindi-English and Tamil-English tweets”, in CEUR Workshop Proceedings, 2016, vol. 1737, pp. 304-308.


Social media text holds information on various important aspects. Extracting such information is the basis of one of the most preliminary tasks in Natural Language Processing, entity extraction. This work was submitted as part of the shared task on Code Mix Entity Extraction for Indian Languages (CMEE-IL) at the Forum for Information Retrieval Evaluation (FIRE) 2016. Three different methodologies are proposed in this paper for entity extraction from code-mixed data. The proposed systems include approaches based on embedding models and a feature-based model. Trigram embeddings were created and BIO tag formatting was done during feature extraction. The system is evaluated using a machine-learning-based classifier, SVM-Light. Overall accuracy through cross-validation shows that the proposed system is also efficient at classifying unknown tokens.

2016

Conference Paper

S. V. Skanda, Singh, S., G. Devi, R., Veena, P. V., Dr. M. Anand Kumar, and Dr. Soman K. P., “CEN@Amrita FIRE 2016: Context based character embeddings for entity extraction in code-mixed text”, in CEUR Workshop Proceedings, 2016, vol. 1737, pp. 321-324.


This paper presents the working methodology and results for the Code Mix Entity Extraction in Indian Languages (CMEE-IL) shared task of FIRE 2016. The aim of the task is to identify entities such as person, organization, movie and location names in code-mixed tweets. The code-mixed tweets are written in English mixed with Hindi or Tamil. In this work, an entity extraction system is implemented for both Hindi-English and Tamil-English code-mixed tweets. The system employs context-based character embedding features to train a Support Vector Machine (SVM) classifier. The training data was tokenized so that each line contains a single word. These words were further split into characters. Embedding vectors of these characters are appended with the I-O-B tags and used to train the system. During the testing phase, we use context embedding features to predict the entity tags for characters in the test data. We observed that cross-validation accuracy using character embedding was better for the Hindi-English Twitter dataset than for the Tamil-English Twitter dataset.

2016

Conference Paper

P. V. Veena, G. Devi, R., Dr. M. Anand Kumar, and Dr. Soman K. P., “AMRITA-CEN@FIRE 2016: Consumer Health Information Search using keyword and word embedding features”, in CEUR Workshop Proceedings, 2016, vol. 1737, pp. 197-200.


This work was submitted to the Consumer Health Information Search (CHIS) shared task at the Forum for Information Retrieval Evaluation (FIRE) 2016. Information retrieved from any part of the web should include informative content relevant to the user's search. Hence the major task is to retrieve only documents relevant to the user's query. The given task further refines the classification into three categories of relevance: support, oppose and neutral. A user reading an article on the web must know whether the content of that article supports or opposes its title, which is a significant challenge for the system. Our proposed system is based on a combination of keyword-based features and word-embedding-based features. Sentences are classified by a machine-learning-based classifier, the Support Vector Machine (SVM).

2016

Conference Paper

H. B. Barathi Ganesh, Dr. M. Anand Kumar, and Dr. Soman K. P., “Distributional semantic representation in health care text classification”, in CEUR Workshop Proceedings, 2016, vol. 1737, pp. 201-204.


This paper describes our proposed system for the Consumer Health Information Search (CHIS) task. The objective of task 1 is to classify the sentences in a document as relevant or irrelevant with respect to the query, and task 2 is to analyse the sentiment of the sentences in the documents with respect to the given query. In the proposed approach, a distributional representation of the text, along with its statistical and distance measures, is used to perform the given tasks as a text classification problem. In our experiment, Non-Negative Matrix Factorization is used to obtain the distributed representation of the documents and queries, distance and correlation measures are taken as the features, and Random Forest Tree is used to perform the classification. The proposed approach yields average accuracies of 70.19% in task 1 and 34.64% in task 2.

2016

Conference Paper

H. B. Barathi Ganesh, Dr. M. Anand Kumar, and Dr. Soman K. P., “Conditional random fields for code mixed Entity Recognition”, in CEUR Workshop Proceedings, 2016, vol. 1737, pp. 309-312.


Entity recognition is an essential part of information extraction, in which explicitly available information and relations are extracted from the entities within the text. A plethora of information is available on social media in the form of text, and its free-style nature introduces much complexity when mining information from it. This complexity is increased further when the text is written in more than one language and uses transliterated words. In this work we used a sequential modeling algorithm with hybrid features to perform entity recognition on the corpus provided by the CMEE-IL (Code Mixed Entity Extraction - Indian Language) organizers. The approach attained nearly 95% on the training corpus for both the Tamil-English and Hindi-English tweet corpora, and 45.17% and 31.44% on the testing corpus.

2016

Conference Paper

S. Singh, Dr. M. Anand Kumar, and Dr. Soman K. P., “CEN@Amrita: Information retrieval on CodeMixed Hindi English tweets using vector space models”, in CEUR Workshop Proceedings, 2016, vol. 1737, pp. 131-134.


One of the major challenges nowadays is information retrieval from social media platforms. Most of the information on these platforms is informal and noisy, which makes the retrieval task more challenging. The task is even more difficult for Twitter because of its per-tweet character limit, which forces users to express themselves in a condensed set of words. In the Indian context, the scenario is a little more complicated: users prefer to type in their mother tongue, but a lack of input tools forces them to use Roman script with English embeddings. This combination of multiple languages written in Roman script makes information retrieval even harder. Query processing for such code-mixed content is difficult because a query can be in either language and needs to be matched against documents written in any of the languages. In this work, we dealt with this problem using Vector Space Models, which gave significantly better results than the other participants. The Mean Average Precision (MAP) for our system was 0.0315, which was the second-best performance for the subtask.

2016

Conference Paper

Dr. M. Anand Kumar, Singh, S., Kavirajan, B., and Dr. Soman K. P., “DPIL@FIRE 2016: Overview of shared task on detecting paraphrases in Indian Languages (DPIL)”, in CEUR Workshop Proceedings, 2016, vol. 1737, pp. 233-238.


This paper gives an overview of the shared task "Detecting Paraphrases in Indian Languages" (DPIL) conducted at FIRE 2016. Given a pair of sentences in the same language, participants are asked to detect the semantic equivalence between the sentences. The shared task covers four Indian languages: Tamil, Malayalam, Hindi and Punjabi. The dataset created for the shared task has been made available online and is the first open-source paraphrase detection corpus for Indian languages.

2016

Conference Paper

H. B. Barathi Ganesh, Dr. M. Anand Kumar, and Dr. Soman K. P., “Distributional semantic representation for text classification and information retrieval”, in CEUR Workshop Proceedings, 2016, vol. 1737, pp. 126-130.


The objective of this experiment is to validate the performance of distributional semantic representation of text in a classification (question classification) task and an information retrieval task. Using the distributional representation, first-level classification of the questions is performed and tweets relevant to the given queries are retrieved. The distributional representation of text is obtained by performing Non-Negative Matrix Factorization on the document-term matrix of the training and test corpus. To improve the semantic representation of the text, phrases are considered along with words. The proposed approach achieved an F1 measure of 80% and a mean average precision of 0.0377 on the Mixed Script Information Retrieval task 1 and task 2 test sets, respectively.

2016

Conference Paper

Dr. M. Anand Kumar and Dr. Soman K. P., “Amrita-CEN@MSIR-FIRE2016: Code-mixed question classification using BoWs and RNN Embeddings”, in CEUR Workshop Proceedings, 2016, vol. 1737, pp. 122-125.


Question classification is a key task in many question answering applications. Nearly all previous work on question classification has used machine learning and knowledge-based methods. This working note presents an embedding based Bag-of-Words method and Recurrent Neural Network to achieve an automatic question classification in the code-mixed Bengali-English text. We build two systems that classify questions mostly at the sentence level. We used a recurrent neural network for extracting features from the questions and Logistic regression for classification. We conduct experiments on Mixed Script Information Retrieval (MSIR) Task 1 dataset at FIRE20161. The experimental result shows that the proposed method is appropriate for the question classification task.


2016

Conference Paper

B. Ganesh, Dr. M. Anand Kumar, and Soman, K. P., “Statistical Semantics in Context Space Amrita_CEN; Author Profiling”, in Working Notes of CLEF 2016 - Conference and Labs of the Evaluation forum, 2016.

2015

Conference Paper

, Dr. M. Anand Kumar, and Dr. Soman K. P., “Deep Belief Network based Part of Speech Tagger for Telugu Language”, in 2nd IC3T International Conference on Computer and Communication Technologies, 2015.

2015

Conference Paper

M. S., Dr. M. Anand Kumar, and Dr. Soman K. P., “Paraphrase Detection for Tamil language using Deep learning algorithms”, in International Conference on Big Data and Cloud Computing (ICBDCC-2015), 2015.

2015

Conference Paper

H. B. Barathi Ganesh, Abinaya, N., Dr. M. Anand Kumar, Vinayakumar, R., and Dr. Soman K. P., “AMRITA - CEN@NEEL : Identification and linking of twitter entities”, in CEUR Workshop Proceedings, Florence; Italy, 2015, vol. 1395, pp. 64-65.


A short text gets updated every now and then. With the global upswing of such microposts, the need to retrieve information from them has become pressing. This work focuses on knowledge extraction from microposts using entities as evidence. The extracted entities are linked to their relevant DBpedia source through featurization, Part-of-Speech (POS) tagging, Named Entity Recognition (NER) and Word Sense Disambiguation (WSD). This short paper describes its contribution to the #Micropost2015 - NEEL task, experimenting with existing Machine Learning (ML) algorithms. Copyright © 2015 held by author(s

2015

Conference Paper

N. Abinaya, John, N., Ganesh, B. H. B., Dr. M. Anand Kumar, and Soman, K. P., “AMRITA_CEN@FIRE-2014: Named Entity Recognition for Indian Languages Using Rich Features”, in Proceedings of the Forum for Information Retrieval Evaluation, New York, NY, USA, 2015.


This paper implements Named Entity Recognition (NER) for four languages: English, Tamil, Hindi and Malayalam. The results obtained from this work were submitted to a research evaluation workshop, the Forum for Information Retrieval and Evaluation (FIRE 2014). The system detects three levels of named entity tags, referred to as nested named entities. This is a multi-label problem, solved using the chain classifier method. In this work, Conditional Random Fields (CRF) and Support Vector Machines (SVM) are used to implement the NER system. For FIRE 2014, we developed an English NER system using CRF, while the NER systems for Tamil, Hindi and Malayalam were based on SVM. FIRE estimated the average precision across all four languages as 41.93 for the outermost level and 33.25 for the inner level. To improve performance on the Indian languages, we implemented a CRF-based NER system for the same corpus in Tamil, Hindi and Malayalam. The average precision for these languages is 42.87 for the outer level and 36.31 for the inner level. The overall performance of the NER system improved by 2.24% for the outer level and 9.20% for the inner level.

2015

Conference Paper

Dr. M. Anand Kumar and Soman, K. P., “AMRITA_CEN@ ICON-2015: Part-of-Speech Tagging on Indian Language Mixed Scripts in Social Media”, in ICON 2015, 2015.

2014

Conference Paper

P. Sanjanaashree, Dr. M. Anand Kumar, and Dr. Soman K. P., “Language learning for visual and auditory learners using scratch toolkit”, in 2014 International Conference on Computer Communication and Informatics: Ushering in Technologies of Tomorrow, Today, ICCCI 2014, https://www.scopus.com/record/display.uri?eid=2-s2.0-84911391150&origin=inward&txGid=0, 2014.


In recent years, with the development of technology, life has become very easy. Computers have become the lifeline of today's high-tech world, and hardly any daily work is done without them. In the field of education in particular, people have started to prefer e-books to carrying textbooks. Visualization plays a major role in learning. When a visualization tool and auditory learning come together, they bring an in-depth understanding of data and their phoneme sequences through animation and proper pronunciation of the words, which is far better than learning from textbooks, imagining from one's own perspective and settling on one's own pronunciation. Scratch, with its visual, block-based programming platform, is widely used among high school students to learn programming basics. We found that many schools around the world use Scratch to teach programming basics. A literature review shows that students find it interesting and are very curious about it. This made us eager to explore natural language learning using Scratch because of its engaging visual platform. This paper is based on the concept of visual and auditory learning. Here, we describe how we use the Scratch toolkit for learning a secondary language. We also claim that visual learning helps people remember more easily than reading text in books, and that auditory learning helps with proper pronunciation of words without relying on someone else's help. We have developed a Scratch-based tool for learning simple sentence construction in a secondary language through a primary language. In this paper, the languages used are English (secondary language) and Tamil (primary language). This is an initial effort towards a language learning tool in Scratch. It is applicable to other language-specific exercises and can easily be adapted to other languages. © 2014 IEEE.

2014

Conference Paper

P. Sanjanaashree and Dr. M. Anand Kumar, “Joint layer based deep learning framework for bilingual machine transliteration”, in Proceedings of the 2014 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2014, Delhi, India, 2014, pp. 1737-1743.


With the growth of the Internet or World Wide Web (WWW) and the emergence of social networking sites such as Friendster and Myspace, the information society started facing exhilarating challenges in language technology applications such as Machine Translation (MT) and Information Retrieval (IR). Researchers have worked on Machine Translation dealing with real-time information for over 50 years, since the first computers came along. However, the need for translating data has become greater than before as the world has come together through social media. In particular, translating proper nouns and technical terms has become a challenging task in Machine Translation. Machine transliteration emerged as a part of information retrieval and machine translation projects to translate Named Entities based on phonemes and graphemes, since these are not registered in the dictionary. Many researchers have used approaches such as conventional graphical models and have also adopted other machine translation techniques for machine transliteration. Machine transliteration has always been viewed as a machine learning problem. In this paper, we present a new machine learning approach, termed Deep Learning, for improving the bilingual machine transliteration task for Tamil and English with a limited corpus. This technique precedes Artificial Intelligence. The system is built on a Deep Belief Network (DBN), a generative graphical model, which has been shown to work well on other machine learning problems. We obtained 79.46% accuracy for English-to-Tamil transliteration and 78.4% for Tamil-to-English transliteration. © 2014 IEEE.

2014

Conference Paper

A. Aravind and Dr. M. Anand Kumar, “Machine learning approach for correcting preposition errors using SVD features”, in Proceedings of the 2014 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2014, 2014.

2014

Conference Paper

Dr. M. Anand Kumar, V., D., Soman K. P., and V., S., “Improving the Performance of English-Tamil Statistical Machine Translation System using Source-Side Pre-Processing”, in Proceedings of International Conference on Advances in Computer Science, AETACS, 2014.


Machine Translation is one of the oldest and most active research areas in Natural Language Processing. Currently, Statistical Machine Translation (SMT) dominates Machine Translation research. SMT is an approach to Machine Translation that uses statistical models to learn translation patterns directly from bilingual text corpora and generalizes them to translate new, unseen text. The SMT approach is largely language independent, i.e. the models can be applied to any language pair. Where such corpora are available, excellent results can be attained when translating similar texts, but such corpora are still not available for many language pairs. SMT systems, in general, have difficulty handling morphology on the source or the target side, especially for morphologically rich languages. Errors in morphology or syntax in the target language can have severe consequences for the meaning of the sentence: they change the grammatical function of words, or distort the understanding of the sentence through incorrect tense information in the verb. Baseline SMT, also known as Phrase-Based Statistical Machine Translation (PBSMT), does not use any linguistic information and operates only on surface word forms. Recent research has shown that adding linguistic information helps to improve translation accuracy with a smaller amount of bilingual corpora. Linguistic information can be added using a Factored Statistical Machine Translation system through pre-processing steps. This paper investigates how English-side pre-processing can be used to improve the accuracy of an English-Tamil SMT system.


2014

Conference Paper

, Anirudh Nair, Dr. M. Anand Kumar, and Dr. Soman K. P., “AMRITA@ FIRE-2014: Named Entity Recognition for Indian languages (Working notes)”, in International Workshop: "NER shared Task" Forum for Information Retrieval Evaluation (FIRE-2014), Bengaluru, 2014.

2014

Conference Paper

Dr. M. Anand Kumar, Rajendran, S., and Dr. Soman K. P., “AMRITA@ FIRE-2014: Morpheme Extraction for Tamil using Machine Learning (Working notes)”, in International Workshop: "MET shared Task" Forum for Information Retrieval Evaluation (FIRE-2014), Bengaluru, 2014.

2014

Conference Paper

Dr. M. Anand Kumar and Soman, K. P., “AMRITA_CEN@ FIRE-2014: Morpheme Extraction and Lemmatization for Tamil using Machine Learning”, in Proceedings of the Forum for Information Retrieval Evaluation, 2014.

2011

Conference Paper

Dr. M. Anand Kumar, “Morphological Generator for Tamil”, in National Seminar on Computational Linguistics and Language Technology, Annamalai University, Chidambaram, 2011.

2011

Conference Paper

R. Dhivya, Dhanalakshmi, V., Dr. M. Anand Kumar, and Soman, K. P., “Clause Boundary Identification For Tamil Language Using Dependency Parsing - SPIT2011”, in International Joint Conference on Advances in Signal Processing and Information Technology – SPIT 2011, 2011.

2010

Conference Paper

Dr. M. Anand Kumar, Dhanalakshmi, V. V., Rajendran, S., Dr. Soman K. P., and Rekha, K. U., “A novel algorithm for Tamil morphological generator (Best Second Paper)”, in 8th International Conference on Natural Language Processing (ICON2010), IIT Kharagpur, India, 2010.


Tamil is a morphologically rich, agglutinative language. Because it is agglutinative, most word features are affixed postpositionally to the root word. The morphological generator takes a lemma, a POS category and a morpholexical description as input and gives a word form as output. It is the reverse process of a morphological analyzer. In any natural language generation system, the morphological generator is an essential component of the post-processing stage. The morphological generator implemented here is based on a new algorithm that is simple and efficient and does not require any rules or a morpheme dictionary. Paradigm classification is done for nouns and verbs based on S. Rajendran’s paradigm classification. Tamil verbs are classified into 32 paradigms with 1884 inflected forms. Similarly, nouns are classified into 25 paradigms with 325 word forms. This approach requires only a minimal amount of data, so it can easily be applied to less-resourced and morphologically rich languages.

2008

Conference Paper

D. V., Dr. M. Anand Kumar, S., V. M., R., L., Soman K. P., and S., R., “Tamil Part-of-Speech Tagger based on SVM Tool”, in Proceedings of International Conference on Asian Language Processing 2008 (IALP 2008), Chiang Mai, Thailand, 2008.
