Word Level Language Identification in Code-Mixed Data using Word Embedding Methods for Indian Languages

Publication Type : Conference Paper

Publisher : 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Bangalore, India.

Source : 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Bangalore, India, 2018

Url : https://ieeexplore.ieee.org/abstract/document/8554501

Keywords : Adaboost and Random Forest, Code mixed, code-mixed data, Continuous Bag of Words, Dictionaries, F-Score, Facebook, Feature vectors, Forestry, Gauss Naive Bayes, Hidden Markov models, Indian languages, Internet, K-nearest neighbors, learning (artificial intelligence), logistic regression, Logistics, Machine Learning Algorithm, macro averaging, micro averaging, Natural language processing, Precision, Recall, Regression analysis, Search engines, Skip-Gram, social media networking, Social network services, Social networking (online), Support Vector Machine, Support vector machines, Tagging, Web-based social networking system, word embedding methods, word level language identification, Word2vec

Campus : Amritapuri

School : Department of Computer Science and Engineering, School of Engineering

Center : Computational Linguistics and Indic Studies

Department : Computer Science

Verified : No

Year : 2018

Abstract : In recent years, social media networking has become a prominent part of everyday life. Facebook operates the world's leading web-based social networking system, with over 2.19 billion users (as of the first quarter of 2018). As its popularity has increased, more individuals from all age groups have joined the platform. As a result, code-mixed data has become common on social media. The aim of our project was to identify the different languages in code-mixed data at the word level. Word embedding methods, namely the Continuous Bag of Words (CBOW) and Skip-Gram models, were compared as ways to generate feature vectors. These vectors were given as input to machine learning algorithms such as Support Vector Machine, Logistic Regression, K-Nearest Neighbors, Gaussian Naive Bayes, AdaBoost, and Random Forest, which yielded good cross-validation scores. Precision, Recall, F-Score, and micro and macro averaging were used as evaluation measures.
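
The abstract names micro and macro averaging among its evaluation measures. As a minimal sketch of how the two differ for word-level language tags, the snippet below computes both F-scores in plain Python. The function name `micro_macro_f1`, the tags `hi` and `en`, and the sample sequences are hypothetical illustrations. They are not taken from the paper, which does not give its data or code here.

```python
from collections import Counter


def micro_macro_f1(gold, pred):
    """Return (micro_f1, macro_f1) for two equal-length label sequences."""
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0, 0.0

    # Per-label true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-label F1 scores, so every label counts equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all labels first, so every word counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


# Hypothetical word-level tags for a short code-mixed sentence.
gold = ["hi", "hi", "en", "en", "en", "hi"]
pred = ["hi", "en", "en", "en", "en", "hi"]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

When each word carries exactly one tag, the micro-averaged F-score equals plain accuracy. The macro average gives a rarely used language the same weight as a dominant one, which may matter for imbalanced code-mixed data.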

Cite this Research Publication : I. Chaitanya, Madapakula, I., Gupta, S. K., and Thara, S., “Word Level Language Identification in Code-Mixed Data using Word Embedding Methods for Indian Languages”, in 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Bangalore, India, 2018
