
Transformer Based Language Identification for Malayalam-English Code-Mixed Text

Publication Type : Journal Article

Publisher : IEEE Access

Source : IEEE Access, vol. 9, pp. 118837-118850, 2021, doi: 10.1109/ACCESS.2021.3104106.

URL : https://ieeexplore.ieee.org/abstract/document/9511454

Campus : Amritapuri

School : School of Computing

Center : Computational Linguistics and Indic Studies

Year : 2021

Abstract : Social media users tend to write the majority of data for under-resourced languages in code-mixed form. Code-mixing is the mixing of two or more languages within a single sentence. Research on code-mixed text helps detect security threats prevalent on social media platforms, and language identification is an essential first step in processing such text. This paper carries out word-level language identification (WLLI) of Malayalam-English code-mixed data drawn from social media platforms such as YouTube. The study centers on BERT, a transformer model, along with its variants CamemBERT and DistilBERT, for perceiving language at the word level. The proposed approach tags the Malayalam-English code-mixed data set with six labels: Malayalam (mal), English (eng), acronym (acr), universal (univ), mixed (mix), and undefined (undef). A newly developed Malayalam-English corpus was used to assess the effectiveness of state-of-the-art models such as BERT. Evaluating the proposed approach on another code-mixed language pair, Hindi-English, yielded a 9% increase in F1-score.
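For illustration, below is a minimal sketch of how word-level language identification over the six tags can be framed as token classification with a BERT-style model. The checkpoint name (bert-base-multilingual-cased), the first-subword label alignment, and the sample sentence are assumptions for illustration, not details taken from the paper; the weights are untrained for this task, so the predicted tags are meaningless until the model is fine-tuned on the annotated corpus.

# Minimal sketch: word-level language identification as token classification.
# Assumptions (not from the paper): HuggingFace Transformers, the
# bert-base-multilingual-cased checkpoint, first-subword label alignment.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["mal", "eng", "acr", "univ", "mix", "undef"]  # the paper's six tags
id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(LABELS),
    id2label=id2label,
    label2id=label2id,
)

def tag_words(words):
    """Predict one language tag per input word (untrained head here)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]  # shape: (num_subwords, num_labels)
    preds = logits.argmax(-1).tolist()
    # Keep only the prediction of each word's first subword.
    tags, seen = [], set()
    for idx, word_id in enumerate(enc.word_ids(0)):
        if word_id is not None and word_id not in seen:
            seen.add(word_id)
            tags.append(id2label[preds[idx]])
    return list(zip(words, tags))

# Hypothetical romanized Malayalam-English input, for illustration only.
print(tag_words(["ee", "song", "superb", "aanu", "lol"]))

After fine-tuning on the tagged corpus, a call like the one above would ideally return pairs such as ("ee", "mal") and ("song", "eng"); the first-subword alignment is one common convention for mapping subword predictions back to whole words, not necessarily the one used by the authors.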

Cite this Research Publication : S. Thara and P. Poornachandran, "Transformer Based Language Identification for Malayalam-English Code-Mixed Text," in IEEE Access, vol. 9, pp. 118837-118850, 2021, doi: 10.1109/ACCESS.2021.3104106.
