
Transformer Based Language Identification for Malayalam-English Code-Mixed Text

Publication Type : Journal Article

Publisher : IEEE Access

Source : IEEE Access, vol. 9, pp. 118837-118850, 2021, doi: 10.1109/ACCESS.2021.3104106.

URL : https://ieeexplore.ieee.org/abstract/document/9511454

Campus : Amritapuri

School : School of Computing

Center : Computational Linguistics and Indic Studies

Year : 2021

Abstract : Social media users tend to write the majority of data for under-resourced languages in code-mixed form. Code-mixing is the mixing of two or more languages within a single sentence. Research on code-mixed text helps detect security threats prevalent on social media platforms, and language identification is an essential first step in processing such text. This paper carries out word-level language identification (WLLI) of Malayalam-English code-mixed data drawn from social media platforms such as YouTube. The study centers on BERT, a transformer model, along with its variants CamemBERT and DistilBERT, for perceiving language at the word level. The proposed approach tags the Malayalam-English code-mixed data set with six labels: Malayalam (mal), English (eng), acronym (acr), universal (univ), mixed (mix), and undefined (undef). A newly developed Malayalam-English corpus was used to assess the effectiveness of state-of-the-art models such as BERT. Evaluating the proposed approach on another code-mixed language pair, Hindi-English, yielded a 9% increase in F1-score.
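For illustration, below is a minimal sketch of how word-level language identification over the six tags can be framed as token classification with a BERT-style model. The checkpoint name (bert-base-multilingual-cased), the first-subword label alignment, and the sample sentence are assumptions for illustration, not details taken from the paper; the weights are untrained for this task, so the predicted tags are meaningless until the model is fine-tuned on the annotated corpus.

# Minimal sketch: word-level language identification as token classification.
# Assumptions (not from the paper): HuggingFace Transformers, the
# bert-base-multilingual-cased checkpoint, first-subword label alignment.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["mal", "eng", "acr", "univ", "mix", "undef"]  # the paper's six tags
id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(LABELS),
    id2label=id2label,
    label2id=label2id,
)

def tag_words(words):
    """Predict one language tag per input word (untrained head here)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0]  # shape: (num_subwords, num_labels)
    preds = logits.argmax(-1).tolist()
    # Keep only the prediction of each word's first subword.
    tags, seen = [], set()
    for idx, word_id in enumerate(enc.word_ids(0)):
        if word_id is not None and word_id not in seen:
            seen.add(word_id)
            tags.append(id2label[preds[idx]])
    return list(zip(words, tags))

# Hypothetical romanized Malayalam-English input, for illustration only.
print(tag_words(["ee", "song", "superb", "aanu", "lol"]))

After fine-tuning on the tagged corpus, a call like the one above would ideally return pairs such as ("ee", "mal") and ("song", "eng"); the first-subword alignment is one common convention for mapping subword predictions back to whole words, not necessarily the one used by the authors.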

Cite this Research Publication : S. Thara and P. Poornachandran, "Transformer Based Language Identification for Malayalam-English Code-Mixed Text," in IEEE Access, vol. 9, pp. 118837-118850, 2021, doi: 10.1109/ACCESS.2021.3104106.
