Detection of Hate Speech Text in Hindi-English Code-mixed Data

Publication Type : Conference Paper

Publisher : Procedia Computer Science

Source : Procedia Computer Science, Volume 171, p.737 - 744 (2020)

Url : http://www.sciencedirect.com/science/article/pii/S1877050920310498

Keywords : Code-mixed, fastText, Hate speech, Natural language processing, Support Vector Machine

Campus : Coimbatore

School : School of Engineering

Center : Center for Computational Engineering and Networking

Department : Center for Computational Engineering and Networking (CEN), Electronics and Communication

Year : 2020

Abstract : Social media sites such as Twitter and Facebook, being user-friendly and free, give people an opportunity to air their views. People of all age groups use these sites to share every moment of their lives, flooding them with data. Alongside these commendable features, the sites have a downside as well. Because they place few restrictions on how users express their views, anybody can make adverse and unrealistic comments in abusive language against anybody else, with the ulterior motive of tarnishing that person’s image and status in society. It has therefore become a major responsibility for the Government and for these sites to identify such hate content before it spreads to the masses. Automatic hate speech detection faces many challenges due to non-standard variations in spelling and grammar. In a country like India, with its large multilingual and bilingual population, hate content is often in code-mixed form, which makes the task more demanding. This paper proposes a machine learning model to detect hate speech in Hindi-English code-mixed social media text. The methodology uses fastText, Facebook’s pre-trained word embedding library, to represent 10000 data samples collected from different sources and labelled as hate or non-hate. The performance of the proposed methodology is compared with word2vec and doc2vec features, and fastText features are observed to give a better feature representation with a Support Vector Machine (SVM) classifier using a Radial Basis Function (RBF) kernel. The paper also offers researchers working on code-mixed data the insight that character-level features provide the best results for such data.
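
Illustration : The abstract's point about character-level features can be shown with a small sketch. The Python code below is not the authors' implementation and does not call the fastText library; it only mimics the general idea of subword features by splitting a word, wrapped in boundary markers, into character n-grams. The function names `char_ngrams` and `jaccard`, and the example spellings "nahi" and "nahin", are assumptions chosen for illustration. Two romanized spelling variants that do not match as whole words still share several character n-grams.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of `word` wrapped in '<' and '>' markers.

    Hypothetical helper: it imitates the idea of subword units,
    not the fastText library's internal code.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def jaccard(a, b):
    """Jaccard overlap between two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Two assumed romanized spellings of the same word.
variant_1, variant_2 = "nahi", "nahin"

# A word-level comparison sees two unrelated tokens.
word_match = variant_1 == variant_2

# A character-level comparison (trigrams only) still finds shared units.
overlap = jaccard(char_ngrams(variant_1, 3, 3), char_ngrams(variant_2, 3, 3))

print("word-level match:", word_match)
print("character trigram overlap:", overlap)
```

In a full pipeline, such subword information would feed into an embedding and then a classifier such as an SVM with an RBF kernel; that part is omitted here because the page gives no implementation details.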

Cite this Research Publication : Sreelakshmi K., B. Premjith, and Dr. Soman K. P., “Detection of Hate Speech Text in Hindi-English Code-mixed Data”, in Procedia Computer Science, 2020, vol. 171, pp. 737 - 744.
