Social media sites such as Twitter and Facebook are free and easy to use, and they give people an opportunity to voice their opinions. People of all age groups use these sites to share every moment of their lives, flooding them with data. Alongside these commendable features, the sites have a downside as well. Because they place few restrictions on how users express their views, anybody can post adverse and unfounded comments in abusive language against anybody else, with the ulterior motive of tarnishing that person’s image and standing in society. It has therefore become a major responsibility for the government and for these sites to identify such hate content before it spreads to the masses. Automatic hate speech detection faces many challenges because of non-standard variations in spelling and grammar. In a country like India, with its large bilingual and multilingual population, hate content tends to appear in code-mixed form, which makes the task more demanding. This paper proposes a machine learning model to detect hate speech in Hindi-English code-mixed social media text. The methodology uses fastText, Facebook’s pre-trained word embedding library, to represent 10,000 data samples collected from different sources and labeled as hate or non-hate. The performance of the proposed methodology is compared with that of word2vec and doc2vec features, and fastText features are observed to give a better feature representation with a Support Vector Machine (SVM) classifier using a Radial Basis Function (RBF) kernel. The paper also offers researchers working on code-mixed data the insight that character-level features give the best results for such data.
K. Sreelakshmi, B. Premjith, and K. P. Soman, “Detection of Hate Speech Text in Hindi-English Code-mixed Data”, Procedia Computer Science, vol. 171, pp. 737-744, 2020.
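As a rough illustration of why character-level features can help with the spelling variation the abstract describes, the sketch below extracts character n-grams in the style of fastText subword units (words wrapped in `<` and `>` boundary markers, n-gram lengths 3 to 6) and evaluates an RBF kernel on plain Python lists. It is not the authors' implementation: the function names, the romanized example spellings, and the `gamma` value are assumptions chosen for illustration only.

```python
import math


def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def ngram_overlap(a, b):
    """Jaccard overlap between the character n-gram sets of two words."""
    grams_a, grams_b = set(char_ngrams(a)), set(char_ngrams(b))
    union = grams_a | grams_b
    if not union:
        # Both words are too short to produce any n-gram.
        return 0.0
    return len(grams_a & grams_b) / len(union)


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

For instance, `ngram_overlap("nahi", "nahin")` is greater than zero because the two spellings share subword units such as `<na` and `nah`, whereas a whole-word vocabulary would treat them as unrelated tokens.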