Publication Type : Conference Paper
Publisher : Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Source : Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer Verlag, Volume 10478 LNCS, p.206-218 (2018)
Url : https://www.scopus.com/inward/record.uri?eid=2-s2.0-85041849478&doi=10.1007%2f978-3-319-73606-8_16&partnerID=40&md5=015290ce32cfa7ede3f79affbf15881e
ISBN : 9783319736051
Keywords : Clustering algorithms, Code-mixed text, Codes (symbols), Data mining, Entity extractions, extraction, Learning systems, Social media, Social networking (online), Support vector machines, Text processing, Tri grams, Word embedding
Campus : Coimbatore
School : School of Engineering
Center : Computational Engineering and Networking
Department : Electronics and Communication
Year : 2018
Abstract : Social media play an important role in today’s society. They are a platform where people express their opinions on various topics in natural language. Social media text generally contains code-mixed content. Code-mixing is common there because users tend to mix multiple languages in conversation rather than write in their native script using Unicode characters. Entity extraction, the task of extracting useful entities such as Person, Location and Organization, is an important primary task in social media text analytics. Extracting entities from code-mixed social media text is difficult. This paper proposes three methodologies for extracting entities from Hindi-English and Tamil-English code-mixed data. The work was submitted to the shared task on Code-Mix Entity Extraction for Indian Languages (CMEE-IL) at the Forum for Information Retrieval Evaluation (FIRE) 2016. The proposed systems include approaches based on embedding models and a feature-based model. BIO-tag formatting is done as a pre-processing step, and trigram embeddings are extracted during feature extraction. The system is built using a Support Vector Machine-based classifier. In the CMEE-IL task, we secured second position for Tamil-English data and third for Hindi-English. Additionally, the primary entities and their accuracies were analyzed in detail for further improvement of the system. © Springer International Publishing AG 2018.
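The BIO-tag formatting mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `to_bio` helper, the span format (token start, exclusive end, label), the label names, and the Tamil-English example sentence are all assumptions made up for illustration. In the BIO scheme, the first token of an entity is tagged `B-<label>`, the following tokens of the same entity `I-<label>`, and every other token `O`.

```python
def to_bio(tokens, entities):
    """Return one BIO tag per token.

    `entities` is a list of (start, end, label) token spans with `end`
    exclusive. This span format is an assumption for illustration only.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Hypothetical code-mixed Tamil-English sentence (invented for this sketch).
tokens = ["Sachin", "Tendulkar", "Mumbai", "la", "irukkar"]
spans = [(0, 2, "PERSON"), (2, 3, "LOCATION")]

for token, tag in zip(tokens, to_bio(tokens, spans)):
    print(token, tag)
```

Each (token, tag) pair produced this way could then serve as one training instance for a token-level classifier such as the SVM described in the abstract.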
Cite this Research Publication : R. G. Devi, Veena, P. V., M. Kumar, A., and Dr. Soman K. P., “Entity Extraction of Hindi-English and Tamil-English Code-Mixed Social Media Text”, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2018, vol. 10478 LNCS, pp. 206-218.