This paper presents the working methodology and results on Code Mix Entity Extraction in Indian Languages (CMEE-IL) shared the task of FIRE-2016. The aim of the task is to identify various entities such as a person, organization, movie and location names in a given code-mixed tweets. The tweets in code mix are written in English mixed with Hindi or Tamil. In this work, Entity Extraction system is implemented for both Hindi-English and Tamil-English code-mix tweets. The system employs context based character embedding features to train Support Vector Machine (SVM) classifier. The training data was tokenized such that each line containing a single word. These words were further split into characters. Embedding vectors of these characters are appended with the I-O-B tags and used for training the system. During the testing phase, we use context embedding features to predict the entity tags for characters in test data. We observed that the cross-validation accuracy using character embedding gave better results for Hindi-English twitter dataset compare to Tamil-English twitter dataset.
cited By 0; Conference of 2016 Forum for Information Retrieval Evaluation, FIRE 2016 ; Conference Date: 7 December 2016 Through 10 December 2016; Conference Code:125007
S. V. Skanda, Singh, S., G. Devi, R., Veena, P. V., Dr. M. Anand Kumar, and Dr. Soman K. P., “CEN@Amrita FIRE 2016: Context based character embeddings for entity extraction in code-mixed text”, in CEUR Workshop Proceedings, 2016, vol. 1737, pp. 321-324.