Publication Type:

Conference Paper

Source:

2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, Bangalore, India (2018)

URL:

https://ieeexplore.ieee.org/abstract/document/8554918

Keywords:

AdaBoost, CBOW, code-mixed data, contemporaneous information, continuous bag of words model, Data mining, English-Hindi, exhaustive comparison study, Feature extraction, Feature vectors, Forestry, Gaussian Naive Bayes, global vectors for word representation, GloVe, Indian language, Internet, kNN, learning (artificial intelligence), logistic regression, Machine learning, Machine learning algorithms, Named entity recognition, Natural language processing, NER, Random forest, skip gram model, Social media, Social networking (online), Support vector machines, SVM, Task analysis, term frequency and inverse document frequency, TF-IDF, word embedding model, word vector, Word2vec

Abstract:

Communication has increased many-fold in the internet era, making social media a lively platform for the exchange of information. Many people mix two or more languages within a single conversation as they share contemporaneous information; this practice is known as code mixing. Extracting relevant and meaningful information from such mixed-language text is a tedious exercise. The objective of this paper is to perform named entity recognition (NER), one of the challenging tasks in natural language processing. The method proposed herein presents a hitherto unaddressed, exhaustive comparison of four word embedding approaches: the Continuous Bag of Words model (CBOW), the Skip-gram model, Term Frequency-Inverse Document Frequency (TF-IDF), and Global Vectors for Word Representation (GloVe). These word-vector schemes capture the meaning of words along different dimensions, here for the code-mixed language pair English-Hindi. The resulting feature vectors, computed from word co-occurrences, yielded good cross-validation scores across six conventional machine learning algorithms. The study reveals that TF-IDF is the best word embedding model, yielding the highest accuracy on the small dataset. Precision, recall, and F-measure were used as evaluation measures.
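The pipeline the abstract describes (word-vector features fed to conventional classifiers for token-level NER) can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the tiny romanized English-Hindi tokens and binary entity labels below are invented, and the paper's actual corpus, features, and hyperparameters are not given here.

```python
# Hypothetical sketch of a TF-IDF + classifier setup for code-mixed NER,
# in the spirit of the comparison study described in the abstract.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy token-level framing: each token is labeled 1 if it is a named
# entity, 0 otherwise. Data and labels are invented for illustration.
tokens = ["Delhi", "mein", "rain", "ho", "rahi", "hai",
          "Sachin", "ne", "century", "banayi", "aaj", "Mumbai"]
labels = [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

# Character n-gram TF-IDF captures sub-word cues, which can help with
# the spelling variation typical of romanized code-mixed text.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
clf.fit(tokens, labels)
preds = clf.predict(["Delhi", "hai"])
print(preds)
```

Swapping the `TfidfVectorizer` step for CBOW, Skip-gram, or GloVe vectors (and the classifier for SVM, kNN, random forest, Gaussian Naive Bayes, or AdaBoost) yields the kind of embedding-by-classifier grid the paper compares.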

Cite this Research Publication

Sravani, L., Reddy, A. S., and Thara, S., "A Comparison Study of Word Embedding for Detecting Named Entities of Code-Mixed Data in Indian Language", in 2018 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Bangalore, India, 2018.