Publication Type:

Journal Article

Source:

(2015)

Keywords:

Information Retrieval, Language identification, Mixed Script, Short Message., Support vector machine (SVM)

Abstract:

The progression of social media contents, similar like Twitter and Facebook messages and blog post, has created, many new opportunities for language technology. The user generated contents such as tweets and blogs in most of the languages are written using Roman script due to distinct social culture and technology. Some of them using own language script and mixed script. The primary challenges in process the short message is identifying languages. Therefore, the language identification is not restricted to a language but also to multiple languages. The task is to label the words with the following categories L1, L2, Named Entities, Mixed, Punctuation and Others This paper presents the AmritaCen_NLP team participation in FIRE2015-Shared Task on Mixed Script Information Retrieval Subtask 1: Query Word Labeling on language identification of each word in text, Named Entities, Mixed, Punctuation and Others which uses sequence level query labelling with Support Vector Machine.

Cite this Research Publication

R. Venkatesh Kumar, Dr. M. Anand Kumar, and Soman, K. P., “AmritaCEN_NLP@ FIRE 2015 Language Identification for Indian Languages in Social Media Text”, 2015.

207
PROGRAMS
OFFERED
6
AMRITA
CAMPUSES
15
CONSTITUENT
SCHOOLS
A
GRADE BY
NAAC, MHRD
8th
RANK(INDIA):
NIRF 2018
150+
INTERNATIONAL
PARTNERS
  • Amrita on Social Media

  • Contact us

    Amrita Vishwa Vidyapeetham,
    Amritanagar,
    Coimbatore - 641 112,
    Tamil Nadu, India.
    • Fax                 : +91 (422) 268 6274
    • Coimbatore   : +91 (422) 268 5000
    • Amritapuri    : +91 (476) 280 1280
    • Bengaluru     : +91 (080) 251 83700
    • Kochi              : +91 (484) 280 1234
    • Mysuru          : +91 (821) 234 3479
    • Chennai         : +91 (44 ) 276 02165
    • Contact Details »