Exploration of Corpus Augmentation Approach for English-Hindi Bidirectional Statistical Machine Translation System

Publication Type : Journal Article

Publisher : International Journal of Electrical and Computer Engineering, IAES Institute of Advanced Engineering and Science.

Source : International Journal of Electrical and Computer Engineering, IAES Institute of Advanced Engineering and Science, Volume 6, Number 3, p.1059 (2016)

Url : http://search.proquest.com/openview/56d182152875617ad3d33bee7d58a192/1?pq-origsite=gscholar&cbl=1686344

Keywords : Corpus augmentation, Indian language, machine translation, Moses SMT, OOV, Statistical machine translation

Campus : Bengaluru

School : Department of Computer Science and Engineering, School of Engineering

Department : Computer Science, Mathematics

Year : 2016

Abstract : Although a lot of Statistical Machine Translation (SMT) research is under way for the English-Hindi language pair, no effort has been made to standardize the dataset. Each study uses a different number of sentences, different datasets, and different parameters during the various phases of translation, resulting in varied translation output. Comparing these models, understanding their results, gaining insight into corpus behavior, and reproducing the reported results therefore become tedious. This calls for standardizing the dataset and identifying common parameters for model development. The main contribution of this paper is an approach to standardize the dataset and to identify the parameters that, in combination, give the best performance. It also investigates a novel corpus augmentation approach to improve the translation quality of an English-Hindi bidirectional statistical machine translation system. This model works well in the scarce-resource setting without incorporating an external parallel corpus of the underlying language. The experiments are carried out using the open-source phrase-based toolkit Moses, with the Indian Languages Corpora Initiative (ILCI) Hindi-English tourism corpus. With a limited dataset, considerable improvement is achieved using the corpus augmentation approach for the English-Hindi bidirectional SMT system.

Cite this Research Publication : K. Jaya and Dr. Deepa Gupta, “Exploration of Corpus Augmentation Approach for English-Hindi Bidirectional Statistical Machine Translation System”, International Journal of Electrical and Computer Engineering, vol. 6, no. 3, p. 1059, 2016.
