Publication Type:

Journal Article

Source:

Indian Journal of Science and Technology, Volume 9, Number 45 (2016)

URL:

http://www.indjst.org/index.php/indjst/article/view/106520

Keywords:

Google OCR, machine translation, Parallel Corpus, Scansnap SV600 Scanner., Statistical machine translation

Abstract:

A framework involving Scansnap SV600 scanner and Google Optical character recognition (OCR) for creating parallel corpus which is a very essential component of Statistical Machine Translation (SMT). Methods and Analysis: Training a language model for a SMT system highly depends on the availability of a parallel corpus. An efficacious approach for collecting parallel sentences is the predominant step in an MT system. However, the creation of a parallel corpus requires extensive knowledge in both languages which is a time consuming process. Due to these limitations, making the documents digital becomes very difficult and which in turn affects the quality of machine translation systems. In this paper, we propose a faster and efficient way of generating English to Indian languages parallel corpus with less human involvement. With the help of a special type of scanner called Scansnap SV600 and Google OCR and a little linguistic knowledge, we can create a parallel corpus for any language pair, provided there should be paper documents with parallel sentences. Findings: It was possible to generate 40 parallel sentences in 1 hour time with this approach. Sophisticated morphological tools were used for changing the morphology of the text generated and thereby increase the size of the corpus. An additional benefit of this is to make ancient scriptures or other manuscripts in digital format which can then be referred by the coming generation to keep up the traditions of a nation or a society. Novelty: Time required for creating parallel corpus is reduced by incorporating Google OCR and book scanner.

Cite this Research Publication

B. Premjith, S Kumar, S., Shyam, R., Dr. M. Anand Kumar, and Soman, K. P., “A Fast and Efficient Framework for Creating Parallel Corpus”, Indian Journal of Science and Technology, vol. 9, 2016.

207
PROGRAMS
OFFERED
6
AMRITA
CAMPUSES
15
CONSTITUENT
SCHOOLS
A
GRADE BY
NAAC, MHRD
8th
RANK(INDIA):
NIRF 2018
150+
INTERNATIONAL
PARTNERS
  • Amrita on Social Media

  • Contact us

    Amrita Vishwa Vidyapeetham,
    Amritanagar,
    Coimbatore - 641 112,
    Tamil Nadu, India.
    • Fax                 : +91 (422) 268 6274
    • Coimbatore   : +91 (422) 268 5000
    • Amritapuri    : +91 (476) 280 1280
    • Bengaluru     : +91 (080) 251 83700
    • Kochi              : +91 (484) 280 1234
    • Mysuru          : +91 (821) 234 3479
    • Chennai         : +91 (44 ) 276 02165
    • Contact Details »