A Fast and Efficient Framework for Creating Parallel Corpus

Publication Type : Journal Article

Publisher : Indian Journal of Science and Technology

Source : Indian Journal of Science and Technology, Volume 9, Number 45 (2016)

Url :

Keywords : Google OCR, machine translation, Parallel Corpus, Scansnap SV600 Scanner, Statistical machine translation

Campus : Coimbatore

School : School of Engineering

Center : Computational Engineering and Networking

Department : Computer Science, Center for Computational Engineering and Networking (CEN)

Year : 2016

Abstract : This paper presents a framework that uses the Scansnap SV600 scanner and Google Optical Character Recognition (OCR) to create a parallel corpus, an essential component of Statistical Machine Translation (SMT). Methods and Analysis: Training a language model for an SMT system depends heavily on the availability of a parallel corpus. An effective approach for collecting parallel sentences is therefore the most important step in building an MT system. However, creating a parallel corpus requires extensive knowledge of both languages and is a time-consuming process. Because of these limitations, digitizing documents becomes very difficult, which in turn affects the quality of machine translation systems. In this paper, we propose a faster and more efficient way of generating English-to-Indian-language parallel corpora with less human involvement. Using a special type of scanner, the Scansnap SV600, together with Google OCR and a little linguistic knowledge, a parallel corpus can be created for any language pair, provided that paper documents containing parallel sentences are available. Findings: With this approach, it was possible to generate 40 parallel sentences in 1 hour. Sophisticated morphological tools were used to change the morphology of the generated text and thereby increase the size of the corpus. An additional benefit is that ancient scriptures and other manuscripts can be made available in digital format, which coming generations can refer to in order to preserve the traditions of a nation or a society. Novelty: The time required to create a parallel corpus is reduced by incorporating Google OCR and a book scanner.

Cite this Research Publication : Premjith B, Sachin Kumar S, Shyam R, M Anand Kumar, K P Soman, A Fast and Efficient Framework for Creating Parallel Corpus, Indian Journal of Science and Technology, Volume 9, Number 45, 2016

