Domain biased bilingual parallel data extraction and its sentence level alignment for English-Hindi pair

Publisher : Research Journal of Applied Sciences, Engineering and Technology

Campus : Bengaluru

School : School of Engineering

Department : Mathematics

Year : 2014

Abstract : Creating parallel corpora and aligning them efficiently at the sentence level remains a challenge for structurally distinct languages with a relatively low degree of correlation. This work emphasizes the importance of domain-biased parallel data collection and presents a structured methodology for obtaining such data for the English-Hindi language pair. Sentence-level alignment of the collected data is also undertaken, since the participating languages are structurally distinct. In essence, the study has two aspects: collecting parallel corpora from different domains and aligning the extracted parallel corpus at the sentence level. The proposal is intended to help researchers in Natural Language Processing improve the accuracy, precision and robustness of their own work, which is possible only when abundant parallel corpora are available, and more so when those corpora are organized by domain and aligned at least at the sentence level. The language pair considered for the development of the algorithm is English-Hindi. Because the algorithm is generic, the approach can be scaled to other similarly structured language pairs. © Maxwell Scientific Organization, 2014.

