This paper presents a wide range of statistical word alignment experiments that incorporate morphosyntactic information. By transforming the parallel corpus according to POS tags, lemmas or stems, we explore which linguistic information helps reduce the alignment error rate. Alignments are evaluated against a human word alignment reference, with the aim of an improved machine translation training scheme that eventually leads to better SMT performance. Experiments are carried out on a Spanish–English European Parliament Proceedings parallel corpus, in both a large and a small data track. As expected, the improvements from introducing morphosyntactic information are larger when data is scarce, but a significant improvement is also achieved in the large data task, meaning that certain linguistic knowledge is relevant even when large amounts of data are available.
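As an illustration of the evaluation described above, the following minimal sketch computes an alignment error rate against a human reference and applies a toy lemma-based corpus transformation. It assumes the commonly used definition AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|), with sure links S and possible links P; the abstract does not state which formulation the paper uses. The function names, the lemma dictionary, and the example links are hypothetical.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """Assumed AER definition: 1 - (|A & S| + |A & P|) / (|A| + |S|).

    Each argument is an iterable of (source_index, target_index) links.
    """
    a = set(hypothesis)
    s = set(sure)
    p = set(possible) | s  # every sure link also counts as possible
    if not a and not s:
        return 0.0  # nothing to align and nothing proposed
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


def to_lemmas(tokens, lemma_of):
    """Toy corpus transformation: replace each token by its lemma.

    `lemma_of` is a hypothetical lookup table; a real setup would use
    the output of a lemmatizer, POS tagger, or stemmer instead.
    """
    return [lemma_of.get(tok.lower(), tok.lower()) for tok in tokens]


# Hypothetical reference and system links for one sentence pair.
sure_links = {(0, 0), (1, 1)}
possible_links = {(2, 1)}
system_links = {(0, 0), (2, 1), (3, 3)}
aer = alignment_error_rate(system_links, sure_links, possible_links)

# Hypothetical lemma table applied before running the aligner.
lemmas = to_lemmas(["Las", "casas", "eran"], {"las": "el", "casas": "casa", "eran": "ser"})
```

In such a setup the aligner would be trained on the transformed corpus, and the resulting links mapped back to the original word positions before scoring, since the transformations sketched here keep one token per original word.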
De Gispert, A., Gupta, D., Popović, M., Lambert, P., Mariño, J. B., Federico, M., Ney, H., and Banchs, R., “Improving statistical word alignments with morpho-syntactic transformations”, in Advances in Natural Language Processing, Springer Berlin Heidelberg, 2006, pp. 368–379.