Dealing with out-of-vocabulary problem in sentence alignment using word similarity

Sentence alignment plays an essential role in building bilingual corpora, which are valuable resources for many applications such as statistical machine translation. Among the various approaches to sentence alignment, length-and-word-based methods, which rely on both sentence length and word correspondences, have been shown to be the most effective. Nevertheless, a drawback of using bilingual dictionaries trained by IBM Models in length-and-word-based methods is the out-of-vocabulary (OOV) problem. We propose using word similarity learned from monolingual corpora to overcome this problem. Experimental results showed that our method can reduce the OOV ratio and achieve better performance than some other length-and-word-based methods. This implies that word similarity learned from monolingual data may help to deal with the OOV problem in sentence alignment.
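The idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `min_sim` threshold, and the rule of scaling a borrowed translation probability by the similarity score are all assumptions made for illustration. The sketch assumes a bilingual lexicon (for example, one trained by IBM Models) stored as a nested dictionary, and a precomputed table of monolingual nearest neighbours with similarity scores.

```python
from typing import Dict, List, Tuple

# Hypothetical data shapes (assumptions, not taken from the paper):
# lexicon:   source word -> {target word: translation probability}
# neighbors: source word -> [(similar source word, similarity score)]
Lexicon = Dict[str, Dict[str, float]]
Neighbors = Dict[str, List[Tuple[str, float]]]


def translations_with_backoff(
    word: str, lexicon: Lexicon, neighbors: Neighbors, min_sim: float = 0.5
) -> Dict[str, float]:
    """Return translation candidates for `word`.

    In-vocabulary words use the lexicon directly. For an OOV word, borrow
    the translations of similar in-vocabulary words, scaling each
    probability by the similarity score (an illustrative choice).
    """
    if word in lexicon:
        return dict(lexicon[word])

    borrowed: Dict[str, float] = {}
    for neighbor, sim in neighbors.get(word, []):
        if sim < min_sim or neighbor not in lexicon:
            continue
        for target, prob in lexicon[neighbor].items():
            score = sim * prob
            # Keep the best score seen for each target word.
            if score > borrowed.get(target, 0.0):
                borrowed[target] = score
    return borrowed


def oov_ratio(
    tokens: List[str], lexicon: Lexicon, neighbors: Neighbors, min_sim: float = 0.5
) -> float:
    """Fraction of tokens that have no translation candidate at all."""
    if not tokens:
        return 0.0
    missing = sum(
        1
        for t in tokens
        if not translations_with_backoff(t, lexicon, neighbors, min_sim)
    )
    return missing / len(tokens)
```

Calling `oov_ratio` once with an empty neighbour table and once with the similarity table shows how many tokens the back-off recovers. Tokens with no sufficiently similar in-vocabulary neighbour remain OOV.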

Title: Dealing with out-of-vocabulary problem in sentence alignment using word similarity
Authors: Trieu, H.-L.
Nguyen, L.-M.
Nguyen, P.-T.
Keywords: Monolingual data
Out-of-vocabulary
Sentence alignment
Word similarity
Issue Date: 2016
Publisher: Institute for the Study of Language and Information
Citation: Scopus
Description: Proceedings of the 30th Pacific Asia Conference on Language, Information and Computation, PACLIC 2016, pages 259-266
URI: http://repository.vnu.edu.vn/handle/VNU_123/29802
ISBN: 978-896817428-5
Appears in Collections: VNU Hanoi (ĐHQGHN) articles in Scopus
