Complementary Approaches to Tree Alignment: Combining Statistical and Rule-Based Methods

Basic information

Original name

Complementary Approaches to Tree Alignment: Combining Statistical and Rule-Based Methods

Authors

KOTZÉ, Gideon

Edition

Groningen, The Netherlands, 221 pp. 2013

Publisher

University of Groningen

Other information

Type of outcome

Scholarly book

Confidentiality degree

is not subject to a state or trade secret

Publication form

electronic version available online

ISBN

978-90-367-6177-2

Keywords in English

syntactic tree alignment, constituent alignment, word alignment, treebanks, parallel treebanks, parallel corpora, syntax-based machine translation, maximum entropy, transformation-based learning, rule-based learning, heuristics

Abstract

In the original

Large collections of translated texts, called parallel corpora, are often aligned automatically at the word and sentence level and used as training data for machine translation systems. The sentences can also be analyzed syntactically to produce syntax trees. If this is done on both sides and the nodes of the trees are also aligned, the result is called a parallel treebank. The best translation systems are statistically based, but in recent years there has been a shift toward incorporating more linguistically motivated data, including parallel treebanks. These are only useful on a very large scale, because an effective system needs a large amount of information about how one language is translated into another. For this reason, we investigate techniques for aligning these nodes automatically and accurately, a process called tree alignment. A further motive for our research is that parallel treebanks are also useful for other techniques and, as a linguistic resource, remain scientifically interesting. We find that a combination of statistical and rule-based techniques, using relatively small sets of training data and few features, is sufficient to produce very accurate alignments. Finally, we find that applying alignments that cover a relatively large set of nodes, even though some of them are wrong, to a syntax-based machine translation system leads to better translation results than applying alignments that are more accurate but fewer in number.

Cite

KOTZÉ, Gideon. Complementary Approaches to Tree Alignment: Combining Statistical and Rule-Based Methods. Online. Groningen, The Netherlands: University of Groningen, 2013, 221 pp. ISBN 978-90-367-6177-2.

@book{2228632,
   author = {Kotzé, Gideon},
   address = {Groningen, The Netherlands},
   keywords = {syntactic tree alignment, constituent alignment, word alignment, treebanks, parallel treebanks, parallel corpora, syntax-based machine translation, maximum entropy, transformation-based learning, rule-based learning, heuristics},
   howpublished = {electronic version available online},
   location = {Groningen, The Netherlands},
   isbn = {978-90-367-6177-2},
   publisher = {University of Groningen},
   title = {Complementary Approaches to Tree Alignment: Combining Statistical and Rule-Based Methods},
   year = {2013}
}

TY  - BOOK
ID  - 2228632
AU  - Kotzé, Gideon
PY  - 2013
TI  - Complementary Approaches to Tree Alignment: Combining Statistical and Rule-Based Methods
PB  - University of Groningen
CY  - Groningen, The Netherlands
SN  - 9789036761772
KW  - syntactic tree alignment, constituent alignment, word alignment, treebanks, parallel treebanks, parallel corpora, syntax-based machine translation, maximum entropy, transformation-based learning, rule-based learning, heuristics
N2  - Large collections of translated texts, called parallel corpora, are often aligned automatically at the word and sentence level and used as training data for machine translation systems. The sentences can also be analyzed syntactically to produce syntax trees. If this is done on both sides and the nodes of the trees are also aligned, the result is called a parallel treebank. The best translation systems are statistically based, but in recent years there has been a shift toward incorporating more linguistically motivated data, including parallel treebanks. These are only useful on a very large scale, because an effective system needs a large amount of information about how one language is translated into another. For this reason, we investigate techniques for aligning these nodes automatically and accurately, a process called tree alignment. A further motive for our research is that parallel treebanks are also useful for other techniques and, as a linguistic resource, remain scientifically interesting. We find that a combination of statistical and rule-based techniques, using relatively small sets of training data and few features, is sufficient to produce very accurate alignments. Finally, we find that applying alignments that cover a relatively large set of nodes, even though some of them are wrong, to a syntax-based machine translation system leads to better translation results than applying alignments that are more accurate but fewer in number.
ER  -

KOTZÉ, Gideon. \textit{Complementary Approaches to Tree Alignment: Combining Statistical and Rule-Based Methods}. Online. Groningen, The Netherlands: University of Groningen, 2013, 221 pp. ISBN~978-90-367-6177-2.
