Compressed Suffix Trees for Machine Translation
Matthias Petri The University of Melbourne
joint work with Ehsan Shareghi, Gholamreza Haffari and Trevor Cohn
Machine Translation
Use computational resources to translate a sentence from a source language to a target language.
Resources
- Parallel Text Corpus
  - Wikipedia headlines in multiple languages
  - European Parliament or UN transcripts
- Large Text Corpus in the target language
- Lots of data sets available for free: http://www.statmt.org/wmt15/translation-task.html
Translation Process
Given a source sentence, find a translation (the target) that has the highest probability given the source sentence:
$$P(\text{Target} \mid \text{Source}) = \frac{P(\text{Target}) \times P(\text{Source} \mid \text{Target})}{P(\text{Source})}$$
Since \(P(\text{Source})\) is fixed for a given input, it suffices to find the target that maximises \(P(\text{Target}) \times P(\text{Source} \mid \text{Target})\); these two factors are estimated next.
P(Source | Target)
Find probable candidates using the Parallel Corpus
| Language A | Language B | Word Alignment |
| --- | --- | --- |
| lo que [X,1] que los | the [X,1] that the | 0-0 1-2 3-2 4-3 |
| lo que [X,1] que los | this [X,1] that the | 0-0 3-2 4-3 |
| lo que [X,1] que los | which [X,1] the | 0-0 1-0 4-2 |
P(Target)
For each candidate sentence, use the text statistics of the monolingual corpus (a Language Model) to determine the "best" translation
q-gram Language Modeling
Assign a probability to a sequence of words \(w^n_1\)
indicating how likely the sequence is, given a language:
$$
P(w^n_1) = \prod_{i=1}^n P(w_i|w_{i-q+1}^{i-1})
$$
where
$$
w^n_1 = S[1..n] \text{ and } w^j_i = S[i..j]
$$
Example (q=4)
$$
P(\text{ the old night keeper keeps the keep in the town }) = \\
P(\text{ the }) \\
\times P(\text{ old }|\text{ the }) \\
\times P(\text{ night }|\text{ the old }) \\
\times P(\text{ keeper }|\text{ the old night }) \\
\times P(\text{ keeps }|\text{ old night keeper }) \\
\times P(\text{ the }|\text{ night keeper keeps }) \\
\times P(\text{ keep }|\text{ keeper keeps the }) \\
\times P(\text{ in }|\text{ keeps the keep }) \\
\times P(\text{ the }|\text{ the keep in }) \\
\times P(\text{ town }|\text{ keep in the })
$$
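As a small illustration of the chain rule above, the following sketch accumulates the sentence probability in log space; `cond_prob` is a hypothetical placeholder for whatever estimator (e.g. the Kneser-Ney model on the next slide) supplies \(P(w_i|w_{i-q+1}^{i-1})\).

```cpp
#include <cmath>
#include <string>
#include <vector>

// Hypothetical scorer: returns P(w | context). In the CST-based system this
// would be the Kneser-Ney estimate computed on the fly.
double cond_prob(const std::string& w, const std::vector<std::string>& context);

// Log-probability of a sentence under a q-gram model (chain rule).
double sentence_logprob(const std::vector<std::string>& sentence, std::size_t q) {
    double logp = 0.0;
    for (std::size_t i = 0; i < sentence.size(); ++i) {
        // context = at most the q-1 preceding words
        std::size_t ctx_start = (i >= q - 1) ? i - (q - 1) : 0;
        std::vector<std::string> context(sentence.begin() + ctx_start,
                                         sentence.begin() + i);
        logp += std::log(cond_prob(sentence[i], context));
    }
    return logp;
}
```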
Kneser-Ney Language Modeling
Highest Level:
$$
P(w_i|w_{i-q+1}^{i-1}) = \frac{\max(C(w_{i-q+1}^{i}) - D_q, 0)}{C(w_{i-q+1}^{i-1})} + \frac{D_q \, N_{1+}(w_{i-q+1}^{i-1} \bullet)}{C(w_{i-q+1}^{i-1})} \, P(w_i|w_{i-q+2}^{i-1})
$$
Middle Level (\(1 < k < q\)):
$$
P(w_i|w_{i-k+1}^{i-1}) = \frac{\max(N_{1+}(\bullet \, w_{i-k+1}^{i}) - D_k, 0)}{N_{1+}(\bullet \, w_{i-k+1}^{i-1} \, \bullet)} + \frac{D_k \, N_{1+}(w_{i-k+1}^{i-1} \bullet)}{N_{1+}(\bullet \, w_{i-k+1}^{i-1} \, \bullet)} \, P(w_i|w_{i-k+2}^{i-1})
$$
Lowest Level:
$$
P(w_i) = \frac{N_{1+}(\bullet w_{i})}{N_{1+}(\bullet \bullet)}
$$
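The recursion can be written down directly. A minimal sketch, assuming hypothetical count helpers (raw_count, n1p_front, n1p_back, n1p_both, n1p_bigrams) for the quantities defined under Terminology below, plus a discount vector D; in the CST-based system these counts are answered on the fly rather than prestored.

```cpp
#include <algorithm>
#include <string>
#include <vector>

using ngram = std::vector<std::string>;     // a sequence of words

// Hypothetical count oracles (see Terminology); provided by the index.
double raw_count(const ngram& a);           // C(a)
double n1p_front(const ngram& a);           // N_{1+}(. a)
double n1p_back(const ngram& a);            // N_{1+}(a .)
double n1p_both(const ngram& a);            // N_{1+}(. a .)
double n1p_bigrams();                       // N_{1+}(. .)
extern std::vector<double> D;               // discounts, D[k] for level k

// P(w | ctx), where ctx holds the preceding words of the current level.
// A real implementation must also guard against zero denominators.
double pkn(const std::string& w, const ngram& ctx, bool highest_level) {
    if (ctx.empty())                                    // lowest level
        return n1p_front({w}) / n1p_bigrams();

    std::size_t k = ctx.size() + 1;                     // current n-gram order
    ngram full = ctx; full.push_back(w);                // ctx . w
    ngram shorter(ctx.begin() + 1, ctx.end());          // drop the leftmost word

    double num   = highest_level ? raw_count(full) : n1p_front(full);
    double denom = highest_level ? raw_count(ctx)  : n1p_both(ctx);
    double gamma = D[k] * n1p_back(ctx) / denom;        // interpolation weight

    return std::max(num - D[k], 0.0) / denom + gamma * pkn(w, shorter, false);
}
// Top-level call: pkn(w_i, {q-1 preceding words}, true).
```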
Terminology
- \(C(w_{i}^j)\): Standard count of \(S[i..j]\) in the corpus
- \(N_{1+}(\bullet \alpha) = |\{w: c(w \alpha)>0\}|\)
is the number of observed word types preceding the pattern \(\alpha = w_{i}^j\)
- \(N_{1+}(\alpha \bullet ) = |\{w: c(\alpha w)>0\}|\)
is the number of observed word types following the pattern \(\alpha\)
- \(N_{1+}(\bullet \alpha \bullet)\): the number of unique contexts (left and right) where \(\alpha \) occurs
- \(D_k\): Discount parameter for level \(k\) of the recursion, computed at index construction time
- \(N_{1+}(\bullet \bullet)\): Number of unique bi-grams
- \(N_{1}(\alpha \bullet) = |\{w: c(\alpha w)=1\}|\)
is the number of observed word types following the pattern \(\alpha = w_{i}^j\) that occur exactly once
- \(N_{2}(\alpha \bullet) = |\{w: c(\alpha w)=2\}|\)
is the number of observed word types following the pattern \(\alpha = w_{i}^j\) that occur exactly twice
- \(N_{3+}(\alpha \bullet ) = |\{w: c(\alpha w)\geq3\}|\)
is the number of observed word types following the pattern \(\alpha\) that occur at least 3 times
ARPA Files
ARPA-based Language Models
Instead of precomputation, can we compute the probabilities on the fly using Compressed Suffix Trees?
Advantages
- No restriction on the recursion depth, which results in better probability estimates
- Current approaches prestore probabilities, which uses space exponential in q
- Construction of a CST is faster than precomputing all probabilities
- Can operate on word and character alphabets (characters need a larger q to be useful)
Use two CSTs: one over the text and one over the reversed text
Kneser-Ney Language Modeling
Perform backward search and keep track of dependencies
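A minimal, self-contained sketch of this backward search step using SDSL is given below; the toy string and the character-level cst_sct3<> are illustration-only assumptions (a word-level system would use an integer-alphabet CSA).

```cpp
#include <sdsl/suffix_trees.hpp>
#include <iostream>
#include <string>

int main() {
    // Character-level CST over a toy text (assumption for illustration only).
    sdsl::cst_sct3<> cst;
    sdsl::construct_im(cst, std::string("the old night keeper keeps the keep"), 1);

    std::string pattern = "keep";
    uint64_t sp = 0, ep = cst.csa.size() - 1;           // interval of the empty pattern
    // Extend the match one symbol at a time, right to left.
    for (auto it = pattern.rbegin(); it != pattern.rend(); ++it) {
        uint64_t new_sp = 0, new_ep = 0;
        if (sdsl::backward_search(cst.csa, sp, ep,
                                  (unsigned char)*it, new_sp, new_ep) == 0) {
            std::cout << "pattern does not occur\n";
            return 0;
        }
        sp = new_sp; ep = new_ep;
    }
    std::cout << "occurrences of \"" << pattern << "\": " << ep - sp + 1 << "\n";
    return 0;
}
```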
Computing \(N_{1+}(\bullet w_{i}^j \bullet)\)
Naive approach (expensive; a sketch follows after this list):
- Find the set S of all symbols preceding \(w_{i}^j\)
- For each \(\alpha \in S\) determine \(N_{1+}(\alpha w_{i}^j \bullet)\)
- Sum over all \(\alpha\)
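A sketch of this naive computation, assuming an already built character-level sdsl::cst_sct3<> together with the SA interval \([sp,ep]\) of the pattern and its length; the exact semantics of wl() (returning the root when no Weiner link exists) and the sentinel handling are assumptions to verify against SDSL.

```cpp
#include <sdsl/suffix_trees.hpp>

// N_{1+}(. P .) computed naively: one Weiner link per alphabet symbol.
template <class t_cst>
uint64_t n1p_both_naive(const t_cst& cst, uint64_t sp, uint64_t ep, uint64_t pat_len) {
    // CST node whose leaf interval is [sp, ep].
    auto v = (sp == ep) ? cst.select_leaf(sp + 1)
                        : cst.lca(cst.select_leaf(sp + 1), cst.select_leaf(ep + 1));
    uint64_t total = 0;
    for (uint64_t i = 1; i < cst.csa.sigma; ++i) {   // skip the sentinel (comp symbol 0)
        auto c = cst.csa.comp2char[i];
        auto u = cst.wl(v, c);                       // node of c.P (root if c.P does not occur)
        if (u == cst.root()) continue;
        if (cst.is_leaf(u) || cst.depth(u) > pat_len + 1)
            total += 1;                              // c.P ends inside an edge: one continuation
        else
            total += cst.degree(u);                  // one continuation type per child
    }
    return total;                                    // sentence tags still need special care
}
```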
Use only one wavelet tree based CST over the text (a sketch follows after this list)
- \(N_{1+}(w_{i}^j \bullet)\) is computed as before using the CST
- \(N_{1+}(\bullet w_{i}^j)\): for the range \([sp,ep]\) of the pattern, use the wavelet tree to visit all leaves in \(BWT[sp,ep]\) (interval_symbols in SDSL)
- \(N_{1+}(\bullet w_{i}^j \bullet)\): visiting all leaves in \(BWT[sp,ep]\) implicitly computes all Weiner links of the CST node corresponding to \(w_{i}^j\), as the \([sp,ep]\) ranges of all \(\bullet w_{i}^j\) are obtained during the wavelet tree traversal
- Determine the number of children of the nodes found to compute \(N_{1+}(\bullet w_{i}^j \bullet)\)
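A sketch of this single-pass variant is shown below, assuming the CSA's wavelet tree exposes interval_symbols with the usual SDSL signature (half-open range, distinct BWT symbols plus their ranks); the function name n1p_counts and the mid-edge/sentinel handling are illustrative assumptions.

```cpp
#include <sdsl/suffix_trees.hpp>
#include <vector>

// One wavelet tree traversal over BWT[sp, ep] yields N_{1+}(. P) and, via the
// implicit Weiner link intervals, N_{1+}(. P .).
template <class t_cst>
void n1p_counts(const t_cst& cst, uint64_t sp, uint64_t ep, uint64_t pat_len,
                uint64_t& front, uint64_t& both) {
    using wt_type = typename t_cst::csa_type::wavelet_tree_type;
    uint64_t k = 0;                                    // distinct symbols in BWT[sp, ep]
    std::vector<typename wt_type::value_type> cs(cst.csa.sigma);
    std::vector<uint64_t> rank_lo(cst.csa.sigma), rank_hi(cst.csa.sigma);
    cst.csa.wavelet_tree.interval_symbols(sp, ep + 1, k, cs, rank_lo, rank_hi);

    front = both = 0;
    for (uint64_t i = 0; i < k; ++i) {
        if (cs[i] == 0) continue;                      // sentinel; sentence tags need extra care
        ++front;                                       // one more preceding type: N_{1+}(. P)
        // SA interval of cs[i].P, i.e. the Weiner link interval.
        uint64_t c_begin = cst.csa.C[cst.csa.char2comp[cs[i]]];
        uint64_t lb = c_begin + rank_lo[i];
        uint64_t rb = c_begin + rank_hi[i] - 1;
        auto u = (lb == rb) ? cst.select_leaf(lb + 1)
                            : cst.lca(cst.select_leaf(lb + 1), cst.select_leaf(rb + 1));
        if (cst.is_leaf(u) || cst.depth(u) > pat_len + 1)
            both += 1;                                 // cs[i].P ends inside an edge
        else
            both += cst.degree(u);                     // number of children = following types
    }
}
```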
Other Considerations
- Special case when the pattern search ends in the middle of an edge in the CST
- Special handling of sentence start and end tags, which would otherwise distort the counts
- Ensure correctness by comparing against state-of-the-art systems (KenLM and SRILM)
Construction and Query Time
Query Time Breakdown
Future Work
- Precomputing some of the counts is very cheap and speeds up query processing significantly
- Alphabet-Partitioning for larger alphabets (requires interval symbols)
- Already competitive with state-of-the-art implementations
- Backward Search now main cost factor
- Many open problems in the MT field to which succinct structures could be applied
- Easy entry, as test collections and software are freely available