Compressed Suffix Trees for Machine Translation

Matthias Petri
The University of Melbourne

joint work with Ehsan Shareghi, Gholamreza Haffari and Trevor Cohn

Machine Translation

Use computational resources to translate a sentence from a source language to a target language.

Resources

  • Parallel Text Corpus
    • Wikipedia headlines in multiple languages
    • European Parliament or UN transcripts
  • Large Text Corpus in the target language
  • Many data sets are freely available: http://www.statmt.org/wmt15/translation-task.html

Translation Process

Given a Source sentence, find a translation (the Target) that has the highest probability given the source sentence:


$$P(\text{Target} \mid \text{Source}) = \frac{P(\text{Target}) \times P(\text{Source} \mid \text{Target})}{P(\text{Source})}$$
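Since the source sentence is fixed during decoding, the denominator is constant and the best translation is the one that maximises the numerator:

$$ \widehat{\text{Target}} = \operatorname*{arg\,max}_{\text{Target}} \; P(\text{Target}) \times P(\text{Source} \mid \text{Target}) $$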

P(Source | Target)

Find probable candidates using the Parallel Corpus

Language A             Language B            Word Alignment
lo que [X,1] que los   the [X,1] that the    0-0 1-2 3-2 4-3
lo que [X,1] que los   this [X,1] that the   0-0 3-2 4-3
lo que [X,1] que los   which [X,1] the       0-0 1-0 4-2

P(Target)

For each candidate sentence, use the text statistics of the monolingual corpus (a Language Model) to determine the "best" translation

q-gram Language Modeling

Assign a probability to a sequence of words \(w^n_1\) indicating how likely the sequence is, given a language:

$$ P(w^n_1) = \prod_{i=1}^n P(w_i|w_{i-q+1}^{i-1}) $$
where $$ w^n_1 = S[1..n] \text{ and } w^j_i = S[i..j] $$

Example (q=4)

$$ P(\text{ the old night keeper keeps the keep in the town }) = \\ P(\text{ the} ) \\ \times P(\text{ old }|\text{ the }) \\ \times P(\text{ night }|\text{ the old }) \\ \times P(\text{ keeper }|\text{ the old night }) \\ \times P(\text{ keeps }|\text{ old night keeper }) \\ \times P(\text{ the }|\text{ night keeper keeps }) \\ \times P(\text{ keep }|\text{ keeper keeps the }) \\ \times P(\text{ in }|\text{ keeps the keep }) \\ \times P(\text{ the }|\text{ the keep in }) \\ \times P(\text{ town }|\text{ keep in the }) \\ $$
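As an illustration of the chain rule above, the following minimal C++ sketch scores a sentence under a q-gram model in log space. The scoring function `cond_prob(context, word)` is a hypothetical stand-in for whatever model supplies \(P(w_i|w_{i-q+1}^{i-1})\); it is not part of any particular library.

```cpp
#include <cmath>
#include <functional>
#include <string>
#include <vector>

// log P(w_1 .. w_n) under a q-gram model; cond_prob(context, word) is any
// scoring function returning P(word | context)
double sentence_logprob(const std::vector<std::string>& words, size_t q,
    const std::function<double(const std::vector<std::string>&,
                                const std::string&)>& cond_prob)
{
    double logp = 0.0;
    for (size_t i = 0; i < words.size(); ++i) {
        size_t start = (i + 1 >= q) ? i + 1 - q : 0;   // at most q-1 preceding words
        std::vector<std::string> context(words.begin() + start, words.begin() + i);
        logp += std::log(cond_prob(context, words[i]));
    }
    return logp;
}
```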

Kneser-Ney Language Modeling

Highest Level:

$$ P(w_i|w_{i-q+1}^{i-1}) = \frac{\max(C(w_{i-q+1}^{i}) - D_q,\, 0)}{C(w_{i-q+1}^{i-1})} + \frac{D_q\, N_{1+}(w_{i-q+1}^{i-1}\, \bullet)}{C(w_{i-q+1}^{i-1})}\, P(w_i|w_{i-q+2}^{i-1}) $$

Middle Level (\(1 < k < q\)):

$$ P(w_i|w_{i-k+1}^{i-1}) = \frac{\max(N_{1+}(\bullet\, w_{i-k+1}^{i}) - D_k,\, 0)}{N_{1+}(\bullet\, w_{i-k+1}^{i-1}\, \bullet)} + \frac{D_k\, N_{1+}(w_{i-k+1}^{i-1}\, \bullet)}{N_{1+}(\bullet\, w_{i-k+1}^{i-1}\, \bullet)}\, P(w_i|w_{i-k+2}^{i-1}) $$

Lowest Level:

$$ P(w_i) = \frac{N_{1+}(\bullet w_{i})}{N_{1+}(\bullet \bullet)} $$

Terminology

  • \(C(w_{i}^j)\): Standard count of \(S[i..j]\) in the corpus
  • \(N_{1+}(\bullet \alpha) = |\{w: c(w \alpha)>0\}|\) is the number of observed word types preceding the pattern \(\alpha = w_{i}^j\)
  • \(N_{1+}(\alpha \bullet ) = |\{w: c(\alpha w)>0\}|\) is the number of observed word types following the pattern \(\alpha\)
  • \(N_{1+}(\bullet \alpha \bullet)\): the number of unique contexts (left and right) where \(\alpha \) occurs
  • \(D_k\): Discount parameter for level \(k\) of the recursion, computed at index construction time
  • \(N_{1+}(\bullet \bullet)\): Number of unique bi-grams
  • \(N_{1}(\alpha \bullet) = |\{w: c(\alpha w)=1\}|\) is the number of observed word types following the pattern \(\alpha = w_{i}^j\) that occur exactly once
  • \(N_{2}(\alpha \bullet) = |\{w: c(\alpha w)=2\}|\) is the number of observed word types following the pattern \(\alpha = w_{i}^j\) that occur exactly twice
  • \(N_{3+}(\alpha \bullet ) = |\{w: c(\alpha w)\geq3\}|\) is the number of observed word types following the pattern \(\alpha\) that occur at least 3 times
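A minimal C++ sketch of the three-level recursion above, assuming hypothetical count providers (named `C`, `N1p_back`, `N1p_front`, `N1p_front_back` here, backed e.g. by the index) and precomputed discounts \(D_k\); the names and the zero-denominator guard are illustrative simplifications, not the implementation from the paper.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

using pattern = std::vector<uint64_t>;   // word ids w_i .. w_j

// Hypothetical count providers (assumptions, not a library API):
struct counts_t {
    std::function<double(const pattern&)> C;              // C(p)
    std::function<double(const pattern&)> N1p_back;       // N_{1+}(p .)
    std::function<double(const pattern&)> N1p_front;      // N_{1+}(. p)
    std::function<double(const pattern&)> N1p_front_back; // N_{1+}(. p .)
    double N1p_bigrams;                                   // N_{1+}(. .)
    std::vector<double> D;                                 // discounts, D[k] for order k
};

// P(w | context), |context| <= q-1, following the three levels above
double pkn(const counts_t& ct, const pattern& context, uint64_t w, size_t q)
{
    if (context.empty())                                   // lowest level
        return ct.N1p_front({w}) / ct.N1p_bigrams;

    pattern full = context;
    full.push_back(w);                                     // context . w
    size_t k = full.size();                                // current order

    double num, den;
    if (k == q) {                                          // highest level: raw counts
        num = ct.C(full);
        den = ct.C(context);
    } else {                                               // middle levels: continuation counts
        num = ct.N1p_front(full);
        den = ct.N1p_front_back(context);
    }
    if (den == 0) den = 1;                                 // simplification for unseen contexts

    double discounted = std::max(num - ct.D[k], 0.0) / den;
    double backoff    = ct.D[k] * ct.N1p_back(context) / den;

    pattern shorter(context.begin() + 1, context.end());   // drop the oldest word
    return discounted + backoff * pkn(ct, shorter, w, q);
}
```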

ARPA Files

ARPA-based Language Models

Instead of precomputing all probabilities, can we compute them on the fly using Compressed Suffix Trees?

Advantages

  • No restriction on the recursion depth, which results in better probability estimates
  • Current approaches prestore probabilities, which requires space exponential in q
  • Construction of a CST is faster than precomputing all probabilities
  • Can operate on word and character alphabets (character-level models need a larger q to be useful)

Use two CSTs: one over the text and one over the reversed text
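A minimal sketch of this setup with sdsl-lite, using a character-level index built in memory (the actual system works over word ids): \(C(\alpha)\) comes from counting in the CSA, \(N_{1+}(\alpha\,\bullet)\) from the children of the CST node of \(\alpha\), and \(N_{1+}(\bullet\,\alpha)\) from the reverse CST by searching the reversed pattern. The helper name `n1plus_back`, the mid-edge handling, and the naive treatment of boundary symbols are assumptions for illustration only.

```cpp
#include <sdsl/suffix_trees.hpp>
#include <iostream>
#include <string>

using cst_t = sdsl::cst_sct3<>;

// N_{1+}(p .): number of distinct symbols that follow p in the indexed text
uint64_t n1plus_back(const cst_t& cst, const std::string& p)
{
    uint64_t lb = 0, rb = cst.csa.size() - 1;
    if (sdsl::backward_search(cst.csa, lb, rb, p.begin(), p.end(), lb, rb) == 0)
        return 0;                                     // pattern does not occur
    if (lb == rb)
        return 1;                                     // single occurrence, one continuation
    auto v = cst.lca(cst.select_leaf(lb + 1), cst.select_leaf(rb + 1));
    // if the search ended inside an edge, the only continuation is the next edge symbol
    return (cst.depth(v) > p.size()) ? 1 : cst.degree(v);
}

int main()
{
    std::string text = "the old night keeper keeps the keep in the town";
    std::string rtext(text.rbegin(), text.rend());

    cst_t cst, cst_rev;
    sdsl::construct_im(cst, text, 1);                 // CST over the text
    sdsl::construct_im(cst_rev, rtext, 1);            // CST over the reversed text

    std::string p = "the ";
    std::string p_rev(p.rbegin(), p.rend());

    std::cout << "C(p)      = " << sdsl::count(cst.csa, p.begin(), p.end()) << '\n';
    std::cout << "N1+(p .)  = " << n1plus_back(cst, p) << '\n';
    // symbols preceding p in the text == symbols following reverse(p) in the reversed text
    std::cout << "N1+(. p)  = " << n1plus_back(cst_rev, p_rev) << '\n';
}
```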

Kneser-Ney Language Modeling

Perform backward search and keep track of dependencies
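For example, with sdsl-lite the pattern can be extended one symbol at a time with `backward_search`, retaining every intermediate \([sp,ep]\) interval; these intervals correspond to the progressively longer suffixes of the context and can be reused by the lower-order terms of the recursion. A rough sketch (the function name is ours):

```cpp
#include <sdsl/suffix_arrays.hpp>
#include <string>
#include <utility>
#include <vector>

// Collect the [sp,ep] interval of every suffix of pat, processing the pattern
// right to left with backward_search, as long as it still occurs in the text.
std::vector<std::pair<uint64_t, uint64_t>>
backward_intervals(const sdsl::csa_wt<>& csa, const std::string& pat)
{
    std::vector<std::pair<uint64_t, uint64_t>> ivals;
    uint64_t sp = 0, ep = csa.size() - 1;
    for (auto it = pat.rbegin(); it != pat.rend(); ++it) {
        if (sdsl::backward_search(csa, sp, ep, (unsigned char)*it, sp, ep) == 0)
            break;                          // longer context unseen: stop extending
        ivals.emplace_back(sp, ep);         // interval of the current suffix of pat
    }
    return ivals;
}
```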

Computing \(N_{1+}(\bullet w_{i}^j \bullet)\)

Naive approach (expensive!):

  1. Find the set S of all symbols preceding \(w_{i}^j\)
  2. For each \(\alpha \in S\) determine \(N_{1+}(\alpha w_{i}^j \bullet)\)
  3. Sum over all \(\alpha\)

Improvement: use only one wavelet-tree-based CST over the text

  • \(N_{1+}(w_{i}^j \bullet)\) computed as before using the CST
  • \(N_{1+}( \bullet w_{i}^j)\): For the range \([sp,ep]\) of the pattern, use the wavelet tree to enumerate the distinct symbols in \(BWT[sp,ep]\) (interval_symbols in SDSL)
  • \(N_{1+}( \bullet w_{i}^j \bullet )\): Visiting these wavelet tree leaves implicitly computes all Weiner links of the CST node corresponding to \(w_{i}^j\), since the \([sp,ep]\) ranges of all \(\bullet w_{i}^j\) are obtained during the wavelet tree traversal
  • Determine the number of children of each resulting node and sum them to obtain \(N_{1+}( \bullet w_{i}^j \bullet )\) (see the sketch below)
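A rough sketch of this step with sdsl-lite, assuming a `csa_wt`-based `cst_sct3` over a character-level text (the word-level case is analogous). The helper name `front_counts` and the mid-edge handling are illustrative, and the counts still naively include sentinel and sentence-boundary symbols, which the real system treats specially (see Other Considerations below).

```cpp
#include <sdsl/suffix_trees.hpp>
#include <utility>
#include <vector>

using cst_t = sdsl::cst_sct3<>;
using wt_t  = cst_t::csa_type::wavelet_tree_type;

// returns {N1+(. p), N1+(. p .)} for the pattern interval [sp,ep], pat_len = |p|
std::pair<uint64_t, uint64_t>
front_counts(const cst_t& cst, uint64_t sp, uint64_t ep, size_t pat_len)
{
    const auto& csa = cst.csa;
    const auto& wt  = csa.wavelet_tree;              // the BWT stored as a wavelet tree

    wt_t::size_type k = 0;                           // number of distinct symbols in BWT[sp..ep]
    std::vector<wt_t::value_type> cs(ep - sp + 1);
    std::vector<wt_t::size_type> rank_i(ep - sp + 1), rank_j(ep - sp + 1);
    wt.interval_symbols(sp, ep + 1, k, cs, rank_i, rank_j);

    uint64_t n1p_front      = k;                     // N_{1+}(. p)
    uint64_t n1p_front_back = 0;                     // N_{1+}(. p .)
    for (wt_t::size_type idx = 0; idx < k; ++idx) {
        auto c = cs[idx];
        // SA interval of c.p, i.e. the Weiner link taken with symbol c
        uint64_t lb = csa.C[csa.char2comp[c]] + rank_i[idx];
        uint64_t rb = csa.C[csa.char2comp[c]] + rank_j[idx] - 1;
        if (lb == rb) { n1p_front_back += 1; continue; }   // single occurrence
        auto v = cst.lca(cst.select_leaf(lb + 1), cst.select_leaf(rb + 1));
        // if c.p ends inside an edge there is only one right extension
        n1p_front_back += (cst.depth(v) > pat_len + 1) ? 1 : cst.degree(v);
    }
    return {n1p_front, n1p_front_back};
}
```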

Other Considerations

  • Special case when the pattern search ends in the middle of an edge of the CST
  • Special handling of start- and end-of-sentence tags, which can otherwise distort the counts
  • Ensure correctness by comparing against state-of-the-art systems (KenLM and SRILM)

Construction and Query Time

Query Time Breakdown

Future Work

  • Precomputing some of the counts is very cheap and speeds up query processing significantly
  • Alphabet partitioning for larger alphabets (requires interval-symbols support)
  • Already competitive with state-of-the-art implementations
  • Backward search is now the main cost factor
  • Many open problems in the MT field to which succinct data structures could be applied
  • Easy to get started, as test collections and software are freely available