Compressed Suffix Trees for Machine Translation
Matthias Petri
The University of Melbourne
joint work with Ehsan Shareghi, Gholamreza Haffari and Trevor Cohn
Machine Translation
Use computational resources to translate a sentence from a source language to a target language.
Resources
- Parallel Text Corpus
  - Wikipedia headlines in multiple languages
  - European Parliament or UN transcripts
- Large Text Corpus in the target language
- Lots of data sets available for free: http://www.statmt.org/wmt15/translation-task.html
Translation Process
Given a Source sentence, find a translation (the Target) that has the highest probability given the source sentence:
$$P(\text{Target} \mid \text{Source}) = \frac{P(\text{Target}) \times P(\text{Source} \mid \text{Target})}{P(\text{Source})}$$
P(Source | Target)
Find probable candidates using the Parallel Corpus
Language A           | Language B          | Word Alignment
lo que [X,1] que los | the [X,1] that the  | 0-0 1-2 3-2 4-3
lo que [X,1] que los | this [X,1] that the | 0-0 3-2 4-3
lo que [X,1] que los | which [X,1] the     | 0-0 1-0 4-2
P(Target)
For each candidate sentence, use the text statistics of the monolingual corpus (a Language Model) to determine the "best" translation
q-gram Language Modeling
Assign a probability to a sequence of words $w_1^n$, indicating how likely the sequence is, given a language:
$$P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_{i-q+1}^{i-1})$$
where $w_1^n = S[1..n]$ and $w_i^j = S[i..j]$.
Example (q = 4)
P(the old night keeper keeps the keep in the town)
  = P(the) × P(old | the) × P(night | the old) × P(keeper | the old night)
  × P(keeps | old night keeper) × P(the | night keeper keeps)
  × P(keep | keeper keeps the) × P(in | keeps the keep)
  × P(the | the keep in) × P(town | keep in the)
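As a concrete rendering of the chain rule above, here is a small sketch (my illustration, not part of the talk) that scores a sentence with unsmoothed maximum-likelihood estimates $P(w \mid \text{context}) = C(\text{context}\, w)/C(\text{context})$. The `CountTable` type and both function names are hypothetical stand-ins for the corpus statistics that the CST provides later; Kneser-Ney smoothing, introduced next, replaces the raw ratio.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical count table mapping a token sequence to its corpus frequency;
// the empty sequence is assumed to map to the total token count. In the
// approach described in this talk, such counts come from CST queries instead.
using CountTable = std::map<std::vector<std::string>, uint64_t>;

// Unsmoothed MLE estimate: P(w | context) = C(context w) / C(context).
double mle_prob(const CountTable& C, std::vector<std::string> context,
                const std::string& w) {
    auto ctx = C.find(context);
    if (ctx == C.end() || ctx->second == 0) return 0.0;
    context.push_back(w);
    auto full = C.find(context);
    return (full == C.end()) ? 0.0 : double(full->second) / double(ctx->second);
}

// Chain rule with order-q contexts: P(w_1^n) = prod_i P(w_i | w_{i-q+1}^{i-1}).
double sentence_prob(const CountTable& C,
                     const std::vector<std::string>& words, size_t q) {
    double p = 1.0;
    for (size_t i = 0; i < words.size(); ++i) {
        size_t start = (i + 1 > q) ? i + 1 - q : 0; // truncate at sentence start
        std::vector<std::string> ctx(words.begin() + start, words.begin() + i);
        p *= mle_prob(C, ctx, words[i]);
    }
    return p;
}
```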
Kneser-Ney Language Modeling
Highest level:
$$P(w_i \mid w_{i-q+1}^{i-1}) = \frac{\max\big(C(w_{i-q+1}^{i}) - D_q,\, 0\big)}{C(w_{i-q+1}^{i-1})} + \frac{D_q\, N_{1+}(w_{i-q+1}^{i-1}\,\bullet)}{C(w_{i-q+1}^{i-1})}\, P(w_i \mid w_{i-q+2}^{i-1})$$
Middle levels ($1 < k < q$):
$$P(w_i \mid w_{i-k+1}^{i-1}) = \frac{\max\big(N_{1+}(\bullet\, w_{i-k+1}^{i}) - D_k,\, 0\big)}{N_{1+}(\bullet\, w_{i-k+1}^{i-1}\,\bullet)} + \frac{D_k\, N_{1+}(w_{i-k+1}^{i-1}\,\bullet)}{N_{1+}(\bullet\, w_{i-k+1}^{i-1}\,\bullet)}\, P(w_i \mid w_{i-k+2}^{i-1})$$
Lowest level:
$$P(w_i) = \frac{N_{1+}(\bullet\, w_i)}{N_{1+}(\bullet\,\bullet)}$$
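The recursion maps onto a handful of count queries. Below is a sketch (my illustration under stated assumptions, not the talk's code): the `Counts` oracle and its member names are hypothetical labels for the statistics defined on the following Terminology slide, which the CSTs answer at query time.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

using Pattern = std::vector<std::string>;

// Assumed count oracles; in this work they are answered by CST queries.
//   count(p)     = C(p)                right_ext(p) = N_{1+}(p .)
//   left_ext(p)  = N_{1+}(. p)         both_ext(p)  = N_{1+}(. p .)
struct Counts {
    uint64_t count(const Pattern&) const;
    uint64_t left_ext(const Pattern&) const;
    uint64_t right_ext(const Pattern&) const;
    uint64_t both_ext(const Pattern&) const;
    uint64_t unique_bigrams() const;    // N_{1+}(. .)
    double   discount(size_t k) const;  // D_k
};

// Interpolated Kneser-Ney, following the three levels above.
// Call as: pkn(idx, {w_{i-q+1}, ..., w_i}, true).
double pkn(const Counts& idx, const Pattern& ngram, bool highest_level) {
    size_t k = ngram.size();
    if (k == 1) {  // lowest level: continuation probability of w_i
        Pattern w(1, ngram.back());
        return double(idx.left_ext(w)) / double(idx.unique_bigrams());
    }
    Pattern context(ngram.begin(), ngram.end() - 1);  // w_{i-k+1}^{i-1}
    Pattern shorter(ngram.begin() + 1, ngram.end());  // drop the oldest word
    // Highest level uses raw counts; middle levels use continuation counts.
    double num = highest_level ? double(idx.count(ngram))
                               : double(idx.left_ext(ngram));
    double den = highest_level ? double(idx.count(context))
                               : double(idx.both_ext(context));
    if (den == 0.0)  // unseen context: fall through to the lower order
        return pkn(idx, shorter, false);
    double D     = idx.discount(k);
    double gamma = D * double(idx.right_ext(context)) / den;
    return std::max(num - D, 0.0) / den + gamma * pkn(idx, shorter, false);
}
```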
Terminology
- $C(w_i^j)$: standard count of $S[i..j]$ in the corpus
- $N_{1+}(\bullet\,\alpha) = |\{w : C(w\alpha) > 0\}|$: the number of observed word types preceding the pattern $\alpha = w_i^j$
- $N_{1+}(\alpha\,\bullet) = |\{w : C(\alpha w) > 0\}|$: the number of observed word types following the pattern $\alpha$
- $N_{1+}(\bullet\,\alpha\,\bullet)$: the number of unique contexts (left and right) in which $\alpha$ occurs
- $D_k$: discount parameter for recursion level $k$, computed at index construction time
- $N_{1+}(\bullet\,\bullet)$: the number of unique bigrams
- $N_{1}(\alpha\,\bullet) = |\{w : C(\alpha w) = 1\}|$: the number of observed word types that follow the pattern $\alpha = w_i^j$ exactly once
- $N_{2}(\alpha\,\bullet) = |\{w : C(\alpha w) = 2\}|$: the number of observed word types that follow the pattern $\alpha = w_i^j$ exactly twice
- $N_{3+}(\alpha\,\bullet) = |\{w : C(\alpha w) \geq 3\}|$: the number of observed word types that follow the pattern $\alpha$ at least three times
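For completeness (this is standard Kneser-Ney estimation, not spelled out on the slides): the discounts are typically set from global counts-of-counts at construction time,
$$D_k = \frac{n_1^k}{n_1^k + 2\, n_2^k}$$
where $n_1^k$ and $n_2^k$ denote the numbers of distinct $k$-grams occurring exactly once and exactly twice. With modified Kneser-Ney, each level instead uses three discounts $D_1, D_2, D_{3+}$, and the backoff mass of a context $\alpha$ becomes
$$\gamma(\alpha) = \frac{D_1 N_1(\alpha\,\bullet) + D_2 N_2(\alpha\,\bullet) + D_{3+} N_{3+}(\alpha\,\bullet)}{C(\alpha)}$$
which is where the last three statistics above enter.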
ARPA Files
ARPA-based language models precompute all q-gram probabilities and backoff weights and store them in a large lookup table (the ARPA file format).
Instead of precomputing, can we compute the probabilities on the fly using Compressed Suffix Trees?
Advantages
- No restriction on the recursion depth results in better probability estimates
- Current approaches prestore probabilities, which uses space exponential in q
- Construction of a CST is faster than precomputing all probabilities
- Can operate on word and character alphabets (the latter needs a larger q to be useful)

Use two CSTs: one over the text and one over the reversed text.
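A minimal construction sketch with SDSL (my illustration, not the talk's code; the file names are placeholders, and the corpus is assumed to be pre-parsed into a serialized `int_vector` of word ids):

```cpp
#include <sdsl/suffix_trees.hpp>
#include <algorithm>

using namespace sdsl;

// Word-level CST: a CSA based on an integer wavelet tree over the BWT.
using csa_type = csa_wt<wt_int<>, 32, 64, sa_order_sa_sampling<>,
                        isa_sampling<>, int_alphabet<>>;
using cst_type = cst_sct3<csa_type>;

int main() {
    // "corpus.sdsl" is a placeholder: a serialized int_vector of word ids.
    cst_type cst;
    construct(cst, "corpus.sdsl", 0);  // num_bytes = 0 -> int_vector input

    // Second CST over the reversed word sequence; searching the reversed
    // pattern in it yields the left-extension counts N_{1+}(. alpha).
    int_vector<> text;
    load_from_file(text, "corpus.sdsl");
    std::reverse(text.begin(), text.end());
    store_to_file(text, "corpus_rev.sdsl");
    cst_type cst_rev;
    construct(cst_rev, "corpus_rev.sdsl", 0);
}
```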
Kneser-Ney Language Modeling
Perform backward search, extending the pattern one word at a time, and keep track of the dependencies between the counts needed at each level of the recursion.
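For instance, a single backward-search pass over the pattern already produces the suffix-array ranges of every lower-order q-gram. A sketch (illustrative; `context_ranges` is my naming, while `backward_search` is the actual SDSL function):

```cpp
#include <sdsl/suffix_arrays.hpp>
#include <utility>
#include <vector>

using namespace sdsl;

// For a pattern w_1..w_m, backward search processes w_m, w_{m-1}, ..., w_1;
// after each step, [sp, ep] is the range of a longer suffix w_k..w_m of the
// pattern -- exactly the lower-order q-grams the KN recursion depends on.
template <class t_csa>
std::vector<std::pair<uint64_t, uint64_t>>
context_ranges(const t_csa& csa, const std::vector<uint64_t>& pattern) {
    std::vector<std::pair<uint64_t, uint64_t>> ranges;
    uint64_t sp = 0, ep = csa.size() - 1;
    for (auto it = pattern.rbegin(); it != pattern.rend(); ++it) {
        if (backward_search(csa, sp, ep, *it, sp, ep) == 0)
            break;                    // pattern not present: stop extending
        ranges.emplace_back(sp, ep);  // range of w_k..w_m, k shrinking
    }
    return ranges;
}
```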
Computing $N_{1+}(\bullet\, w_i^j\, \bullet)$
Naive approach (expensive!):
- Find the set $S$ of all symbols preceding $w_i^j$
- For each $\alpha \in S$, determine $N_{1+}(\alpha\, w_i^j\, \bullet)$
- Sum over all $\alpha$
Only use one wavelet-tree-based CST over the text:
- $N_{1+}(w_i^j\,\bullet)$: computed as before using the CST
- $N_{1+}(\bullet\, w_i^j)$: for the range $[sp, ep]$ of the pattern, use the wavelet tree to visit all leaves of $BWT[sp, ep]$ (interval symbols in SDSL)
- $N_{1+}(\bullet\, w_i^j\,\bullet)$: visiting all leaves of $BWT[sp, ep]$ implicitly computes all Weiner links of the CST node corresponding to $w_i^j$, since the ranges $[sp, ep]$ of all patterns $\bullet\, w_i^j$ fall out of the wavelet tree traversal
- Determine the number of children of the nodes found this way to obtain $N_{1+}(\bullet\, w_i^j\,\bullet)$; see the sketch after this list
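A sketch of this traversal in SDSL (illustrative and simplified; `left_extensions` and the type aliases are my naming, while `interval_symbols`, `node`, `depth` and `degree` are actual SDSL operations):

```cpp
#include <sdsl/suffix_trees.hpp>
#include <vector>

using namespace sdsl;
using csa_type = csa_wt<wt_int<>, 32, 64, sa_order_sa_sampling<>,
                        isa_sampling<>, int_alphabet<>>;
using cst_type = cst_sct3<csa_type>;

// Given the range [sp, ep] of a pattern alpha of length pat_len, compute
// N_{1+}(. alpha) and N_{1+}(. alpha .) with one wavelet tree traversal.
void left_extensions(const cst_type& cst, uint64_t sp, uint64_t ep,
                     uint64_t pat_len, uint64_t& n1p_left, uint64_t& n1p_both) {
    const auto& wt = cst.csa.wavelet_tree;
    uint64_t k = 0;                      // number of distinct symbols found
    std::vector<uint64_t> cs(wt.sigma);  // the symbols themselves
    std::vector<uint64_t> rank_lo(wt.sigma), rank_hi(wt.sigma);
    // Enumerate the distinct symbols c in BWT[sp..ep] together with their
    // ranks before sp and before ep+1 (interval_symbols visits exactly the
    // wavelet tree leaves covering the range).
    wt.interval_symbols(sp, ep + 1, k, cs, rank_lo, rank_hi);
    n1p_left = 0;
    n1p_both = 0;
    for (uint64_t i = 0; i < k; ++i) {
        if (cs[i] == 0) continue;  // sentinel; sentence tags need special care
        ++n1p_left;                // one more distinct left extension c
        // Weiner link via LF: the ranks give the range of the pattern c.alpha.
        uint64_t c  = cst.csa.char2comp[cs[i]];
        uint64_t lb = cst.csa.C[c] + rank_lo[i];
        uint64_t rb = cst.csa.C[c] + rank_hi[i] - 1;
        if (lb == rb) { ++n1p_both; continue; }  // single occurrence
        auto v = cst.node(lb, rb);               // locus of c.alpha in the CST
        // A mid-edge locus has exactly one right extension; an explicit node
        // has one per child.
        n1p_both += (cst.depth(v) > pat_len + 1) ? 1 : cst.degree(v);
    }
}
```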
Other Considerations
- Special case when the pattern search ends in the middle of an edge in the CST
- Special handling of start- and end-of-sentence tags, which otherwise distort the counts
- Correctness is ensured by comparing against state-of-the-art systems (KenLM and SRILM)
Construction and Query Time
Query Time Breakdown
Future Work
- Precomputing some of the counts is very cheap and speeds up query processing significantly
- Alphabet partitioning for larger alphabets (requires interval symbols)
- Already competitive with state-of-the-art implementations
- Backward search is now the main cost factor
- Lots of open problems in the MT field to which succinct structures could be applied
- Easy entry, as test collections and software are freely available