Compressed Suffix Trees for Machine Translation
					
					Matthias Petri  The University of Melbourne 
					joint work with Ehsan Shareghi, Gholamreza Haffari and Trevor Cohn
					
					
				
				
					Machine Translation
					
						Use computational resources to translate a sentence from a source language to a target language.
					
				
				
					Resources
					
						- Parallel Text Corpus
						
							- Wikipedia headlines in multiple languages
- European Parliament or UN transcripts
 
- Large Text Corpus in the target language
- Lots of data sets available for free http://www.statmt.org/wmt15/translation-task.html
						
Translation Process
					 Given a Source sentence find a translation (the Target) that has the highest probability given the source sentence:
		             
		                $$P(\text{ Target }| \text{ Source }) = \frac{P(Target) \times P( \text{ Source } |  \text{ Target })}{P(Source)}$$ 
		             
		             
		             
		             	P(Source | Target)
		             	Find probable candidates using the Parallel Corpus
					
						
							
								| Language A | Language B | Word Alignment | 
						
						
							
								| lo que [X,1] que los | the [X,1] that the | 0-0 1-2 3-2 4-3 | 
							
								| lo que [X,1] que los | this [X,1] that the | 0-0 3-2 4-3 | 
							
								| lo que [X,1] que los | which [X,1] the | 0-0 1-0 4-2 | 
						
					
		             
		             
		             	P(Target)
		             	For each candidate sentence, use the text statistics of the monolingual corpus (a Language Model) to determine the "best" translation
		             
				
				
					
						q-gram Language Modeling
						
							Assign a probability to a sequence of words \(w^n_1\)
							indicating how likely the sequence is, given a language:
						
						$$
						P(w^n_1) = \prod_{i=1}^n P(w_i|w_{i-q+1}^{i-1})
						$$
						
where
						$$
							w^n_1 = S[1..n] \text{   and   } w^j_i = S[i..j]
						$$
					
					
						Example (q=3)
						
						$$
						P(\text{ the old night keeper keeps the keep in the town }) = \\
						P(\text{ the} ) \\
						\times P(\text{ old }|\text{ the }) \\
						\times P(\text{ night }|\text{ the old }) \\
						\times P(\text{ keeper }|\text{ the old night }) \\
						\times P(\text{ keeps }|\text{ old night keeper }) \\
						\times P(\text{ the }|\text{ night keeper keeps }) \\
						\times P(\text{ keep }|\text{ keeper keeps the }) \\
						\times P(\text{ in }|\text{ keeps the keep }) \\
						\times P(\text{ the }|\text{ the keep in })  \\
						\times P(\text{ town }|\text{ keep in the }) \\
						$$
						
					
				
				
					Kneser-Ney Language Modeling
					Highest Level:
					$$
					P(w_i|w_{i-q+1}^{i-1})=\frac{max(C(w_{i-q+1}^i))−D_q,0)}{C(w_{i-q+1}^{i-1})} + \frac{D_q N_{1+}(w_{i-q+1}^{i-1} \bullet )}{C(w_{i-q+1}^{i-1})} P(w_i|w_{i-n+2}^{i-1})
					$$
					
					Middle Level (\(1 < k < q\)):
					$$
					P(w_i|w_{i-k+1}^{i-1})=\frac{max(N_{1+}(\bullet w_{i-k+1}^i))−D_k,0)}{N_{1+}(\bullet w_{i-k+1}^{i-1}\bullet)} + \frac{D_k N_{1+}(w_{i-k+1}^{i-1} \bullet )}{N_{1+}(\bullet w_{i-k+1}^{i-1}\bullet)} P(w_i|w_{i-k+2}^{i-1})
					$$
					
					Lowest Level:
					$$
					P(w_i) = \frac{N_{1+}(\bullet w_{i})}{N_{1+}(\bullet \bullet)}
					$$
					
				
				
					
					Terminology
					
						- \(C(w_{i}^j)\): Standard count of \(S[i..j]\) in the corpus
- \(N_{1+}(\bullet \alpha) = |\{w: c(w \alpha)>0\}|\) 
is the number of observed word types preceding the pattern \(\alpha = w_{i}^j\)
- \(N_{1+}(\alpha \bullet ) = |\{w: c(\alpha w)>0\}|\) 
is the number of observed word types following the pattern \(\alpha\)
- \(N_{1+}(\bullet \alpha \bullet)\): the number of unique contexts (left and right) where \(\alpha \) occurs
- \(D_i\): Discount parameter for the recursion computed at index construction time
- \(N_{1+}(\bullet \bullet)\): Number of unique bi-grams
						- \(N_{1}(\alpha \bullet) = |\{w: c(\alpha w)==1\}|\) 
is the number of observed word types preceding the pattern \(\alpha = w_{i}^j\) that occur exactly once
- \(N_{2}(\alpha  \bullet) = |\{w: c(\alpha w)==2\}|\) 
is the number of observed word types preceding the pattern \(\alpha = w_{i}^j\) that occur exactly two times
- \(N_{3+}(\alpha \bullet ) = |\{w: c(\alpha w)\geq3\}|\) 
is the number of observed word types following the pattern \(\alpha\) that occur at least 3 times
ARPA Files
				 
				
				
				ARPA based Language Models
				 
				
				
				
					Instead of precomputation can we compute probabilities on the fly using Compressed Suffix Trees?
					
				
				
					Advantages
					
						- No restriction in recursion depth results in better probability estimates
						
- 
						Current approaches prestore probabilities which uses space exponential in q 
						
- 
						Construction of a CST is faster than precomputing all probabilities
						
- 
						Can operate on words and character alphabets (need larger q to be useful)
						
Use 2 CSTs over the text and reverse text
					 
				
				
					Kneser-Ney Language Modeling
				 
					
						Perform backward search and keep track of dependencies
					
				
				
					Computing \(N_{1+}(\bullet w_{i}^j \bullet)\)
					Naive Approach: (Expensive!)
					
						- Find the set S of all symbols preceding \(w_{i}^j\)
- For each \(\alpha \in S\) determine \(N_{1+}(\alpha w_{i}^j \bullet)\)
- Sum over all \(\alpha\)
Only use one wavelet tree based CST over the text
					
						- \(N_{1+}(w_{i}^j \bullet)\) computed as before using the CST
						
- 
						\(N_{1+}( \bullet w_{i}^j)\): For the range \([sp,ep]\) of the pattern use the wavelet tree to visit all leaves in \(BWT[sp,ep]\). (Interval Symbols in SDSL)
						
- 
						\(N_{1+}( \bullet w_{i}^j  \bullet )\): Visiting all leaves in \(BWT[sp,ep]\) implicitly computes all Weiner Links of the CST node corresponding to \(w_{i}^j\) as all [sp,ep] ranges of all \(\bullet w_{i}^j\) are computed during the wavelet tree traversal.
						
- 
						Determine number of children of the found nodes to compute \(N_{1+}( \bullet w_{i}^j  \bullet )\).
						
				
					Other Considerations
					
						- Special case when pattern search ends in the middle of an edge in the CST
						
- 
						Special handling of start and end of sentence tags which can mess up the correct counts
						
- 
						Ensure correctness by comparing to state-of-the-art systems (KenLM and SRILM)
						
				
				Construction and Query Time
					 
				
				
				Query Time Breakdown
					 
				
				
					Future Work
					
						- Precomputing some of the counts is very cheap and speeds up query processing significantly
- Alphabet-Partitioning for larger alphabets (requires interval symbols)
- Already competitive to state-of-the-art implementations
- Backward Search now main cost factor
- Lots of open problems in the MT field where succinct structures could be applied to
- Easy entry as test collections and software are freely available