on using very large target vocabulary for neural machine translation
Overview
This is a summary of the paper by S. Jean, K. Cho, R Memisevic, and Y. Bengio entitled "On Using Very Large Target Vocabulary for Neural Machine Translation" <ref>S. Jean, K. Cho, R Memisevic, and Y. Bengio. "On Using Very Large Target Vocabulary for Neural Machine Translation", 2015.</ref> The paper presents the application of importance sampling for neural machine translation with a very large target vocabulary. Despite the advantages of neural networks in machine translation over statistical machine translation systems such as the phrasebased system, they suffer from some technical problems. Most importantly, they are limited to a small vocabulary because of complexity and number of parameters that have to be trained as total vocabulary increases. The output of a RNN used for machine translation will have as many dimensions as there are words in the vocabulary. If the total vocabulary consists of hundreds of thousand of words, then the RNN must compute a very expensive softmax on the output vector at each time step and estimate the probability of each word as the next word in the sequence. Therefore, the number of parameters in the RNN will also grow very large in such cases given that number of weights between the hidden layer and output layer will be equal to the product of the number of units in each layer. For a nontrivial sized hidden layer, a large vocabulary could result in tens of millions of model parameters purely associated with the hiddentooutput mapping. In practice, researchers who apply RNNs to machine translation have avoided this problem by restricting the model vocabulary to only include some shortlist of words in the target language. Words not in this shortlist are treated as unknown by the model and assigned a special 'UNK' token. This technique understandably impairs translation performance when the target sentence includes a large number of words not present in the vocabulary such as names.
In this paper Jean and his colleagues aim to solve this problem by proposing a training method based on importance sampling which uses a large target vocabulary without increasing training complexity. The proposed algorithm demonstrates better performance without losing efficiency in time or speed. The algorithm is tested on two machine translation tasks (English [math]\rightarrow[/math] German, and English [math]\rightarrow[/math] French), and it achieved the best performance out of any previous single neural machine translation (NMT) system on the English [math]\rightarrow[/math] French translation task.
Methods
Recall that the classic neural machine translation system works through an encoderdecoder network. The encoder reads the source sentence x and encode it into a sequence of hidden states of h where [math]h_t=f(x_t,h_{t1})[/math]. In the decoder step, another neural network generates the translation vector of y based on the encoded sequence of hidden states h: [math]p(y_t\,\,y_{\lt t},x)\propto \exp\{q(y_{t1}, z_t, c_t)\}[/math] where [math]\, z_t=g(y_{t1}, z_{t1}, c_t)[/math] and [math]\, c_t=r(z_{t1}, h_1,..., H_T)[/math]
The objective function which have to be maximized represented by [math]\theta=\arg\max\sum_{n=1}^{N}\sum_{t=1}^{T_n}\log p(y_t^n\,\,y_{\lt t}^n, x^n)[/math]
where [math](x^n, y^n)[/math] is the nth training pair of sentence, and [math]T_n[/math] is the length of nth target sentence [math]y^n[/math]. The proposed model is based on specific implementation of neural machine translation that uses an attention mechanism, as recently proposed in <ref> Bahdanau et al.,NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE, 2014 </ref>. In that the encoder is implemented by a bidirectional recurrent neural network,[math]h_t=[h_t^\leftarrow; h_t^\rightarrow][/math]. The decoder, at each time, computes the context vector [math]c_t[/math] as a convex sum of the hidden states [math](h_1,...,h_T)[/math] with the coefficients [math](\alpha_1,...,\alpha_T)[/math] computed by
[math]\alpha_t=\frac{\exp\{a(h_t, z_t)\}}{\sum_{k}\exp\{a(h_t, z_t)\}}[/math] where a is a feedforward neural network with a single hidden layer. Then the probability of the next target word is
[math]p(y_t\ y_{\lt t}, x)=\frac{1}{Z} \exp\{W_t^T\phi(y_{t1}, z_t, c_t)+b_t\}[/math]. In that [math]\phi[/math] is an affine transformation followed by a nonlinear activation, [math]w_t[/math] and [math]b_t[/math] are the target word vector and the target word bias, respectively. Z is the normalization constant computed by
[math] Z=\sum_{k:y_k\in V}\exp\left(W_t^T\phi(y_{t1}, z_t, c_t)+b_t\right)[/math] where V is set of all the target words.
The dot product between the feature [math]\phi(y_{t1}, z_t, c_t)[/math] and [math]w_t[/math] is required to be done for all words in target vocabulary and is computationally complex and time consuming. Furthermore, the memory requirements grow linearly with respect to the number of target word. This has been a major hurdle for neural machine translations. Recent approaches use a shortlist of 30,000 to 80,000 most frequent words. This makes training more feasible but also has problems of its own. For example, the model degrades heavily if the translation of the source sentence requires many words that are not included in the shortlist. The approach of this paper uses only a subset of sampled target words as an align vector to maximize Eq (6), instead of all the likely target words. The most naïve way to select a subset of target words is selection of K most frequent words. However, This skipping of words from training processes is in contrast with using a large vocabulary, because practically we removed a bunch of words from target dictionary. Jean et al., proposed using an existing word alignment model to align the source and target words in the training corpus and build a dictionary. With the dictionary, for each source sentence, we construct a target word set consisting of the Kmost frequent words (according to the estimated unigram probability) and, using the dictionary, at most [math]k\prime[/math] likely target words for each source word. K and [math]k\prime[/math] may be chosen either to meet the computational requirement or to maximize the translation performance on the development set.
In order to avoid the growing complexity of computing the normalization constant, the authors proposed to use only a small subset [math]v\prime[/math] of the target vocabulary at each update<ref>
Bengio and Sen´ et al, Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model ,IEEEXplor, 2008
</ref>.
Let us consider the gradient of the log probability of the output in conditional probability of [math]y_t[/math]. The gradient is composed of a positive and negative part:
[math]\bigtriangledown=\log p(y_tY_{\lt t}, x_t)=\bigtriangledown \mathbf\varepsilon(y_t)\sum_{k:y_k\in V} p(y_ky_{\lt t}, x) \bigtriangledown \mathbf\varepsilon(y_t) [/math]
where the energy [math]\mathbf\varepsilon[/math] is defined as [math]\mathbf\varepsilon(y_i)=W_j^T\phi(y_{j1}, Z_j, C_j)+b_j[/math]. The second term of gradiant is in essence the expected gradiant of the energy as [math]\mathbb E_P[\bigtriangledown \epsilon(y)][/math] where P denotes [math]p(yy_{\lt t}, x)[/math].
The idea of the proposed approach is to approximate this expectation of the gradient by importance sampling with a small number of samples. Given a predefined proposal distribution Q and a set [math]v\prime[/math] of samples from Q, we approximate the expectation with
[math]\mathbb E_P[\bigtriangledown \epsilon(y)][/math] where P denotes [math]p(yy_{\lt t}, x)\approx \sum_{k:y_k\in V\prime} \frac{w_k}{\sum_{k\prime:y_k\prime\in V\prime}w_k\prime}\epsilon(y_k)[/math] where [math]\,w_k=exp{\epsilon(y_k)log Q(y_k)}[/math]
In practice, the training corpus is partitioned and a subset [math]v\prime[/math] of the target vocabulary is defined for each partition prior to training. Before training begins, each target sentence in the training corpus is sequentially examined and accumulate unique target words until the number of unique target words reaches the predefined threshold τ . The accumulated vocabulary will be used for this partition of the corpus during training. This processes is repeated until the end of the training set is reached.
In this approach the alignments between the target words and source locations via the alignment model is obtained. This is useful when the model generated an Un token. Once a translation is generated given a source sentence, each Un may be replaced using a translationspecific technique based on the aligned source word. The authors in the experiment, replaced each Un token with the aligned source word or its most likely translation determined by another word alignment model. The proposed approach was evaluated in English>French and EnglishGerman translation. The neural machine translation model was trained by the bilingual, parallel corpora made available as part of WMT’14. The data sets were used for English to French were European v7, Common Crawl, UN, News Commentary, Gigaword. The data sets for EnglishGerman were Europarl v7, Common Crawl, News Commentary.
The models were evaluated on the WMT’14 test set (newstest 2014)3 , while the concatenation of newstest2012 and newstest2013 is used for model selection (development set). Table 1 presents data coverage w.r.t. the vocabulary size, on the target side.
Setting
As a baseline for English→French translation, the authors used the RNNsearch model proposed by (Bahdanau et al., 2014), with 30,000 source and target words and another RNNsearch was trained for English→German translation with 50,000 source and target words. Using the proposed approach another set of RNNsearch models with much larger vocabularies of 500,000 source and target words was trained for each language pair. Different shortlist sizes used during training: 15,000 and 30,000 for English→French, and 15,000 and 50,000 for English→German. The best performance on the development set were evaluated and reported every twelve hours. For both language pairs, new models were trained with shortlist size of 15, 000 and 50, 000 by reshuffling the data set at the beginning of each epoch. While this causes a nonnegligible amount of overhead, such a change allows words to be contrasted with different sets of words for each epoch. The beam search was used to generate a translation given a source. The authors keep a set of 12 hypotheses and normalize probabilities by the length of the candidate sentences which was chosen to maximize the performance on the development set, for K ∈ {15k, 30k, 50k} and K0 ∈ {10, 20}. They used a bilingual dictionary to accelerate decoding and to replace unknown words in translations.
Results
The results for English> French translation obtained by the trained models with very large target vocabularies compared with results of previous models reported in Table below.
Method  RNNsearch  RNNsearchLV  Phrasebased SMT (cHO et al)  Phrasebased SMT (Durrani et al)  

BASIC NMT  29.97 (26.58)  32.68 (28.76)  30.6  33.3  37.03 
+ Candidate List
+ UNK Replace 
33.08 (29.08)  33.36 (29.32)
34.11 (29.98) 

33.1 
33.3  37.03 
+ Reshuffle (tau=50)    34.6 (30.53)    33.3  37.03 
+ Ensemble    37.19 (31.98)  37.5  33.3  3703 
And the results for English>German translation in Table below.
Method  RNNsearch  RNNsearchLV  Phrasebased SMT 

BASIC NMT  16.46 (17.13)  16.95 (17.85)  20.67 
+ Candidate List
+ UNK Replace 
18.97 (19.16)  17.46 (18.00)
18.89 (19.03) 
20.67 
+ Reshuffle (tau=50)    19.4  20.67 
+ Ensemble    21.59  20.67 
It is clear that the RNNsearchLV outperforms the baseline RNNsearch. In the case of the English→French task, RNNsearchLV approached the performance level of the previous best single neural machine translation (NMT) model, even without any translation specific techniques. With these, however, the RNNsearchLV outperformed it. The performance of the RNNsearchLV is also better than that of a standard phrasebased translation system. For English→German, the RNNsearchLV outperformed the baseline before unknown word replacement, but after doing so, the two systems performed similarly. A higher large vocabulary singlemodel performance is achieved by reshuffling the data set. In this case, we were able to surpass the previously reported best translation result on this task by building an ensemble of 8 models. With τ = 15, 000, the RNNsearchLV performance worsened a little, with best BLEU scores, without reshuffling, of 33.76 and 18.59 respectively for English→French and English→German.
The timing information of decoding for different models were presented in Table below. While decoding from RNNsearchLV with the full target vocabulary is slowest, the speed substantially improves if a candidate list for decoding each translation is used.
Method  CPU i74820k  GPU GTX TITAN black 

RNNsearch  0.09 s  0.02 s 
RNNsearchLV  0.80 s  0.25 s 
RNNsearchLV
+Candidate list 
0.12 s  0.0.05 s 
The influence of the target vocabulary when translating the test sentences by using the union of a fixed set of 30, 000 common words and (at most) K0 likely candidates for each source word was evaluated for English→French with size of 30, 000. The performance of the system is comparable to the baseline when Uns not replaced, but there is not as much improvement when doing so. The authors found that K is inversely correlated with t.
Conclusion
Using the importance sampling an approach was proposed to be used in machine translation with a large target vocabulary without any substantial increase in computational complexity. The BLUE values for the proposed model showed translation performance comparable to the stateoftheart translation systems on both the English→French task and English→German task. On English→French and English→German translation tasks, the neural machine translation models trained using the proposed method performed as well as, or better than, those using only limited sets of target words, even when replacing unknown words.
Bibliography
<references />