Sequence to Sequence Learning with Neural Networks
Ilya Sutskever, Oriol Vinyals, Quoc V. Le
The paper applies deep neural networks in an end-to-end approach to sequence learning that makes minimal assumptions about sequence structure. A multilayered LSTM maps the input sequence to a vector of fixed dimensionality, and another deep LSTM decodes the target sequence from that vector. The LSTM learns phrase and sentence representations that are sensitive to word order and fairly invariant to voice (active or passive). Reversing the order of the words in the source sentences, but not in the target sentences, markedly improves the LSTM's performance: it introduces many short-term dependencies between the source and target sentences, which makes the optimization problem easier.
The main contribution of the paper is this reversal trick: instead of mapping the source sentence a1, b1, c1 to the target a2, b2, c2 in the usual order, the LSTM maps the reversed source c1, b1, a1 to a2, b2, c2, where the intended correspondence is {(a1, a2), (b1, b2), (c1, c2)}. Reversing puts a1 right next to a2 and b1 close to b2, and this proximity makes it much easier for SGD to "establish communication" between the input and the output. The system also does well on long sentences, which is likely a consequence of reversing the source (but not the target) sentences in both the training and test sets: the short-term dependencies this introduces make the optimization easier, so SGD can learn LSTMs that handle long sentences.
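The preprocessing itself is trivial. The sketch below (our illustration; the function name is not from the paper) reverses the source side of a training pair and leaves the target side unchanged, which is all the trick requires at both training and test time:

```python
def make_training_pair(source_tokens, target_tokens):
    """Reverse the source sentence; leave the target sentence as-is."""
    return list(reversed(source_tokens)), list(target_tokens)

# Example: source a1 b1 c1 with target a2 b2 c2
src, tgt = make_training_pair(["a1", "b1", "c1"], ["a2", "b2", "c2"])
print(src)  # ['c1', 'b1', 'a1']  -> a1 now sits right next to the start of the target
print(tgt)  # ['a2', 'b2', 'c2']
```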
Deep neural networks (DNNs) are powerful machine learning models because they can perform arbitrary parallel computation for a modest number of steps; as an example of their ability to learn intricate computation, they can sort N N-bit numbers using only two hidden layers of quadratic size. Their limitation is that they can only be applied to problems whose inputs and targets can be encoded as vectors of fixed dimensionality, whereas many important problems involve sequences whose lengths are not known in advance. Question answering is one such task: to make it solvable by such a model, a domain-independent method is needed that maps the sequence of words representing the question to the sequence of words representing the answer. The paper does this with an LSTM architecture that can handle general sequence-to-sequence problems, exploiting the LSTM's ability to read an input sequence of variable length and convert it into a fixed-dimensional vector representation. LSTMs can also learn from data with long-range temporal dependencies, which suits this setting, where there is a considerable time lag between the inputs and their corresponding outputs. One LSTM reads the input sequence one timestep at a time and converts it into a fixed-dimensional vector representation; a second LSTM then generates the output sequence from that vector. The second LSTM is essentially a recurrent neural network language model, except that it is conditioned on the input sequence.
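To make the setup concrete, here is a minimal PyTorch sketch of such an encoder-decoder (a simplified illustration, not the paper's implementation; the original used a much larger 4-layer LSTM and beam-search decoding, and all names and sizes below are ours):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """One LSTM encodes the (reversed) source into a fixed-size state;
    a second LSTM generates the target conditioned on that state."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512, layers=2):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encoder: read the whole source and keep only the final (h, c) state,
        # the fixed-dimensional summary of the input sequence.
        _, state = self.encoder(self.src_emb(src_ids))
        # Decoder: an LSTM language model over the target, conditioned on the
        # input sequence through its initial state.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)  # per-timestep logits over the target vocabulary

# Toy usage with random token ids (teacher forcing on the target side).
model = Seq2Seq(src_vocab=10000, tgt_vocab=10000)
src = torch.randint(0, 10000, (4, 12))  # batch of 4 reversed source sentences
tgt = torch.randint(0, 10000, (4, 9))   # corresponding target prefixes
logits = model(src, tgt)                # shape: (4, 9, 10000)
```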
The model presented is a relatively unoptimized, small-vocabulary neural network architecture, yet it outperforms a phrase-based SMT system. Because the task is machine translation and translations tend to be paraphrases of their source sentences, the translation objective encourages the LSTM to find sentence representations that capture meaning, so that sentences with similar meanings end up close together in the representation space. A qualitative evaluation shows that the model is aware of word order and fairly invariant to the active and passive voice.
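One simple way to probe such claims (our illustration; the paper itself visualizes the representations with a 2-D PCA projection) is to compare the encoder's fixed-dimensional sentence vectors directly, for example with cosine similarity, reusing the Seq2Seq sketch above:

```python
import torch.nn.functional as F

def sentence_vector(model, token_ids):
    """Fixed-dimensional representation of one tokenized sentence:
    the encoder's final hidden state, flattened."""
    _, (h, _) = model.encoder(model.src_emb(token_ids.unsqueeze(0)))
    return h.reshape(-1)

def representation_similarity(model, ids_a, ids_b):
    """Cosine similarity between the representations of two sentences."""
    return F.cosine_similarity(sentence_vector(model, ids_a),
                               sentence_vector(model, ids_b), dim=0).item()

# Expected pattern for a trained model: a sentence and its passive-voice
# paraphrase score high, while a scramble of the same words scores lower.
# (The trained model and the tokenized inputs are assumed, not shown.)
```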