Article by Paulien Lemay
How can deep learning help media makers create content tailored to the reader's preferences? In NewsTAPAS, a project on the personalisation of news, we discovered the need for stronger automatic summarization of articles. We therefore started working with Natural Language Processing (NLP) and compared recurrent neural networks (RNN), long short-term memory networks (LSTM) and Transformers, and their applications within media. In this article, we compile some insights and recommendations.
1. Recurrent neural networks (RNN)
Consider the following sentence.
Jef is thirsty, so he orders a beer.
When you read “he orders a beer”, you remember that he orders it because he is thirsty. If we want a neural network to learn such logical correlations in language, we want it to remember the first part of the sentence while it is reading the second part.
A language model needs to remember context. While the network is training, it will constantly correct itself. Maybe the first time, it will run its calculations and predict that Jef will order a coffee, a banana or even a refrigerator. When it realizes it is wrong, it will tweak the previous blocks, so that next time, it’s more likely to predict “beer” instead.
For each word, compare the predicted outcome with the actual outcome.
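The recurrence described above can be sketched in a few lines of NumPy. Everything here is illustrative: the dimensions are tiny and the embeddings and weight matrices are random stand-ins for parameters a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8  # toy embedding size
sentence = ["Jef", "is", "thirsty", "so", "he", "orders", "a", "beer"]

# Random stand-ins for learned parameters.
embed = {w: rng.normal(size=dim) for w in sentence}
W_xh = rng.normal(scale=0.1, size=(dim, dim))  # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(dim, dim))  # hidden-to-hidden weights

h = np.zeros(dim)  # hidden state: the network's running memory
for word in sentence:
    # Each step mixes the current word with everything read so far,
    # so "orders" is processed with "thirsty" still (faintly) in memory.
    h = np.tanh(embed[word] @ W_xh + h @ W_hh)
```

During training, the prediction made from `h` at each step is compared with the actual next word, and the error is pushed backwards through this chain of steps.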
But what if we change the text like this?
Jef and his friends went for a long bike ride. The sun was shining and there was little wind, so it was a perfect day to set some new records. Now they are thirsty, so they order beers.
An RNN is not very good at remembering things mentioned a long time ago. So by the time it reaches “they”, it may have lost the context that tells it who “they” are.
Metaphorically speaking, we could tell the two final blocks (“they” and “are”) that grammatical subjects are important. They will pass this information via the previous blocks all the way back to the start, so that every block in the chain learns that it cannot discard subject-related information. However, in this process, two things can happen.
Vanishing gradient: For every step backwards, the algorithm can forget some information. By the time we are all the way back at the start of this text, the algorithm might no longer remember that subjects matter. This means that at a certain point, the RNN just stops learning.
Exploding gradient: For every step backwards, the algorithm might amplify the error it made. By the time we get back to the start of the text, the correction might be far too large, and the algorithm overreacts.
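Both effects come from the same mechanics: backpropagating through many steps multiplies the error signal by roughly the same recurrent factor at every step. The factors 0.9 and 1.1 below are made up for the illustration, but they show how quickly repeated multiplication shrinks or blows up:

```python
def gradient_norm_after(steps, factor):
    # Backpropagating through `steps` timesteps multiplies the gradient
    # by roughly the same recurrent `factor` each step.
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

vanishing = gradient_norm_after(50, 0.9)  # shrinks toward 0: the RNN stops learning
exploding = gradient_norm_after(50, 1.1)  # blows up: the RNN overreacts
```

After 50 steps, a factor of 0.9 leaves almost nothing of the error signal, while a factor of 1.1 magnifies it more than a hundredfold.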
2. Long short-term memory (LSTM)
An LSTM selects which information should be kept. LSTMs are RNNs that have some sort of memory. For every step it takes, the network will decide what is important to keep in mind. So when a cell receives “long” from the previous cell and “bike” as a new word, it has to decide if it wants to keep both, forget one, or if it can forget about other things it has in its memory. This way, an LSTM is better than a regular RNN when it comes to remembering larger context.
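The keep-or-forget decisions above are made by gates. A minimal NumPy sketch of one LSTM step follows; the weights are random stand-ins (a real LSTM also learns biases, which are left out here for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
dim = 8
# One random stand-in weight matrix per gate; a real LSTM learns these.
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(2 * dim, dim)) for _ in range(4))

def lstm_step(x, h, c):
    z = np.concatenate([x, h])        # new word + previous short-term state
    f = sigmoid(z @ W_f)              # forget gate: what to drop from memory
    i = sigmoid(z @ W_i)              # input gate: what to write to memory
    o = sigmoid(z @ W_o)              # output gate: what to expose
    c = f * c + i * np.tanh(z @ W_c)  # cell state: the long-term memory
    h = o * np.tanh(c)                # hidden state: the short-term output
    return h, c

h = c = np.zeros(dim)
for _ in range(10):  # one step per word, e.g. "Jef and his friends went for a long bike ride"
    h, c = lstm_step(rng.normal(size=dim), h, c)
```

The cell state `c` is the “memory” the section describes: the forget gate scales old content down, the input gate writes new content in.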
3. Seq2Seq architecture
When we train an LSTM or an RNN, we (try to) end up with a model that knows about context. It knows how likely words are to appear together.
However, we can do more by chaining two LSTM/RNN networks together. Take translation, for example. We can feed everything our first LSTM (the encoder) has learned to a second LSTM (the decoder). In between, we add a tag that marks the change in language. This way, just like the model knows that “Jef” is likely to be followed by “beers”, it will know that “Jef and his friends… beers” + <FRENCH MARKER> is likely to be followed by “Jef et ses amis…”.
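Schematically, the encoder compresses the source sentence into one final state, which then seeds the decoder. The NumPy sketch below only passes vectors around (a real seq2seq model would predict actual words at each decoder step):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 8
W_enc = rng.normal(scale=0.1, size=(dim, dim))  # random stand-in weights
W_dec = rng.normal(scale=0.1, size=(dim, dim))

# Stand-in embeddings for the source sentence, e.g. "Jef and his friends ... beers".
source = [rng.normal(size=dim) for _ in range(6)]

h = np.zeros(dim)
for x in source:        # encoder: read the English sentence, word by word
    h = np.tanh((x + h) @ W_enc)
context = h             # the summary handed over at the language marker

y = context             # decoder: start from the encoder's summary...
target = []
for _ in range(6):      # ...and unroll the French side, state by state
    y = np.tanh(y @ W_dec)
    target.append(y)
```

The single `context` vector is the bottleneck of this architecture: the whole source sentence must squeeze through it, which is one motivation for the attention mechanism below.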
4. Transformer models
Transformer models offer a solution to the problem of processing words sequentially. In the previous examples, we could not start processing “friends” until we had processed “Jef and”. This is computationally inefficient and slows down the training of larger models.
Transformer models take larger chunks of information and process words in parallel. For example, they could take in “Jef and his friends”. When the Transformer receives this sentence piece, it calculates the relations between all words. So for example, it takes “his” and checks how strongly “his” is related to both the preceding and the following words. This means we are no longer dependent on word order to estimate correlations between words. This is called the attention mechanism.
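Concretely, this is scaled dot-product attention. In the NumPy sketch below (random, untrained projections), every word scores its relation to every other word in one matrix multiplication, with no sequential loop:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
dim = 8
words = ["Jef", "and", "his", "friends"]
X = rng.normal(size=(len(words), dim))  # one embedding per word, fed in all at once

# Random stand-ins for learned query/key/value projections.
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(dim, dim)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Every word attends to every word, in parallel. Row 2, for example,
# holds how strongly "his" is related to "Jef", "and", "his", "friends".
weights = softmax(Q @ K.T / np.sqrt(dim))
out = weights @ V  # each word's output is a weighted mix of all words
```

Note that nothing in the computation depends on position: “his” can attend just as easily to a word far behind it as to its direct neighbour.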
This attention mechanism is often part of a seq2seq model, which then looks like this:
For every sentence we feed the encoder, we calculate the attention multiple times. Between every two attention layers is a feed-forward layer, which prepares the data for the next attention layer. After going through the entire encoder, we end up with a language model that knows Jef and his context even better, since it is no longer dependent on word order.
If we want to do more with this knowledge, we can feed it to a decoder. Attention in the decoder is slightly different: there, we hide the next words, so that we force the decoder to work out for itself what the next word could be. This way, our decoder grows particularly good at next-word prediction.
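The “hiding” is done with a causal mask: before the softmax, each position's scores for all future positions are set to minus infinity, so those positions receive zero attention weight. A minimal NumPy sketch:

```python
import numpy as np

T = 4  # sequence length
scores = np.zeros((T, T))  # dummy attention scores, all equal for simplicity

# Mask out the future: position t may only look at positions 0..t.
future = np.triu(np.ones((T, T)), k=1).astype(bool)
scores[future] = -np.inf  # exp(-inf) = 0, so these get zero weight

e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
# Row 0 can only see itself; row 3 sees all four positions equally.
```

Because every position is forced to predict blind, one training sentence yields a next-word prediction exercise at every position at once.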
Decoder: Based on everything you know about word correlations, what is likely next?
As you can see, the decoder also has an encoder-decoder attention layer. This is where we compare the source (English) sentence to the target (French) sentence. If two words are tightly coupled in English, they are likely to be coupled in French as well. Just as in the encoder, we feed the output of the attention layers through a feed-forward layer to prepare it for the next attention sequence.
Different algorithms use this Transformer idea in different ways. Maybe you don’t need a translation, because you just want to know everything about Jef’s context. In that case, an encoder will do (e.g. BERT). Or maybe you just want to focus on predicting the next word in a sentence. In that case, a decoder might be enough (e.g. GPT-2).