Mehdi Kchouk

August 1, 2018

Deep Reinforcement Learning for Natural Language Processing

We just came back from the 56th Annual Meeting of the Association for Computational Linguistics, held this year in Melbourne, Australia, and I couldn’t help but notice one thing: almost all the papers presented this year use deep learning. So is DL really the final act in demystifying all the problems related to intelligence?

I cannot answer that, but as Mark Steedman, this year’s recipient of the ACL Lifetime Achievement Award, puts it:

“Algorithms like LSTM and RNN may work in practice. But do they work in theory? Can they learn all the syntactic stuff in the long tail? If they aren’t actually learning syntax, then we are in danger of giving up in the project of providing computational explanations of language and mind.”

At this point you might ask where I am going with this. Well, my point is simple: the NLP community needs some fresh air. This is where the first tutorial of the conference comes in: Deep Reinforcement Learning for NLP, presented by William Yang Wang (UC Santa Barbara), Jiwei Li, and Xiaodong He (JD AI Research).

So what is Deep Reinforcement Learning? An intuitive way of thinking about reinforcement learning is how Christopher Bishop puts it:

“Reinforcement learning is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward.”

DRL is just a way of using neural networks in RL systems to model the observed environment.

But before going through how Deep Reinforcement Learning or DRL is applied to NLP, here are two honorable mentions:

Notable mentions

Long Short-Term Memory as a Dynamically Computed Element-wise Weighted Sum (Levy et al., ACL 2018)
In this paper, the authors perform multiple ablation experiments on different components of an LSTM to show that the memory cell alone is largely responsible for the success of LSTMs in NLP. Additionally, removing every multiplicative recurrence from the memory cell itself leads to performance that falls within the standard deviation of the full LSTM on some tasks. This indicates that the additive recurrent connection in the memory cell is the most important element of the LSTM, which the authors refer to as a “cousin” of self-attention. As Omer Levy puts it:
“Maybe it’s not Attention is all you need but attention is really all you have”
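
To make the "weighted sum" idea concrete, here is a minimal sketch, not the authors’ code. It assumes the standard memory cell update c_t = f_t * c_(t-1) + i_t * c~_t with c_0 = 0, and it uses random gate values as stand-ins (in a real LSTM the gates are computed from the input and the previous hidden state). The function names are hypothetical.

```python
import numpy as np

def memory_cell_recurrent(i, f, c_tilde):
    """Run the usual additive recurrence c_t = f_t * c_{t-1} + i_t * c~_t."""
    c = np.zeros_like(c_tilde[0])
    for i_t, f_t, ct_t in zip(i, f, c_tilde):
        c = f_t * c + i_t * ct_t
    return c

def memory_cell_weighted_sum(i, f, c_tilde):
    """Compute the same final cell state as an element-wise weighted sum."""
    c = np.zeros_like(c_tilde[0])
    for j in range(len(c_tilde)):
        # Weight of step j: its input gate times every later forget gate.
        weight = i[j] * np.prod(f[j + 1:], axis=0)
        c += weight * c_tilde[j]
    return c

# Toy stand-in values (assumption: gates drawn at random in (0, 1)).
rng = np.random.default_rng(0)
T, d = 6, 4
i_gates = rng.uniform(size=(T, d))
f_gates = rng.uniform(size=(T, d))
candidates = rng.normal(size=(T, d))

c_rec = memory_cell_recurrent(i_gates, f_gates, candidates)
c_sum = memory_cell_weighted_sum(i_gates, f_gates, candidates)
print(np.allclose(c_rec, c_sum))
```

Both functions produce the same final state, which is why the memory cell can be read as a weighted sum over past inputs, with weights set dynamically by the gates.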

The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing (Dror et al., ACL 2018)
Statistical significance testing is a standard tool designed to ensure that experimental results are not coincidental. However, a survey of recent publications in ACL and TACL shows that hypothesis testing is largely disregarded within the NLP community. This paper tries to establish the fundamental concepts of significance testing related to NLP tasks, experimental setups and evaluation measures. It also proposes a practical protocol for statistical significance test selection in NLP setups and accompanies this protocol with a brief survey of the most relevant tests.
Sidenote: Although I find this paper very interesting, significance testing itself has long attracted a lot of criticism.
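
As an illustration of what such a test looks like in practice, here is a minimal sketch of a paired permutation (approximate randomization) test. It is not the selection protocol proposed in the paper, only one commonly used test. The function name, the toy scores and the number of resamples are assumptions made for the example.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided p-value for the mean per-example difference between two systems."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the two system labels are exchangeable,
        # so each paired difference keeps or flips its sign with probability 0.5.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total / len(diffs)) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical per-example scores (1 = correct, 0 = wrong) for two systems.
system_a = [1] * 20
system_b = [0] * 20
p_value = paired_permutation_test(system_a, system_b)
print(p_value)
```

A small p-value suggests that the observed difference would be unlikely if the two systems were interchangeable. Which test is appropriate still depends on the task and the evaluation measure, which is the point of the paper.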

Now as promised,

Deep Reinforcement Learning for Natural Language Processing!

First, most of what I’m presenting here is taken from the original tutorial, Deep Reinforcement Learning for Natural Language Processing.

March 2016 was a milestone for artificial intelligence. AlphaGo (Google’s AI) beat 18-time world champion Lee Sedol in a five-game Go match and was awarded the highest Go grandmaster rank (honorary 9 dan). Fast forward to ACL 2018, where William Yang Wang presented some papers applying DRL to NLP tasks. The list includes:

  • Information extraction: Narasimhan et al., EMNLP 2016
  • Relational reasoning: DeepPath (Xiong et al., EMNLP 2017)
  • Sequence learning: MIXER (Ranzato et al., ICLR 2016)
  • Text classification:
    • Learning to Active Learn (Fang et al., EMNLP 2017)
    • Reinforced Co-Training (Wu et al., NAACL 2018)
    • Relation Classification (Qin et al., ACL 2018)
  • Coreference resolution:
    • Clark and Manning (EMNLP 2016)
    • Yin et al. (ACL 2018)
  • Summarization:
    • Paulus et al. (ICLR 2018)
    • Celikyilmaz et al. (ACL 2018)
  • Language and vision:
    • Video Captioning (Wang et al., CVPR 2018)
    • Visual-Language Navigation (Xiong et al., IJCAI 2018)
    • Model-Free + Model-Based RL (Wang et al., ECCV 2018)

In this article, I will focus on only one specific application of DRL, dialogue, but feel free to check the papers cited above.

Deep Reinforcement Learning for Dialog

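Traditional sequence-to-sequence models (Sutskever et al., 2014; Jean et al., 2014; Luong et al., 2015) generally take the following form: an encoder network reads the input message and a decoder network generates the response token by token.

They are trained with the following loss function: Loss = -log p(response | message)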
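
As a small worked example of this loss, the sketch below assumes that the decoder factorizes p(response | message) into per-token probabilities, so the loss is the sum of the negative log-probabilities of the response tokens. The function name and the probabilities are made up for illustration.

```python
import math

def sequence_nll(token_probs):
    """Return -log p(response | message) for a token-by-token decoder."""
    # p(response | message) is the product of the per-token probabilities,
    # so its negative log is the sum of the per-token negative logs.
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities the model assigns to each token of the reference response.
toy_token_probs = [0.5, 0.25, 0.8]
loss = sequence_nll(toy_token_probs)
print(round(loss, 4))
```

The loss is low when the model assigns high probability to every token of the reference response. It says nothing about how the conversation unfolds afterwards.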

A major issue with this architecture is that it handles long-term dialogue success poorly. Two problems stand out:

Repetitive responses:

Assume that two bots start a discussion and that, at some point, one of them says something like “see you later!”. The other will answer “see you later!”. To this the first one will answer “see you later!”, and so on and so forth until the end of time.

Short-sighted conversation decisions:

A bot asks the other, “How old are you?” The second answers, “I’m 16.” The first replies, “16?”, and at this point the conversation goes south: the second bot gives a dull response (preset by the developer), “I don’t know what you are talking about”, and the first answers, “You don’t know what you are saying”. The conversation then continues with these two dull responses. The problem here is that, given the context, a human would most probably not answer with a bare “I’m 16”; they would say something that allows a follow-up response.

One solution using RL is to define a reward function that penalizes this behavior and use it on top of the seq2seq model described above. A first reward reduces the likelihood of generating a response that leads to a dull utterance:
r(response) = -log p(dull utterance | response)
The second maximizes the information flow by penalizing repetitive answers, where s1 and s2 are vector representations of two consecutive turns:
r = -log sigmoid(cos(s1, s2))
The final reward function maximizes the meaningfulness of the dialog:
r = log p(response | message) + log p(message | response)
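
The three rewards above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log-probabilities would come from trained seq2seq models, averaging over a list of dull utterances is my assumption, and the function names and combination weights are hypothetical.

```python
import math

def ease_of_answering(dull_log_probs):
    """r1: minus the average log-probability of dull replies given the response."""
    return -sum(dull_log_probs) / len(dull_log_probs)

def information_flow(s1, s2):
    """r2: -log sigmoid(cos(s1, s2)) for two consecutive turn vectors."""
    dot = sum(a * b for a, b in zip(s1, s2))
    norms = math.sqrt(sum(a * a for a in s1)) * math.sqrt(sum(b * b for b in s2))
    cosine = dot / norms
    return -math.log(1.0 / (1.0 + math.exp(-cosine)))

def semantic_coherence(log_p_forward, log_p_backward):
    """r3: log p(response | message) + log p(message | response)."""
    return log_p_forward + log_p_backward

def combined_reward(r1, r2, r3, weights=(0.25, 0.25, 0.5)):
    """Weighted sum of the three rewards (weights are illustrative assumptions)."""
    return weights[0] * r1 + weights[1] * r2 + weights[2] * r3

# Toy inputs standing in for model outputs.
r1 = ease_of_answering([math.log(0.05), math.log(0.10)])
r2_repeat = information_flow([1.0, 0.0], [1.0, 0.0])   # same turn repeated
r2_fresh = information_flow([1.0, 0.0], [0.0, 1.0])    # unrelated turn
r3 = semantic_coherence(math.log(0.2), math.log(0.3))
print(r2_repeat < r2_fresh, combined_reward(r1, r2_fresh, r3))
```

A repeated turn has cosine similarity 1 and therefore receives a lower information-flow reward than a turn pointing in a new direction, which is how repetition gets penalized.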

Using reward functions of this type, or a combination of them, one can simulate a discussion and train the chatbot using the REINFORCE algorithm (Williams, 1992).
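
To give a feel for REINFORCE, here is a toy sketch rather than the tutorial’s training setup. The "policy" is a softmax over three candidate responses instead of a seq2seq decoder, the rewards are fixed made-up numbers, and a running-average baseline is added to reduce variance. All names and hyperparameters are assumptions.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce(rewards, steps=2000, lr=0.1, seed=0):
    """Train a softmax policy over fixed candidate responses with REINFORCE."""
    rng = np.random.default_rng(seed)
    logits = np.zeros(len(rewards))
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choice(len(rewards), p=probs)   # sample a response
        reward = rewards[action]
        # Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs.
        grad_log_prob = -probs
        grad_log_prob[action] += 1.0
        # Policy-gradient step, scaled by how much the reward beats the baseline.
        logits += lr * (reward - baseline) * grad_log_prob
        baseline += 0.05 * (reward - baseline)       # running average of rewards
    return softmax(logits)

# Hypothetical rewards for three candidate responses (the second is the best).
toy_rewards = np.array([0.1, 1.0, 0.3])
final_probs = reinforce(toy_rewards)
print(final_probs.argmax())
```

The update raises the probability of sampled responses that score above the baseline and lowers it for those that score below. In the dialogue setting the same idea is applied to whole generated utterances, with the reward coming from functions like the ones described above.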

In the results reported in the tutorial, the RL model gives more human-like answers than the traditional systems.
It’s worth mentioning here that, to achieve these results, the reward function had to be defined manually. Li et al. (2017) propose an adversarial learning procedure in which they train a generative model to produce response sequences and a discriminator to distinguish between human-generated and machine-generated dialogues. The outputs from the discriminator are then used as rewards for the generator, pushing the system to generate human-like dialogues.

With that, I just want to take the chance to say that this year’s ACL conference introduced some incredible papers along with incredible people all in the beautiful city of Melbourne.
