Vipul Vaibhaw

Combining Supervised Learning with Reinforcement Learning

This blog post summarizes the paper "End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning".


The key idea of the paper is to train a bot (a "dialog system") using both supervised learning (SL) and reinforcement learning (RL).

The paper opens with a nice example of how one person would train another to do customer service. First, the experienced teacher explains the basic controls (such as looking up a customer’s information) and some business rules (such as confirming the customer’s identity or reading a message) to the student, the "agent" in reinforcement learning terms. Second, the student pays close attention to "good" dialogs handled by the teacher. Third, the student takes calls and imitates the teacher, while the teacher monitors and gives constructive feedback that helps the student improve. Finally, the teacher disengages, and the student keeps taking calls and improves from its own experience.

The authors emulate this training process using SL and RL.

Experiments show that SL and RL are complementary: supervised learning alone can derive a reasonable initial policy from a small number of training dialogs, and starting RL optimization from a policy trained with SL substantially accelerates the learning rate of RL.

A similar training pipeline was used in earlier versions of AlphaGo: the system first learned from human examples and then played against itself to improve further.

The authors claim that the neural network can be re-trained in under one second, which means corrections can be made online during a conversation, in real time, on a standard PC without a GPU.

Model description

In short, the developer provides features relevant to their use case to the recurrent neural network. When RL is not active, the action with the highest probability is always selected. When RL is active, the system performs "exploration": the action is sampled from the distribution.

The authors use an LSTM because it can remember past observations for an arbitrarily long time and has been shown to yield superior performance in many domains. The "action mask" in the figure above denotes whether each action is currently available. This information is also passed into the LSTM, although the picture does not show it. After the mask is applied, a renormalization converts the resulting vector back into a probability distribution.
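The masking, renormalization, and action selection described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the three-action probability vector, and the mask values are all assumptions made for the example.

```python
import random

def renormalize(probs, mask):
    """Zero out unavailable actions and rescale the rest to sum to 1."""
    masked = [p * m for p, m in zip(probs, mask)]
    total = sum(masked)
    if total == 0:
        raise ValueError("action mask excludes every action")
    return [p / total for p in masked]

def select_action(probs, mask, rl_active, rng=random):
    """Pick the most likely action, or sample one when RL is exploring."""
    dist = renormalize(probs, mask)
    if rl_active:
        # Exploration: sample an action index from the masked distribution.
        return rng.choices(range(len(dist)), weights=dist, k=1)[0]
    # No RL: always take the highest-probability available action.
    return max(range(len(dist)), key=dist.__getitem__)

probs = [0.5, 0.3, 0.2]   # hypothetical network output over three actions
mask = [0, 1, 1]          # hypothetical mask: action 0 is not available
greedy = select_action(probs, mask, rl_active=False)
```

Here the masked-out action 0 can never be chosen, even though the network gave it the highest raw probability, so the greedy choice falls to action 1.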

In the picture above, the entity extractor extracts the "name", which is Jason Williams in the example. The system then checks whether that information is present in the database; if it is, the information is passed to the LSTM as a tensor.

If the output is an API call, such as connecting the call, the loop ends. If the output is text, the whole process repeats with the updated LSTM state.
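The control flow described here can be sketched as a simple loop. Everything in this snippet is hypothetical: `toy_policy` is a hand-written stand-in for the trained network, and the integer it returns as state merely stands in for the LSTM's recurrent state.

```python
def run_dialog(policy_step, user_inputs):
    """Run turns until the policy emits an API call; return the actions taken."""
    state = None                      # stands in for the LSTM's recurrent state
    actions = []
    for user_input in user_inputs:
        kind, payload, state = policy_step(user_input, state)
        actions.append((kind, payload))
        if kind == "api":             # e.g. connecting the call ends the loop
            break
    return actions

def toy_policy(user_input, state):
    """Hypothetical stand-in for the trained network."""
    turn = (state or 0) + 1
    if "call" in user_input.lower() and turn > 1:
        return "api", "connect_call", turn
    return "text", "Who would you like to call?", turn

actions = run_dialog(toy_policy, ["Hi", "Call Jason Williams", "ignored"])
```

The third user input is never processed, because the API call on the second turn ends the loop.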

Note: for the LSTM, the authors selected 32 hidden units and initialized the forget gates to zero.

Optimizing with supervised learning

A dialog is considered accurate if it contains zero prediction errors. Training was performed using categorical cross entropy as the loss, and with AdaDelta to smooth updates (Zeiler, 2012).
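To make the loss and the two accuracy measures concrete, here is a small sketch. The helper names and the example dialogs are assumptions for illustration; only the definitions (cross entropy as the loss, and a dialog counting as accurate when it has zero prediction errors) come from the text.

```python
import math

def cross_entropy(probs, target):
    """Categorical cross entropy for one turn: -log p(correct action)."""
    return -math.log(probs[target])

def turn_accuracy(dialogs):
    """Fraction of turns whose predicted action matches the label."""
    pairs = [pair for dialog in dialogs for pair in dialog]
    return sum(pred == gold for pred, gold in pairs) / len(pairs)

def dialog_accuracy(dialogs):
    """Fraction of dialogs that contain zero prediction errors."""
    return sum(all(p == g for p, g in d) for d in dialogs) / len(dialogs)

# Hypothetical (predicted, labelled) action ids for two dialogs.
dialogs = [[(0, 0), (2, 2), (1, 1)], [(0, 0), (1, 2)]]
per_turn = turn_accuracy(dialogs)
per_dialog = dialog_accuracy(dialogs)
```

In this made-up data four of five turns are right, but only one of two dialogs is error-free, which shows why dialog accuracy is the stricter measure.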

After a single dialog, 70% of dialog turns are correctly predicted. After 20 dialogs, this rises to over 90%, with nearly 50% of dialogs predicted completely correctly. While this is not sufficient for deploying a final system, this shows that the LSTM is generalizing well enough for preliminary testing after a small number of dialogs.

This paper also explores whether the model is useful for active learning. The goal of active learning is to reduce the number of labels required to reach a given level of performance.

Re-training the LSTM requires less than 1 second on a standard PC (without a GPU), which means the LSTM could be retrained frequently. Taken together, the model building speed and the ability to reliably identify actions which are errors suggest that the approach will readily support active learning.

Optimizing with reinforcement learning

Once a system operates at scale, interacting with a large number of users, it is desirable for the system to continue to learn autonomously using reinforcement learning (RL). With RL, each turn receives a measurement of goodness called a reward; the agent explores different sequences of actions in different situations, and makes adjustments so as to maximize the expected discounted sum of rewards, which is called the return. The authors define the reward as 1 for successfully completing the task and 0 otherwise. A discount of 0.95 is used to incentivize the system to complete dialogs faster rather than slower.
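A short sketch shows why the discount favors shorter dialogs. It assumes the common convention that the first reward is undiscounted and each later turn is multiplied by one more factor of the discount; the function name and the example reward lists are made up for illustration.

```python
def discounted_return(rewards, gamma=0.95):
    """Sum of rewards, each weighted by gamma raised to its turn index."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Reward is 1 only on the final turn of a successful dialog, 0 elsewhere.
short_dialog = discounted_return([0, 0, 1])           # success after 3 turns
long_dialog = discounted_return([0, 0, 0, 0, 0, 1])   # success after 6 turns
failed_dialog = discounted_return([0, 0, 0])          # task never completed
```

Under this convention the three-turn success is worth about 0.9025 and the six-turn success about 0.774, so finishing sooner yields a higher return.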

For optimization, the authors use a policy gradient approach. If the policy after an RL update fails to reconstruct the training set, training switches back to supervised learning. Note that this approach allows new training dialogs to be added at any time, whether RL optimization is underway or not.
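As a rough illustration of a policy gradient step, the sketch below applies a generic REINFORCE-style update to a tiny softmax policy with no state. This is not the paper's LSTM setup: the function names, the learning rate, the baseline, and the three-action example are all assumptions.

```python
import math

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_update(logits, actions, ret, baseline=0.0, lr=0.1):
    """One policy gradient step for a state-free softmax policy.

    The gradient of log pi(a) with respect to the logits is onehot(a) - pi.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for a in actions:
        for i, p in enumerate(probs):
            grad[i] += (1.0 if i == a else 0.0) - p
    # Move the logits in the direction that makes rewarded actions more likely.
    return [x + lr * (ret - baseline) * g for x, g in zip(logits, grad)]

logits = [0.0, 0.0, 0.0]                     # start from a uniform policy
updated = reinforce_update(logits, actions=[1, 1], ret=0.9025)
```

Because the dialog that chose action 1 earned a positive return, the update raises the probability of action 1 and lowers the others.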

This paper has taken a first step toward end-to-end learning of task-oriented dialog systems.

Feel free to ask questions; reach us at vipul[@]chanakyaschool[dot]ai.


©2019 by Deeplearned education pvt ltd