This post is the extended text version of the talk given at AI Tonight @ Rails Reactor, 24 November 2017.


The task itself is simple: generate meaningful responses and support general conversation. It is a building block for many practical applications, such as chatbots, and the main motivation is clearly to make a better product by using conversational agents. This natural language processing task is under active research; nonetheless, it is still in its early stages.

Given the context (previous utterances), the model has to generate, in fact predict, a natural reply. The context could contain the dialog history, information about the speaker, and some additional knowledge. Two approaches can be used to tackle this task:

  • Generative, when a model generates the most probable reply given the context

  • Ranking, when a model scores predefined answers using a similarity function and chooses the best one

In this post, I consider the generative approach, which is based on applying a deep neural network in an end-to-end fashion.


Let’s overview the most popular datasets used in research. The first one is OpenSubtitles[1]. It was created from movie subtitle files (*.srt) and is widely used as an open-domain dataset. Despite the huge number of conversations and topics, the dataset has two disadvantages. First of all, we can’t track the dialog turns because we simply don’t know where the speakers start and stop talking. Secondly, the dataset contains a lot of useless information about vampires, dragons and so on. However, it is worth noting that OpenSubtitles doesn’t contain misspelled words, which can be considered a benefit for preprocessing. Also, 0.45% of the sentences contain the sequence “I don’t know”. That is a high rate considering the huge diversity of the dataset, and it usually leads to the common answers issue in generation.

The second dataset is Twitter, beloved by many researchers in different domains such as sentiment analysis, text classification and so on. Its clear benefits are that we can track the dialog flow and, in addition, collect data about the speaker by parsing the user profile. However, it should be kept in mind that a lot of tweets discuss a link or a photo, and the text without the corresponding resource usually doesn’t make any sense. Moreover, it is a rare case to see a natural conversation.

The third dataset is Ubuntu[2]. The data is collected from the Ubuntu forum and contains a huge number of problem-solution discussions. However, it is domain-specific data, and basically all the discussions revolve around questions and issues with the Ubuntu operating system. Moreover, the dataset contains a huge number of links, commands and scripts.

Both the Twitter and Ubuntu datasets contain slang, misspelled words and emoji, which is critical for real-life solutions but requires additional preprocessing and some tricks during model training.

Conversational Models

Before going further with the article, it is recommended to be familiar with Recurrent Neural Networks: please read more in the wildml blog post, Andrej Karpathy's post, or just google it :).

Generating a sentence based on another sentence was deeply explored in machine translation. The default approach is to use a sequence-to-sequence model[3]. The idea is quite simple: you have an encoder-decoder architecture, where the encoder “reads” the sentence and the decoder generates the answer taking into account the information from the encoder. Later on, this approach was adopted in neural conversation models by Vinyals, Le and colleagues[4].

Seq2seq Architecture

Seq2seq Architecture (source)

This sequence-to-sequence model was trained on two different datasets: OpenSubtitles (mentioned above) and an IT Helpdesk troubleshooting dataset that is not publicly available. The descriptions and results for both datasets are in the table below.

                     IT Helpdesk                           OpenSubtitles
  Domain             3 common problems (remote access,     Movies
                     software crashes, password issues)
  Preprocessing      Common names, numbers, and full       XML tags and hyperlinks
                     URLs were removed                     were removed
  Training set       30 million tokens                     923 million tokens
  Validation set     3 million tokens                      395 million tokens
  Model              1024 LSTM cells                       4096 LSTM cells + 2048 projection units
  Vocabulary size    20 000 words                          100 000 words
  Perplexity         8                                     17
  N-gram perplexity  18                                    28

Perplexity can be seen as the average number of choices per word in the reply. The higher the perplexity, the more “choices” the model has, and the more random the predicted word is. Models with lower perplexity have more varied probability values, so the model is making “stronger predictions” in a sense. Perplexity can be written as a weighted geometric average of the inverses of the word probabilities.

It captures the degree of ‘uncertainty’ a model has in predicting (assigning probabilities to) some text (see the more detailed explanation and wiki). It is also worth noting that the attention mechanism didn’t improve the perplexity on the OpenSubtitles dataset.
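As a minimal sketch of this definition, perplexity can be computed from the per-word probabilities as the exponent of the average negative log-probability, which is exactly the geometric mean of the inverse probabilities:

```python
import math

def perplexity(word_probs):
    """exp of the average negative log-probability, i.e. the
    geometric mean of the inverse word probabilities."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model that assigns every word probability 1/8 has perplexity 8,
# i.e. on average it "hesitates" between 8 equally likely words:
print(perplexity([0.125, 0.125, 0.125, 0.125]))  # ≈ 8.0
```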

Let’s check the model results on two examples from the paper “A neural conversational model”[4].

Machine: hi , this is <NAME> from helpdesk
connect , what can i help you with today ?
Human: hi , i forgot my password
Machine: ok great
Human: can you tell me how i can recover it ?
Machine: i ’ ll need to verify who it is at that machine
Machine: can we do a hangout?
Human: yes
Machine: <anonymized hangout instructions and verifications>
Human: cool , i am good now
Machine: how ’ s this looking good ?
Machine: i ’ ll go ahead and close this chat .
Machine: take care . : )
Human: bye
Machine: bye .

And another example

Human: what is the purpose of life ?
Machine: to serve the greater good .
Human: what is the purpose of living ?
Machine: to live forever .
Human: what is the purpose of existence ?
Machine: to find out what happens when we get to the planet earth .
Human: where are you now ?
Machine: i ’m in the middle of nowhere .

The results look very promising. Let’s take a closer look at the objective function. For training, the objective is to maximize the probability of each word of the reply given the context and the previously generated words.
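In symbols (a sketch, with S the dialog context, y_1, …, y_T the words of the reply, and θ the model parameters), the training objective is:

```latex
\max_{\theta} \sum_{t=1}^{T} \log p_{\theta}\left(y_t \mid y_1, \dots, y_{t-1}, S\right)
```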

So the conversational model learns the most probable next word of the reply for a concrete context (the dialog history) and the words generated so far. For reply generation, we use the same objective function but maximize it over the whole reply given the context, using beam search. Basically, in order to generate the next word of the reply, we choose the k words with the highest likelihood and append them to the previously generated word sequences to form new hypotheses. Then we score the reply hypotheses and keep the most probable ones.
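The beam search procedure described above can be sketched in a few lines. The `toy_model` below is a hypothetical hard-coded distribution standing in for a trained decoder; a real model would return next-word log-probabilities from the network.

```python
import math

def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    """Keep the `beam_size` best partial replies, extend each with every
    candidate next word, re-score, and prune back to `beam_size`."""
    beams = [([], 0.0)]  # (word sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:       # finished hypotheses pass through
                candidates.append((seq, score))
                continue
            for word, lp in next_log_probs(seq).items():
                candidates.append((seq + [word], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# Hypothetical "model": a hard-coded next-word distribution for illustration.
def toy_model(prefix):
    if not prefix:
        return {"i": math.log(0.6), "ok": math.log(0.4)}
    if prefix == ["i"]:
        return {"see": math.log(0.7), "know": math.log(0.3)}
    return {"</s>": 0.0}  # log(1.0)

best, score = beam_search(toy_model, beam_size=2, max_len=3)[0]
print(best)  # → ['i', 'see', '</s>']
```

Note that greedy decoding (beam size 1) would commit to the single best word at each step, while the beam keeps alternatives alive in case a slightly worse word leads to a better overall reply.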

The loss function for model training is still cross entropy. Here you can read a great explanation of the connection between maximum likelihood and cross-entropy as a loss function.

The current approach could be improved by using an attention mechanism, more layers in the encoder and decoder, bi-directional layers, different penalties and strategies for beam search[5], and so on. Generally speaking, all the techniques applied in machine translation[6] are potentially applicable to a conversational sequence-to-sequence model. There is an open-source implementation called OpenNMT.

However, conversational models also face issues that do not arise in the machine translation task.

Common Answers

The first challenge is called common answers. For example, the reply “I don’t know” is relevant to almost any question. Thus, during training the model converges to common replies, which have a lower loss value across the whole dataset than more specific ones. Take a look at the example below.

Common answers issue

Common answers issue (source)

To solve this issue, a diversity-promoting objective function was introduced[7]. The idea is simple and elegant: add a maximum mutual information term. The new objective function takes into consideration the probability of restoring the context from the response.
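Using S for the context and T for the reply, the bidirectional variant of the objective from the paper can be written roughly as follows, where λ weights the backward term:

```latex
\hat{T} = \arg\max_{T}\ \left\{ \log p(T \mid S) + \lambda \log p(S \mid T) \right\}
```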

So now the conversational model tries to choose the most probable reply, but one from which the original question could be restored. According to the authors, they achieved an increase in BLEU score of up to 36% and a more than 200% jump in unigram diversity. However, this approach requires two seq2seq models, one to generate the reply from the context and another the context from the reply, which of course has a negative influence on performance.

Context Sampling

To overcome the limitations of the mutual information approach, the paper Generating High-Quality and Informative Conversation Responses with Sequence-to-Sequence Models[8] introduces another modification to the objective function. Instead of calculating the probability of the context given the generated reply, we calculate the probability of the generated reply given random contexts. Basically, it is a Monte Carlo estimate of the maximum mutual information term.

So, having the generated answer, we also calculate how likely we are to get the same reply given random contexts; this approach therefore doesn’t require training another seq2seq model.
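A minimal sketch of this idea: penalize a candidate reply by how likely it is under randomly sampled other contexts, a Monte Carlo stand-in for the reply's marginal probability. The `toy_log_p` scorer and all names here are hypothetical, standing in for a trained seq2seq model's log-likelihood.

```python
import math
import random

def mc_mmi_score(reply, context, log_p, other_contexts, k=5, lam=0.5):
    """Conditional likelihood of the reply, penalized by its average
    likelihood under k randomly sampled other contexts."""
    sampled = random.sample(other_contexts, k)
    # log-mean-exp over sampled contexts approximates log p(reply)
    marginal = math.log(sum(math.exp(log_p(reply, c)) for c in sampled) / k)
    return log_p(reply, context) - lam * marginal

# Hypothetical toy scorer: a generic reply scores the same everywhere,
# a specific reply scores high only for its own context.
def toy_log_p(reply, context):
    if reply == "i don't know":
        return math.log(0.3)          # likely under any context
    if reply == "camping this weekend" and context == "plans?":
        return math.log(0.4)
    return math.log(0.01)

random.seed(0)
others = ["weather?", "food?", "music?", "sports?", "movies?", "news?"]
generic = mc_mmi_score("i don't know", "plans?", toy_log_p, others)
specific = mc_mmi_score("camping this weekend", "plans?", toy_log_p, others)
print(specific > generic)  # → True: the specific reply now wins
```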


Question: What do you have planned tonight?

Seq2seq: I have no idea what I want to do with my life.
         I have no idea what I want to
Human ratings: Bad, Mediocre, Bad

MMI: I was planning on doing a few of these, but I’m not sure
     what to choose. I need to find a place to camp, and
     I don’t have a lot of time to make a plan.
Human ratings: Excellent, Good, Mediocre

And another example

Question: What do you hope to see in the future?

Seq2seq: I want to be able to go back in time to when I was born.
         I want to be
Human ratings: Mediocre, Bad, Mediocre

MMI:  I want to see an actual movie where the characters have the
      ability to hide the fact that they are actually in the movie
Human ratings: Good, Acceptable, Acceptable

As we see, the proposed approach pushes the seq2seq model to generate more specific and diverse replies.

Inconsistency in the Answers

Another challenge is the consistency of the answers. Generally speaking, the model has no idea about the entity it represents (he or she, doctor or engineer, and so on). The example below clearly shows that a model could give different answers to paraphrases of the same question. I’m sure you would definitely prefer a chatbot that is consistent in its answers.

Human: what is your job ?
Machine: i ’m a lawyer .
Human: what do you do ?
Machine: i ’m a doctor .

Human: Where were you born?
Machine: I was born in Canada.
Human: Where are you from?
Machine: England, you?
Human: Where did you grow up?
Machine: I grew up in Texas.

The solution was proposed in the paper A Persona-Based Neural Conversation Model[9]. As we train on different dialogs, we can modify the model to incorporate information about the speaker. This is done by adding, at the decoding stage, a speaker embedding vector that corresponds to the author of the current reply. This allows us to condition the generated replies on the speaker and thus learn user-specific replies. So now, when the seq2seq model generates the next word of the reply, it takes into consideration not only the previous word and the context, but also the speaker embedding vector, as shown in the picture.

A Persona-Based Neural Conversation Model

A Persona-Based Neural Conversation Model (source)
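The conditioning mechanism itself is tiny. A sketch with hypothetical toy vectors (in the paper both embedding tables are learned jointly with the seq2seq parameters):

```python
# Hypothetical toy embedding tables; real ones are learned, not hand-set.
word_emb = {"i": [0.1, 0.3], "work": [0.2, 0.5]}
speaker_emb = {"user1": [0.9, 0.1], "user2": [0.2, 0.8]}

def decoder_input(word, speaker):
    """At every decoding step the previous word's embedding is concatenated
    with the speaker embedding, so generation is conditioned on who speaks."""
    return word_emb[word] + speaker_emb[speaker]

print(decoder_input("i", "user1"))  # → [0.1, 0.3, 0.9, 0.1]
```

Because the speaker vector enters every step, two users asking the model the same question can get different but internally consistent replies.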

The example shows that diverse replies were generated for the same question but with different user vectors.

Q: What company do you work for?
Baseline: I work for a company.
User1: I don’t have a job.
User2: I’m a manager.
User3: I’m working.
User4: I work for a company.
User5: I’m working.
User6: Customer service.
User7:  I work in retail.
User8: I work in a cafe.
User9: I work at a gas station.
User10: I don’t know

This example shows that replies are consistent within a dialog for a concrete user.

Q: Where is your hometown?
A: I’m from England.
Q: Where are you from?
A: I’m from England.
Q: In which city do you live now?
A: I live in London.
Q: In which country do you live now?
A: I live in England.
Q: What is your major?
A: Business. You?
Q: What did you study in college?
A: I did business studies.

Lack of knowledge

The next challenging problem is that our conversational model has no idea about general facts. For example, the model was trained on a dataset from 2016, when the president of the USA was Barack Obama. Consequently, the model has no idea that a new president has been elected.

A research team from the Information Sciences Institute and Microsoft Research proposed to use a facts encoder[10]. Given the context (conversation history), a sequence encoder is applied. In parallel, the relevant facts are retrieved using a simple keyword-based information retrieval engine and fed into the facts encoder. The next step is to join the outputs of the two encoders; the new context-fact representation is fed into the decoder to generate a reply. Basically, they add a separate facts encoder to the seq2seq model. Note that the “world facts” are just a large collection of text entries from Foursquare, Wikipedia or Amazon reviews. That’s why a fact is nothing more than a snippet of text and may contain subjective and inaccurate information.
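A keyword-based retrieval step can be as simple as ranking facts by word overlap with the context. This is a hypothetical sketch, not the paper's actual retrieval engine:

```python
def retrieve_facts(context, facts, top_k=2):
    """Rank fact snippets by the number of words they share with the
    conversation context and return the top_k matches."""
    ctx_words = set(context.lower().split())
    scored = sorted(
        facts,
        key=lambda f: len(ctx_words & set(f.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

facts = [
    "great ramen place near the station",
    "the museum is closed on mondays",
    "cheap parking behind the museum",
]
print(retrieve_facts("is the museum open today", facts, top_k=1))
# → ['the museum is closed on mondays']
```

A production system would use an inverted index with tf-idf weighting rather than raw overlap, but the role in the architecture is the same: feed the retrieved snippets into the facts encoder.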

In order to answer the question “Who is the president of the USA?” we need to carefully create and update a collection of facts or a knowledge graph.


The last but not least problem is the lack of proper metrics. Currently, evaluation mostly uses the same metrics as machine translation, such as BLEU, ROUGE and METEOR. In the paper “How NOT To Evaluate Your Dialogue System”[11] the authors show that these metrics do not correlate with human judgments. This is due to the fact that the metrics are based on n-grams. That’s why, during evaluation, the answers “Please restart your laptop” and “Don’t forget to reboot the PC” are considered dissimilar despite having the same meaning.

Correlation metric table

Correlation metric table (source)
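The failure mode is easy to demonstrate with a crude unigram-overlap score standing in for what BLEU-style metrics reward:

```python
def unigram_overlap(candidate, reference):
    """Fraction of candidate words that also appear in the reference —
    a crude stand-in for n-gram-based metrics like BLEU."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(w in ref for w in cand) / len(cand)

# Same meaning, zero word overlap — an n-gram metric scores this as 0:
print(unigram_overlap("please restart your laptop",
                      "don't forget to reboot the pc"))  # → 0.0
```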


So far so good. The neural conversational model is an active area of research; however, the existing approaches are not ready for open-domain tasks in business. On the other hand, with a proper dataset for a closed domain we are ready to implement intelligent chatbots and dialog systems. Please leave questions in the comments.


  1. P. Lison and R. Meena, “Automatic turn segmentation for Movie & TV subtitles,” in Spoken Language Technology Workshop (SLT), 2016 IEEE, 2016, pp. 245–252.
  2. R. Lowe, N. Pow, I. Serban, and J. Pineau, “The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems,” arXiv preprint arXiv:1506.08909, 2015.
  3. I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
  4. O. Vinyals and Q. Le, “A neural conversational model,” arXiv preprint arXiv:1506.05869, 2015.
  5. M. Freitag and Y. Al-Onaizan, “Beam search strategies for neural machine translation,” arXiv preprint arXiv:1702.01806, 2017.
  6. Y. Wu et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
  7. J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan, “A diversity-promoting objective function for neural conversation models,” arXiv preprint arXiv:1510.03055, 2015.
  8. Y. Shao, S. Gouws, D. Britz, A. Goldie, B. Strope, and R. Kurzweil, “Generating high-quality and informative conversation responses with sequence-to-sequence models,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 2210–2219.
  9. J. Li, M. Galley, C. Brockett, G. P. Spithourakis, J. Gao, and B. Dolan, “A persona-based neural conversation model,” arXiv preprint arXiv:1603.06155, 2016.
  10. M. Ghazvininejad et al., “A knowledge-grounded neural conversation model,” arXiv preprint arXiv:1702.01932, 2017.
  11. C.-W. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Charlin, and J. Pineau, “How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation,” arXiv preprint arXiv:1603.08023, 2016.

If you have any questions or remarks, or found a mistake, please contact me.