How to Measure and Improve Automatic FAQ Answers

--

When we start a new recruitment chatbot project that includes the FAQ automation feature, we use a Starter Set of questions, which is then enriched with FAQs relevant to the client before go-live. After go-live, the system is put to the test with real usage, and improves over time with training.

When I say “Starter Set,” I mean something like sourdough starter: it may not look like the end goal, but it is essential for success. In a good automated FAQ system, robust algorithms are crucial; however, the dataset can have a big influence, especially a dataset that is, as is common in the world of chatbots, developed before go-live with limited exposure to real data. What the chatbot knows and doesn’t know on Day 1 already shapes how candidates interact with it, and influences its learning over time. Fortunately, we have experience with many multi-year live projects, which allows us to find consistent trends in FAQ topics relevant to candidates. These trends guided the development of Starter Set v1.0 in 2018, and its revision this summer.

How can the chatbot know more from Day 1?

How were we able to substantially improve our Starter Set? We used a combination of automation and manual review to grow our initial v1.0 set from ~1K questions over 47 categories to ~3–4K questions (in German and English, respectively) over 68 categories. The new v2.0 set reflects patterns in 130,000 questions asked by real candidates in multiple live projects over a year, and has been thoroughly anonymized. In this post, I’ll answer:

  1. What is the difference between accuracy and automation, and why does it matter?
  2. How do these measures help guide the iterative process of creating a good, clean starting dataset for an FAQ chatbot?
  3. When we went from 47 categories to 68, we both added and removed categories; how did we decide what to change, and what was the role of automation and manual review in this process?
Figure 1: Summary of improvement in Starter Set v2.0, compared to v1.0. Automation and Accuracy axes have been truncated for readability; all performance numbers are based on multiple trials, and on test data not used during set development. Details of the accuracy measurement method are described in a previous publication.

Crucial to this human-machine collaborative data development process is how FAQ answer performance is measured. We use two measures, automation and accuracy, which are related but distinct. When it comes to an FAQ chatbot in recruitment, not every incoming question is a frequently-asked question; the key insight is to recognise that not everything should be automated, and to measure not only the capacity to automate as much as possible, as well as possible, but also the capacity to correctly decline to answer something that is outside of the FAQ dataset. We developed a measure of accuracy, nex-cv (cross-validation with negative examples), that is especially useful and is described in a previous publication.

However, accuracy is not the only important measure for the Starter Set. nex-cv is an internal estimate of accuracy, and v1.0 already had relatively high accuracy compared to typical live FAQ sets, in part because more categories make high data quality difficult to maintain. Therefore, we also used the automated-response rate as the main measure of success: the goal was to improve the coverage of the Starter Set. In other words, how can the chatbot know more from Day 1?

What is the difference between accuracy and automation, and why does it matter? In the example performance in Figure 2 below, automation would refer to the part of each pie that isn’t “No Response” / “Not FAQ.” Meanwhile, accuracy is something independent of the pie chart: it is a comparison to the ground truth, and includes the case where the ground truth label might indicate that no automated response is the best response. In the case of the Starter Set, a high accuracy is a vital, difficult prerequisite; but increasing automation will help us expand the chatbot’s knowledge prior to go-live.

Ground truth: given a question, the recruiters’ judgment of what the right answer category is — as opposed to an automated guess.
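
To make the distinction concrete, here is a toy sketch in Python; this is illustrative only, not our production code, and the category names and example data are invented. It shows how the two measures can diverge on the same set of predictions:

```python
# A toy illustration (not the production code) of how "automation" and
# "accuracy" can diverge on the same set of predictions. The category
# names and example data are invented.

NO_RESPONSE = "No Response"  # the chatbot declines to answer automatically

def automation_rate(predictions):
    """Share of incoming questions that received an automated answer."""
    answered = [p for p in predictions if p != NO_RESPONSE]
    return len(answered) / len(predictions)

def accuracy(predictions, ground_truth):
    """Agreement with the recruiters' judgment, counting correct declines:
    cases where the right response is to give no automated answer."""
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)

# Answering almost everything maximizes automation, but not accuracy.
truth       = ["Salary", "Benefits", NO_RESPONSE, "Process", NO_RESPONSE]
predictions = ["Salary", "Process",  "Benefits",  "Process", NO_RESPONSE]

print(automation_rate(predictions))  # 0.8 -- four of five questions answered
print(accuracy(predictions, truth))  # 0.6 -- only three match the ground truth
```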

With Starter Set v2.0, we see that the top 20 topics in each language cover nearly three quarters of the questions in the test data (Fig 2). At a high level, these topics are similar to the main topics in v1.0: details of the application process, and questions about qualifications and benefits. The differences are subtle, but important.

During development, we aimed to improve coherence by splitting categories whose questions often end up in different categories in practice.

For example, between v1.0 and v2.0, one category that used to be about the company language (e.g., English or German) was split into two: (1) about the company language, and (2) about the language acceptable in the application. This split allows each individual category to perform better, and because both of the language categories are part of the top 20, it has a surprisingly big impact on the capacity of the dataset as a whole to handle the test set.

Figure 2: Summary of responses to the test data, using the v2.0 set. In this case, the “No Response” section means no automated response was found; the “Not FAQ” section was excluded because it was handled by a different feature, like small talk. The rest cover topics that, at a high level, stay very similar to the main topics in v1.0. Although the Top 20 topics cover a large portion of the data, the overall high performance would not be possible without the other 48 topics that cover the “long tail,” shown here in the “All Others” section.

So, how do these measures help guide the iterative process of creating a good, clean starting dataset for an FAQ chatbot?

We had a total of 130K questions across 2 languages, and in the very beginning we split these into a test set and a development set. The test set for EN was 14K questions, and for DE 30K; this test set was not used at all until the end. In any data-driven project where you are experimenting with the structure of the data or the algorithm, it is essential to leave out a test set. All the numbers reported (in Figures 1 and 2) reflect the results of the completed v2.0 sets on this unseen test data.
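
As a rough illustration of that one-time split (a sketch under the assumption of a simple shuffled split; the function name and call pattern are invented, not the actual tooling):

```python
# A minimal sketch of the one-time split (illustrative; not the actual
# tooling): freeze a per-language test set up front, then do all
# development work on the remainder.
import random

def split_questions(questions, test_size, seed=0):
    """Shuffle once with a fixed seed, then freeze the test set."""
    rng = random.Random(seed)
    shuffled = questions[:]              # leave the original list untouched
    rng.shuffle(shuffled)
    return shuffled[:test_size], shuffled[test_size:]   # (test, development)

# For example, roughly 14K questions held out for EN and 30K for DE:
# test_en, dev_en = split_questions(questions_en, test_size=14_000)
# test_de, dev_de = split_questions(questions_de, test_size=30_000)
```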

The development set, on the other hand, was used extensively and repeatedly for deciding which categories stay in the Starter Set, and which do not. For v1.0, this process was based on automated unsupervised clustering and manual review, but for v2.0 it was significantly more data-driven, and started from v1.0. Each language went through 8–10 distinct iterations, adding and removing categories at every round.

How did we decide what to change, and what was the role of automation and manual review in this process? Each iteration went like this (a rough sketch in code follows the list):

  1. Grow the dataset by automatically accepting very confident guesses.
  2. Automatically suggest changes in categories to (1) improve coherence: split categories which contain questions that often end up in different categories in practice; and (2) reduce unrealistic overestimation: remove or reduce categories that appear more in the training data than in real, incoming questions.
  3. Manually review newly-added questions (including anonymizing them) and the suggested changes. Especially in the first iterations, the suggestion list is very long, so suggestions are prioritized by performance at the level of each category (F1 score, a common measure).
  4. Check the measures of success, accuracy (nex-cv) and auto-response, still using a test split from within the development set; the held-out test sample is used only at the end.
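
Here is a rough Python sketch of steps 1 and 3 of that loop. The classifier interface, the confidence threshold, and the helper names are placeholders for illustration, not the actual system:

```python
# A rough sketch of steps 1 and 3 above (placeholders, not the actual system):
# auto-accept very confident guesses to grow the set, and rank categories by
# F1 so the weakest ones are manually reviewed first.
from collections import defaultdict

CONFIDENCE_THRESHOLD = 0.9   # assumed value, for illustration only

def auto_accept(candidate_questions, classifier):
    """Keep incoming questions whose predicted category is very confident.
    `classifier.predict` is a hypothetical interface returning (category, confidence)."""
    accepted = []
    for question in candidate_questions:
        category, confidence = classifier.predict(question)
        if confidence >= CONFIDENCE_THRESHOLD:
            accepted.append((question, category))
    return accepted  # still subject to manual review and anonymization

def per_category_f1(ground_truth, predicted):
    """Per-category F1 scores, sorted worst-first to prioritize manual review."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(ground_truth, predicted):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    scores = {}
    for cat in set(ground_truth) | set(predicted):
        precision = tp[cat] / (tp[cat] + fp[cat]) if (tp[cat] + fp[cat]) else 0.0
        recall = tp[cat] / (tp[cat] + fn[cat]) if (tp[cat] + fn[cat]) else 0.0
        scores[cat] = (2 * precision * recall / (precision + recall)
                       if (precision + recall) else 0.0)
    return dict(sorted(scores.items(), key=lambda kv: kv[1]))
```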

This process was repeated until it was no longer possible to grow the set: for example, at one point, the EN v2.0 dataset had nearly 10K questions, but through anonymization and review, this was reduced to about 3K, with the same high accuracy and automation. The role of automation is to greatly speed up the manual review; however, manual review is crucial because:

  1. All data must be fully anonymous if the Starter Set is to be used for new projects. This means that individual questions cannot contain any non-anonymous data, but also that topics that are highly specific to particular live projects should be excluded.
  2. The categories and topics must make sense, and match between the different languages: although this process takes place over each language individually at first, it is important that there is coherence between the localized versions of the dataset.

Over the last several years, we have considered many aspects of the human element of chatbot learning and data quality maintenance, and the development of Starter Set v2.0 was no exception: although guided by optimising accuracy and automation, it was ultimately a collaboration between the development and implementation teams. This collaboration enabled the data, which reflects the real needs of job-seekers and candidates, to be contextualized and understood throughout the process.
