Why (and How) Explainable AI Matters for Chatbot Design

--

When the goal is to design chatbots that are “honest and transparent when explaining why something doesn’t work,”* how do we, practically, do that? In practice, what doesn’t work, and why, can be difficult to find out, let alone explain to the end user.

Get the PDF of our position paper for a recent workshop on AI and HCI at CHI2019: https://arxiv.org/abs/1905.03640

At jobpal, explainability means actionable and understandable feedback that helps project managers iterate on a chatbot during design, implementation, and maintenance, and especially improve data quality.

Earlier this year, we took part in a workshop on AI and Human-Computer Interaction (HCI) at CHI2019 in Glasgow to learn about current approaches in Explainable AI (“XAI”) and related topics that might help us build better, more transparent NLP/ML-based chatbots.

For the rest of this post, I will focus on the takeaways about Explainable AI from the workshop, and apply these ideas to two of the three possible sources of error mentioned above: data quality and NLP/ML system improvement.

What is Explainable AI?

Most of the specific cases and applications discussed were about machine learning (the system learns from examples), and some were about reinforcement learning (the system learns from feedback). The term “AI” or “AI systems” is useful for referring to a range of systems that use human input and improve over time; but, in this case, it does not mean “general AI”. In all cases, we talked about very concrete forms of process automation or partial automation.

When you use our own careers bot, the questions you ask are classified against an NLP/ML dataset we maintain. We provide answers for many question topics; questions that fall outside the known topics are forwarded to a person.

In the case of jobpal’s recruitment chatbots, the part that is automated using machine learning is question answering. The system improves over time as people train it and improve the quality of the example data.
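To make that concrete, here is a minimal sketch, assuming a scikit-learn text classifier, of the kind of decision such a question-answering component makes: classify an incoming question against known topics and, when the top prediction is not confident enough, hand the conversation over to a person. The training examples, answers, threshold, and the `answer` function are illustrative assumptions, not jobpal’s actual system.

```python
# Minimal sketch of intent classification with a human-handoff fallback.
# The classifier, topics, and threshold are illustrative assumptions.
from typing import Optional

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Example training data: (question, topic) pairs maintained as the NLP/ML dataset.
TRAINING_DATA = [
    ("What is the salary range for this role?", "salary"),
    ("How much does this position pay?", "salary"),
    ("Can I work from home?", "remote_work"),
    ("Is remote work possible?", "remote_work"),
    ("When will I hear back after applying?", "application_status"),
    ("How long does the hiring process take?", "application_status"),
]

ANSWERS = {
    "salary": "Salary depends on experience; the posted range is in the job ad.",
    "remote_work": "Most roles allow partial remote work.",
    "application_status": "You should hear back within two weeks of applying.",
}

CONFIDENCE_THRESHOLD = 0.5  # below this, defer to a human recruiter

questions, topics = zip(*TRAINING_DATA)
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(questions, topics)


def answer(question: str) -> Optional[str]:
    """Return an automated answer, or None to signal handoff to a person."""
    probabilities = model.predict_proba([question])[0]
    best_index = probabilities.argmax()
    if probabilities[best_index] < CONFIDENCE_THRESHOLD:
        return None  # outside the known topics: forward to a recruiter
    topic = model.classes_[best_index]
    return ANSWERS[topic]


if __name__ == "__main__":
    # Each call prints either an automated answer or None (human handoff),
    # depending on how confident the toy model is.
    print(answer("What does the job pay?"))
    print(answer("Do you sponsor work visas?"))
```

In a real deployment the handoff decision would likely also depend on conversation context and topic coverage; the single threshold here is only meant to illustrate the mechanism.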

Many of the “Explainable AI” papers presented at CHI2019 used the term AI this way: there was an AI component that had to be made explainable to the user, but the technology behind it was often ML. For example, Microsoft Research’s Kocielnik, Amershi, and Bennett, in their paper “Will you Accept an Imperfect AI? Exploring Designs for Adjusting End-User Expectations of AI Systems”, looked at how end-user acceptance of an email-based meeting scheduler’s suggestions was affected by surfacing information about those suggestions.

Read the full paper here: http://saleemaamershi.com/papers/chi2019.AI.Expectations.pdf

One finding that particularly stood out to me: contrary to expectations based on anecdotal reports from practitioners, user satisfaction with and acceptance of a system that makes more False Positive mistakes (High Recall) can be significantly higher than for a system optimized for High Precision. The authors hypothesize that because users can recover more easily from a False Positive in their interface (highlighting can simply be ignored) than from a False Negative (no highlighting, so more careful reading and manual scheduling of the meeting are required), “the optimal balance of precision and recall is likely in part a function of the cost of recovery from each of these error types.”

When it comes to chatbots, what is the cost of recovery from different types of errors? In this case, a “false positive” could be something like a wrong answer given to a question, and a “false negative” could be deferral to a human even though the question is actually covered by the dataset. Recovery from a false positive might mean that the end user rephrases their question; recovery from a false negative might mean that the end user has to wait for an answer from a recruiter, which can take some time (half of applicants approach recruitment bots outside of working hours).
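As a back-of-the-envelope illustration of that trade-off, the sketch below weights the false positives and false negatives of two hypothetical bot configurations with assumed recovery costs (rephrasing a question versus waiting for a recruiter). All counts and costs are made-up numbers for illustration, not measurements from our system.

```python
# Back-of-the-envelope comparison of error-recovery cost for two hypothetical
# chatbot configurations. All counts and costs are illustrative assumptions.

# Assumed cost of recovery, in minutes of end-user effort or waiting:
COST_FALSE_POSITIVE = 0.5   # wrong answer given -> user rephrases the question
COST_FALSE_NEGATIVE = 120   # deferred to a human -> user waits for a recruiter

# Hypothetical error counts per 1,000 in-scope questions.
configurations = {
    "high_precision": {"false_positives": 20, "false_negatives": 150},
    "high_recall": {"false_positives": 80, "false_negatives": 40},
}

for name, errors in configurations.items():
    total_cost = (
        errors["false_positives"] * COST_FALSE_POSITIVE
        + errors["false_negatives"] * COST_FALSE_NEGATIVE
    )
    print(f"{name}: total recovery cost of {total_cost:.0f} minutes per 1,000 questions")
```

With these assumed numbers the high-recall configuration comes out far cheaper overall, which mirrors the paper’s hypothesis; with a much faster human response time, the balance could tip the other way.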

One of the paper’s authors, Dr. Amershi, also co-authored the Guidelines for Human-AI Interaction with Microsoft Research colleagues, which likewise stress that an AI system should “make clear why the system did what it did” (point 11). As a vendor, we have to think about the explanations we provide to all project stakeholders during the course of iterative design and implementation.

Problem Solving versus Problem Setting

Who the target audience for an explanation is depends on the AI system’s context and problem domain. When the target audience is the end user, they encounter the AI system briefly, and the explanation informs how they interpret a short interaction. The workshop’s keynote speaker, Dr. Wortman Vaughan, discussed some findings from prior work on laypeople’s trust in models, and how it is affected by presenting a “stated accuracy.” In a sequence of large-scale controlled trials, they found that the system’s stated accuracy influenced perception, but as people used the system and observed its accuracy in practice, their own estimates mattered much more than what the system stated.

When I work on explanations of the ML/NLP systems at jobpal, the target audience is the other stakeholders in the design and development process, both internally and in the external teams we work with. The main distinction between the layperson case above and the case relevant to B2B chatbot building was referred to at the workshop as “problem solving versus problem setting”: the stakeholders in the process of chatbot building need to understand the AI system well enough to propose and enact changes to it, based on their own respective agendas.

Figure from our position paper showing the many responsibilities of project managers at jobpal, who bridge external stakeholders and internal technical staff using a variety of shared documents, such as data quality guidelines.

One of the challenges I described initially was data quality. As an engineer and researcher, I want high-quality data for interesting experiments and exciting new features; but, as an engineer and researcher, I have no influence on it. In the world of data science, data quality is a major topic, and data cleaning is a time-consuming, necessary, and not particularly fun activity.

The presentation and paper “Gamut: A Design Probe to Understand How Data Scientists Understand Machine Learning Models” introduced a system used to explore the same real-estate prediction problem as the work by Dr. Wortman Vaughan mentioned above, but with a different audience: people with some prior exposure to data science or ML. The creators of Gamut found that their interactive tool was used not only to explore and understand the AI system, but also to dig into data quality:

  • “Hypothesis generation. As participants used Gamut, they constantly generated hypotheses about the data and model while observing different explanations.”
  • “Data understanding. … [W]hile a predictive model has its own uses, e.g., inference and task automation, many participants explained that they use models to gain insight into large datasets.”
  • “Communication. … [N]early every participant described a scenario in which they were using model explanations to communicate what features were predictive to stakeholders who wanted to deploy a model in the wild.”

These observations were consistent with one of the main immediate benefits of making our AI system explainable internally: improved data quality from day one on projects, resulting not only in better chatbots, but also in the opportunity to experiment with ML system improvements in the core product.

The Measure of Success

In the prior sections, I have sketched out what explainable AI is and who its audience is. When it comes to building B2B chatbots, “explainable” means understandable and actionable, and the audience includes the variety of stakeholders involved, on the vendor’s side as well as in the team that is ultimately responsible for chatbot content. When the AI system is understandable and actionable to this audience, they can make informed decisions and take informed actions on not only the NLP data, but also the surrounding conversational flow UX, to provide a better experience to the end user.

But what does it mean for an AI system explanation to be successful? Workshop participants identified three different, complementary ways of measuring success. First, the person being explained to has a sense of understanding. Second, if the person puts their understanding into action, it has the expected effect. Third, the explanation is true to how the AI system functions. The second, practical concern is already difficult to articulate and measure in an experimental setting, and the third even more so.

I started this article by asking what is practically needed to design chatbots that are “honest and transparent when explaining why something doesn’t work.” Explainable AI (and related terms) refers to ongoing research at the intersection of Human-Computer Interaction (HCI) and AI that tries to make the behavior of AI systems explainable. These systems are not “general AI” but typically ML components (with some variation) embedded in other complex systems. In the case of building recruitment chatbots, ML and NLP inform just one of several aspects of custom chatbot projects.

*To be “honest and transparent when explaining why something doesn’t work” quotes Actions on Google’s design guidelines, and it has been part of the Dialogflow documentation since at least 2017, when I first started quoting it. It is basically a good idea, and I have seen a version of it in almost every design guideline relevant to chatbots; it is also deceptively obvious: of course we can all agree! But then what? It was great to start this conversation internally and to draw from so much existing and interesting work, and there is a great deal left to explore in this space.
