Rethinking Chatbot Testing and Evaluation

Every chatbot, whatever its design, needs to be tested on its fundamental use: answering conversationally, with an appropriate level of detail. Testing should not be limited to NLP tasks such as question answering, information retrieval, and summarization. It should also cover varied test cases for how chatbots such as Bard and ChatGPT actually work in conversation. Why does conversation need to be tested as a task in its own right? Because a system can score well on question answering and still fall short as a chatbot. A chat is not one question and one answer; it is a conversation that can contain many kinds of questions, and each run can generate different answers.

Chatbot testing can be divided into two kinds:

  1. Primary testing: testing chat conversation, which is the primary task of a chatbot.
  2. Secondary testing: testing other NLP or CV tasks through the chatbot, for example evaluating it on the TREC dataset.

Chatbots need to be tested more comprehensively. What matters is not who tested the chatbot, even though Google reported that 80,000 employees tested its chatbot. What matters is what was tested: what the test cases were, what accuracy the system achieved, and why such datasets are not made public so that every chatbot can be run on them and release its results. Results for the subsidiary models that were tested could be presented too. So the 80,000 employees are not to blame for an error in the chatbot; the fault appears to lie in missing protocols, first for development and then for testing.

When a chatbot is tested on standard datasets of chatbot conversations, those datasets must include gold-standard answers. The next step is to compute accuracy, precision, and recall on these conversations. We need baselines and agreed measures against which such systems can be tested. At present these systems are tested on secondary tasks, even though their primary task is chat conversation.
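As a rough illustration of scoring a conversation against gold answers, here is a minimal sketch in Python. The article does not say how precision and recall should be defined for chat turns, so this sketch assumes token-overlap precision and recall per turn and exact match for accuracy. The function names (normalize, turn_scores, evaluate_conversation) and the sample conversation are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def turn_scores(predicted, gold):
    """Return (exact_match, precision, recall) for one chatbot turn.

    Assumption: precision and recall are token overlap between the
    chatbot's reply and the gold answer; accuracy is exact match.
    """
    pred_tokens = normalize(predicted)
    gold_tokens = normalize(gold)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both the reply and the gold answer.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(gold_tokens) if gold_tokens else 0.0
    exact = 1.0 if pred_tokens == gold_tokens else 0.0
    return exact, precision, recall


def evaluate_conversation(turns):
    """Average the per-turn scores over a list of (predicted, gold) pairs."""
    if not turns:
        return {"accuracy": 0.0, "precision": 0.0, "recall": 0.0}
    scores = [turn_scores(p, g) for p, g in turns]
    n = len(scores)
    return {
        "accuracy": sum(s[0] for s in scores) / n,
        "precision": sum(s[1] for s in scores) / n,
        "recall": sum(s[2] for s in scores) / n,
    }


# Hypothetical two-turn conversation: (chatbot reply, gold answer).
conversation = [
    ("Paris", "paris"),
    ("It is about 2 million people", "about 2 million"),
]
result = evaluate_conversation(conversation)
print(result)
```

Token overlap is only one possible choice. It rewards replies that contain the gold answer but penalizes verbosity through precision, which may or may not suit a given chatbot benchmark.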

An enormous amount of manual effort has gone into building standards for translation, prompts, subsidiary testing on secondary tasks, and data editing. So why not put some of that effort into testing chatbots on chats of all kinds, with gold answers provided in the dataset so that conversations can be evaluated? This is how primary testing should be done. It can always be supplemented with secondary application testing, in which the chatbot is used to solve traditional NLP problems.
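One way such a dataset might be organized is sketched below in Python. Everything here is an assumption made for illustration: the record layout (Turn, Conversation), the chatbot interface (a callable that takes the history and the new user message), and the stand-in echo_bot, which exists only so the sketch runs.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# History is a list of (user message, chatbot reply) pairs so far.
History = List[Tuple[str, str]]


@dataclass
class Turn:
    user: str  # what the user says in this turn
    gold: str  # gold answer supplied by the dataset


@dataclass
class Conversation:
    kind: str  # hypothetical label, e.g. "factual" or "small talk"
    turns: List[Turn] = field(default_factory=list)


def run_conversation(chatbot: Callable[[History, str], str],
                     conv: Conversation) -> List[Tuple[str, str]]:
    """Replay a conversation turn by turn and collect (reply, gold) pairs."""
    history: History = []
    pairs = []
    for turn in conv.turns:
        reply = chatbot(history, turn.user)
        pairs.append((reply, turn.gold))
        # The chatbot's own reply goes into the history, so later turns
        # are answered in the context the chatbot itself created.
        history.append((turn.user, reply))
    return pairs


def echo_bot(history: History, message: str) -> str:
    """Stand-in chatbot used only to make the sketch runnable."""
    return f"turn {len(history) + 1}: {message}"


conv = Conversation(
    kind="factual",
    turns=[
        Turn("Capital of France?", "Paris"),
        Turn("And its population?", "about 2 million"),
    ],
)
pairs = run_conversation(echo_bot, conv)
for reply, gold in pairs:
    print(f"reply={reply!r} gold={gold!r}")
```

Feeding the chatbot's own replies back as history, rather than the gold answers, is a design choice: it tests the conversation as a whole, so an early mistake can affect later turns, which is the behavior a single question-answering benchmark would miss.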

