People often turn to automated answering tools such as conversational agents because of their 24/7 availability and unbiased responses. However, chatbots are normally designed for specific purposes and domains and cannot answer questions outside their scope. Chatbots employ Natural Language Understanding (NLU) to infer their responses. There is a need for a chatbot that can learn from inquiries and expand its domain of expertise over time. Such a chatbot must be able to build profiles representing intended topics, much as the human brain does, for fast retrieval. This study proposes a methodology to enhance a chatbot's brain functionality by clustering the available knowledge base into sets of related themes and building representative profiles. We evaluated the proposed methodology on a COVID-19 information dataset; the pandemic has been accompanied by an “infodemic” of fake news. The chatbot was evaluated by a medical doctor and in a public trial with 308 real users. Evaluations were collected and statistically analyzed to measure effectiveness, efficiency, and satisfaction as described by the ISO 9241 usability standard. The proposed COVID-19 chatbot relieves doctors from answering routine questions, and chatbots more broadly provide an example of using technology to handle an infodemic.
Artificial Intelligence (AI) enables machines to act intelligently without being explicitly programmed for every situation, learning instead from continuous interaction with the environment and with users. There is therefore great interest in developing smart conversational agents (chatbots) that interact intelligently with users; some chatbots also interact with appliances and other devices.
The characteristics of smart chatbot systems include the following:
A chatbot can interact simultaneously with multiple users, reducing the need for service employees; it can work continuously, 24/7; it is psychologically unaffected by customers; and it significantly reduces expenses.
Conversation agents are either dialogue systems or chatbots. Dialogue systems perform specific functions such as making flight reservations [
We propose a methodology to mimic the human brain by grouping related topics for ease of classification, fast retrieval, and increased accuracy of chatbots. Specifically, we propose a COVID-19 chatbot.
Coronaviruses infect animals and humans and may severely affect health. The novel coronavirus (SARS-CoV-2) detected in December 2019 caused the worldwide COVID-19 pandemic [
Many countries in the Middle East have begun awareness-raising campaigns focusing on prevention rather than treatment, including tips for dealing with COVID-19 and preventing its spread, fighting rumors around it, emphasizing hand-washing, remaining at home, avoiding crowds, practicing social-distancing, and identifying symptoms [
Governments around the world desire to stop the spread of COVID-19. Increasing awareness of pandemic effects is a high priority. This study investigates the development of a chatbot to respond to coronavirus inquiries and share information and advice to help reduce the spread of the virus. This can efficiently increase awareness, for the following reasons:
The majority of people like using recent technologies; chatbots reduce anxiety and stress by combating the infodemic of fake news [ Chatbots are available around the clock, and their information is updated quickly and authoritatively; they can interact with thousands of people simultaneously at low cost; chatbots reduce demands on healthcare practitioners; the proposed system can significantly reduce the cost of healthcare awareness services; and a chatbot merely requires an internet connection, which makes it convenient, efficient, and fast.
The remainder of this paper is organized as follows. Section 2 presents a literature review covering chatbots, particularly in the medical area. Section 3 describes the methodology and implementation of the proposed chatbot. Section 4 presents an evaluation. Section 5 provides concluding remarks and suggestions for future work.
People communicate with each other primarily through conversation, and intelligent conversational agents communicate in the same way. Moreover, recent conversational agents have convinced the users engaging with them that the agent has human-like attributes [
Researchers have experimented with chatbots in areas such as education, health, and business. Their potential in education has been analyzed [
Black et al. [
Researchers have discussed AI techniques to fight infodemics, both directly and indirectly. Twitter is a significant foundation for infodemiology research [
Applications and evaluation measures of health-related chatbots were reviewed [
A study reviewed the role of AI to provide information to prevent COVID-19 infection [
Tanoue et al. [
We completed two research objectives. The first was to propose an architecture for a chatbot with a human-like brain profile to improve its response accuracy. The second was to develop a chatbot with a credible knowledge base drawn from the World Health Organization (WHO) and the Kingdom of Saudi Arabia (KSA) Ministry of Health.
During data gathering, we retrieved COVID-19 health information from the official websites of the WHO and KSA Ministry of Health (
A medical doctor helped us to order relevant information about each topic of the chatbot repository, as shown in
During development, we found that retrieving the appropriate answer to a question could be quite difficult, because questions are short sentences. After tokenization and stop-word removal, only a few words remain to be processed in order to understand the context and find the answer. Although we tried different similarity methods to find the best match for an inquiry, the accuracy of the answers was questionable and the classification error was large. We therefore adopted a methodology and structure that yields better accuracy.
The knowledge base prepared in the previous step was preprocessed and converted to a structured format suitable for the chatbot inference engine. The development team transformed it into entities (e.g., places, objects) and intents (what the user wants to obtain as a response). A doctor clustered the accepted dataset into groups of questions with similar intents, keywords, and phrases (terms).
The output is a number of clusters, each containing a group of questions and related answers with unique IDs. Clustering reduces the similarity calculations: an inquiry is compared against cluster profiles rather than against every question in the dataset. Each cluster is associated with a list of terms, the cluster profile, that represents the cluster and distinguishes it from the others. These terms are generated by tokenizing the questions in the cluster. Stop-words are removed to keep only the meaningful words. The remaining tokens are then stemmed with the Porter algorithm to reduce words with the same meaning (e.g., ‘liked’ and ‘liking’) to a common root (‘like’), and the frequency of each stemmed word is counted. Finally, the terms are weighted with TF-IDF [
The weight for each term is calculated as the TF-IDF product w(t, c) = tf(t, c) × log(N / df(t)), where tf(t, c) is the frequency of term t in cluster c, N is the number of clusters, and df(t) is the number of clusters whose questions contain t.
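The profile-building step described above (tokenize, remove stop-words, stem, weight with TF-IDF) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list is a toy subset, and the `stem` function is a crude suffix stripper standing in for the full Porter stemmer.

```python
import math
import re

# Small illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in",
              "how", "what", "do", "does", "can"}

def stem(token):
    # Crude suffix stripping that stands in for the Porter stemmer:
    # it maps e.g. 'liked' and 'liking' to the common stem 'lik'.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def profile_terms(clusters):
    """For each cluster (a list of question strings), return a dict of
    profile terms weighted by TF-IDF: w(t, c) = tf(t, c) * log(N / df(t))."""
    n = len(clusters)
    counts = []
    for questions in clusters:
        freq = {}
        for question in questions:
            for raw in re.findall(r"[a-z]+", question.lower()):
                if raw in STOP_WORDS:
                    continue  # drop stop-words before stemming
                token = stem(raw)
                freq[token] = freq.get(token, 0) + 1
        counts.append(freq)
    # df(t): number of clusters whose questions contain term t
    df = {}
    for freq in counts:
        for term in freq:
            df[term] = df.get(term, 0) + 1
    return [{term: tf * math.log(n / df[term]) for term, tf in freq.items()}
            for freq in counts]
```

Note that a term appearing in every cluster receives weight zero (log 1 = 0), so only terms that distinguish a cluster from the others survive, which is exactly the property the profiles rely on.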
The Levenshtein distance measures the difference between statement texts. The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions required to transform one string into the other.
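A standard dynamic-programming implementation of the Levenshtein distance, shown here as a sketch of the measure rather than the paper's exact code:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[len(b)]
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion); identical strings have distance 0.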
Our second goal is to help KSA authorities to increase awareness of the COVID-19 pandemic through the chatbot. The structured knowledge base for the chatbot was constructed from information from the WHO and KSA Ministry of Health.
The NLP engine identifies intents and entities from an inquiry, and a list of candidate responses is generated. The response with the highest weight is sent back to the user as the response. Chat history is saved in a MongoDB database [
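The candidate-selection step can be sketched as below. The function name and the fallback message are illustrative assumptions; the source only specifies that the highest-weighted candidate is returned.

```python
def select_response(candidates):
    """Pick the candidate answer with the highest weight.

    `candidates` is a list of (answer, weight) pairs as produced by the
    NLP engine's similarity matching; the structure is illustrative.
    """
    if not candidates:
        # Hypothetical fallback when no candidate matches the inquiry.
        return "Sorry, I don't have an answer for that yet."
    answer, _ = max(candidates, key=lambda pair: pair[1])
    return answer
```

In the deployed system the chosen response would then be sent to the user and the exchange appended to the chat history store.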
We evaluated the ability of the chatbot to understand inquiries, and to accurately respond to them in a timely manner. We performed two experiments to evaluate the effectiveness of (1) clustering questions and answers into groups of related topics and contexts; and (2) the proposed chatbot.
We validated the proposed architecture for improving similarity calculation and question–answer matching. Different classification algorithms were used to evaluate its performance with increasing volumes of questions per round: 100, 200, 300, 400, 500, and 600.
To test the accuracy of the classification of answers to questions, we used the COVID-QA dataset on Kaggle (
The dataset was split into training (70%) and testing (30%) sets. We measured performance with Accuracy, Precision, Recall, and F1, defined in terms of true/false positives and negatives (TP, FP, TN, FN): Accuracy = (TP + TN) / (TP + TN + FP + FN); Precision = TP / (TP + FP); Recall = TP / (TP + FN); F1 = 2 × Precision × Recall / (Precision + Recall).
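These standard definitions translate directly into code; the classification error reported in the tables below is simply 1 − Accuracy.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, Precision, Recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

For example, with 8 true positives, 2 false positives, 2 false negatives, and 8 true negatives, all four metrics equal 0.8 and the classification error is 0.2.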
The classifiers were trained on a set QA = {(q1, a1), (q2, a2), …, (qn, an)} of question (qi)–answer (ai) pairs. We used the k-nearest neighbors (KNN) and Naïve Bayes classification algorithms on the RapidMiner platform [
The KNN algorithm [
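As a toy sketch of the KNN approach (not the RapidMiner configuration used in the experiments), a query can be assigned the majority label of its k most similar training questions, with similarity computed as cosine similarity over bag-of-words vectors:

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_predict(train, query, k=3):
    """train: list of (Counter, label) pairs; returns the majority label
    among the k training questions most similar to the query."""
    qvec = Counter(query.lower().split())
    ranked = sorted(train, key=lambda item: cosine(item[0], qvec), reverse=True)
    top_labels = [label for _, label in ranked[:k]]
    return Counter(top_labels).most_common(1)[0][0]
```

In the profiled architecture, the same similarity computation is applied against cluster profiles first, which is what reduces the number of comparisons per inquiry.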
| Profile 0 | | Profile 1 | | Profile 2 | | Profile 3 | | Profile 4 | |
|---|---|---|---|---|---|---|---|---|---|
| Term | Weight | Term | Weight | Term | Weight | Term | Weight | Term | Weight |
Viruses | 0.0719001 | Coronavirus | 0.0444876 | Virus | 0.1365204 | WWW | 0.0042182 | Size | 0.0189964 |
Vaccine | 0.0433971 | Question | 0.0327215 | Test | 0.0474834 | Article | 0.0037754 | Mutations | 0.0185899 |
March | 0.0431289 | People | 0.0317176 | Positive | 0.0246366 | Italy | 0.0035626 | Claim | 0.0181024 |
Bleach | 0.0351639 | Days | 0.0288056 | Covid | 0.0211047 | Viral | 0.0034026 | NSP | 0.0174763 |
Update | 0.0325141 | China | 0.0250509 | Negative | 0.0195566 | Immune | 0.0031156 | Human | 0.0173833 |
Surfaces | 0.0304114 | Answer | 0.0243003 | Sensitivity | 0.0195245 | Org | 0.0030745 | Sequence | 0.0165536 |
Water | 0.0269988 | COVID | 0.0231053 | Specificity | 0.0179726 | Com | 0.0030745 | Protein | 0.0156211 |
Bats | 0.0239080 | Cases | 0.0200131 | Viruses | 0.0163624 | Pandemic | 0.0030736 | Diameter | 0.0155753 |
Cold | 0.0230316 | References | 0.0186644 | Infected | 0.0150874 | Cells | 0.0029813 | Lopinavir | 0.0152514 |
Coronavirus | 0.0221041 | Infected | 0.0178177 | Tests | 0.0132564 | Pathogens | 0.0029650 | Origin | 0.0145441 |
| | KNN | | | | | Naïve Bayes | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Number of questions | Accuracy | Precision | Recall | Classification error | F1 | Accuracy | Precision | Recall | Classification error | F1 |
100 | 79.4 | 78.2 | 79.4 | 20.5 | 78.79 | 80.4 | 78.5 | 80.04 | 19.5 | 79.26 |
200 | 82.05 | 80.55 | 82.02 | 17.9 | 81.278 | 83.56 | 80.94 | 82.82 | 16.44 | 81.869 |
300 | 82.05 | 80.55 | 82.02 | 17.9 | 81.27 | 83.56 | 80.94 | 82.82 | 16.44 | 81.86 |
400 | 83.12 | 86.71 | 83.12 | 11.88 | 84.877 | 82.53 | 80.47 | 81.99 | 17.47 | 81.222 |
500 | 84.82 | 83.2 | 83 | 14 | 83.099 | 84.82 | 83.04 | 84.65 | 15.18 | 83.837 |
600 | 87.1 | 85.7 | 87.1 | 12.8 | 86.39 | 88.12 | 86.7 | 88.12 | 11.8 | 87.40 |
The developed chatbot was evaluated thoroughly. All steps in its construction were reviewed, particularly those related to the knowledge base. Most chatbots present options that lead to further layers of options, depending on the user's response. Our chatbot, however, was designed for open conversation without menus, options, or direction from the system. This makes accurate responses more difficult to achieve, for the following reasons:
There are different human expressions for the same inquiry; There are various dialects; Not all user inputs can be predefined; hence, the chatbot must respond to unanticipated questions.
In addition to testing the chatbot with potential users, we asked medical professionals, academics, and students to use the chatbot and answer several questions regarding their level of experience, awareness, satisfaction, and recommendations. User feedback was reviewed by a medical doctor and statistics expert to evaluate the chatbot's efficiency and efficacy.
We employed four evaluation methods, based on (1) in-house testing; (2) expert review; (3) real users; and (4) the ISO 9241 usability standard (effectiveness, efficiency, and satisfaction) [
During development, we trained and tested the chatbot by interacting with it and then retrieving the saved history of all questions asked and the system's responses. We determined the percentage of correct answers. Knowing which questions received wrong answers helped us reclassify some questions, anticipate new ways of asking, and redefine intents and entities. We also learned of inquiries that we had not considered.
Expert evaluation can determine whether chatbot responses are suitable or natural [
| Measure | Value |
|---|---|
| No. of sessions investigated | 796 |
| No. of sessions with incorrect answers to some questions | 81 |
| Precision | 81.5% |
| Accuracy | 89.82% |
The doctor explained some reasons behind the wrong answers. Some users asked strange and irrelevant questions such as “
Some questions, like “
We aimed to assess the following: (1) the effectiveness of the chatbot for real users; (2) the role of the chatbot to increase users’ awareness; and (3) users’ level of satisfaction. To do this, we tested the following research hypotheses (RHs) (
We solicited users through WhatsApp. A Google Forms questionnaire was distributed to determine their awareness and satisfaction. The three-part questionnaire measured: (1) knowledge of using a chatbot system; (2) awareness created by using the chatbot system; and (3) user satisfaction with the chatbot's functionality, effectiveness, response precision, and speed of response. The target population was the 35 million residents of Saudi Arabia. The sample size was calculated using Morgan's table [
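Morgan's table tabulates the Krejcie and Morgan (1970) sample-size formula, which can be computed directly. The sketch below assumes the table's usual defaults: 95% confidence (χ² = 3.841), maximum variability P = 0.5, and a 5% margin of error.

```python
def krejcie_morgan(population, chi2=3.841, p=0.5, d=0.05):
    """Required sample size per Krejcie & Morgan (1970):
    s = chi2 * N * p * (1 - p) / (d^2 * (N - 1) + chi2 * p * (1 - p))."""
    return (chi2 * population * p * (1 - p)) / \
           (d * d * (population - 1) + chi2 * p * (1 - p))
```

For populations in the millions the required sample plateaus: `round(krejcie_morgan(35_000_000))` gives 384, the largest value in Morgan's table.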
| Variable | Description |
|---|---|
| CHB_USE | Ease of use |
| Awareness | Awareness as a method to fight the infodemic |
| Satisfaction | Feeling of acceptance and fulfillment of the need for accurate information |
| Variable | Mean | Std. Deviation |
|---|---|---|
| CHB_USE | 4.1989 | 0.85110 |
| Awareness | 4.2558 | 0.62018 |
| Satisfaction | 4.2873 | 0.79174 |
| Variables | CHB_USE | Awareness | Satisfaction |
|---|---|---|---|
| CHB_USE | 1 | | |
| Awareness | 0.567** | 1 | |
| Satisfaction | 0.799** | 0.649** | 1 |

** Significant at the 0.01 level.
| Hypothesis | Direct beta coefficient | Indirect beta coefficient | Mediation type observed |
|---|---|---|---|
| H1: Chatbot effectiveness → Satisfaction | 0.799 (0.032)** | — | — |
| H2: Chatbot use → Increased awareness | 0.567 (0.034)** | — | — |
| H3: Chatbot use → Satisfaction (through increased awareness) | — | 0.368 (0.038)** | Full mediation |

** Significant at the 0.01 level.
Tests showed that use of the chatbot had a significant effect on user awareness at the 0.01 level (B = 0.567, p-value = 0.000). The correlation of 0.567 between the two constructs indicates a strong direct relationship between chatbot use and users' awareness. Hence, the second hypothesis is supported.
The third hypothesis supposes a mediation effect of user awareness between chatbot use and user satisfaction. Results of a Sobel test indicate a significant mediation effect of users’ awareness on the relationship between using the chatbot program and user satisfaction at the 0.01 level (B = 0.368, p-value = 0.000). Therefore, the third hypothesis is supported.
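The Sobel test statistic for an indirect effect a·b (path a from predictor to mediator, path b from mediator to outcome, with standard errors s_a and s_b) is z = a·b / √(b²·s_a² + a²·s_b²). A minimal sketch, with illustrative coefficients rather than the paper's fitted values:

```python
import math

def sobel_z(a, sa, b, sb):
    """Sobel z statistic for the indirect effect a*b,
    where sa and sb are the standard errors of a and b."""
    return (a * b) / math.sqrt(b * b * sa * sa + a * a * sb * sb)

# Hypothetical path coefficients for illustration only.
z = sobel_z(0.5, 0.1, 0.4, 0.1)
```

A |z| above roughly 2.58 corresponds to significance at the 0.01 level, the threshold used in the analysis above.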
As mentioned above, we adopted the ISO 9241 usability standard to support the chatbot evaluation. This standard is based on effectiveness, efficiency, and satisfaction. Effectiveness concerns the chatbot's ability to fulfill its intended purpose. Efficiency concerns its ability to perform tasks without wasting resources. Satisfaction concerns users' feeling that they get what they need. Of the 308 survey responses, 94% supported the high impact of using technology to promote health awareness, and 83.4% supported the use of the chatbot as a new awareness channel preferable to emails and text messages. While 37.5% of respondents had previously used a chatbot, only 22.5% had tried a smart system to learn about the coronavirus.
| ISO 9241 criterion | Achieved objectives |
|---|---|
| Effectiveness (77%) | Functionality achieved via: credible information from trusted sources; high satisfaction with the language and expressions used to answer inquiries; simple language; easy access; capability of smooth user interaction |
| Efficiency and awareness (72%) | Reliability in increasing coronavirus awareness; efficient and timely response |
| Satisfaction (82%) | Accessibility: satisfaction with ease of dealing with the chat; quality of information; recommending that others use the chatbot; interactivity: satisfaction with use of smart chat; guarantee of user privacy, since no identification or registration is required |
The COVID-19 pandemic has created an urgent need for reliable knowledge. Smart chatbots backed by a trusted knowledge base can help meet this need in three ways. They raise awareness and encourage precautionary measures. They enable health professionals to focus on patients. They counteract the viral spread of fake news.
The proposed chatbot uses NLU to comprehend inquiries and infer responses. A profiling methodology for the knowledge base enhances similarity matching. The proposed chatbot was evaluated while it was built, by a medical doctor to test the accuracy of answers, and by 308 real users. Evaluation results and statistical analyses confirmed its effectiveness, efficiency, and user satisfaction.
For future work, we will consider adding features such as a voice assistant, especially for visually impaired users.