Consumers who have used the past crop of smart AI assistants are waiting for better “intelligence” from these AI-enabled platforms. For companies developing digital assistants, the expectations are even more stringent: brands design scripts for customers to follow that seem intuitive and necessary, at least to the company. Users, by contrast, hope the virtual agent will understand their requests and answer meaningfully in human-like, conversational language. They need help solving problems, and they want the chatbot to feel natural to interact with, including being able to make small talk.
At Elafris, we’ve created an artificial intelligence platform with personable, plain-speaking virtual agents. They communicate with users over a variety of channels, from Facebook Messenger, SMS, and the web to voice-based devices such as Amazon Alexa and Google Home. Our clients are large insurance companies, and we help them create their own white-label versions of the Elafris virtual assistants. The AI platform is designed to augment three primary areas that touch the insurance end user: customer service, insurance agent advising, and rapport-building casual chat. While these tasks must work in concert to serve the policyholder, each domain requires its own distinct approach.
Below we describe the modules in our service, how each one is constructed, the approaches we chose, and the rationale behind those choices. We also share our experience with different tools: when generative neural networks are not the best choice, why we use Word2vec instead of Doc2vec, the charm and horror of ChatScript, and more.
At first glance, the problems we are solving may seem trivial. However, natural language processing involves a number of difficulties, both in the technical implementation and in the human context.
–Nearly two billion people speak English, but each speaker uses it in his own way: there are different dialects, accents, regional idioms, educational backgrounds, and individual speech patterns.
–Many words, phrases, and expressions are ambiguous. A typical example is shown in the meme below.
These are just a few of the readily apparent challenges. Slang, jargon, humor, sarcasm, spelling and pronunciation errors, abbreviations, and other linguistic irregularities make work in this field even more difficult.
To solve these problems, we have developed an AI platform that combines an array of approaches. The AI portion of our system consists of a Dialog Manager, a Recognition Service, and several complex microservices that solve specific problems: an Intent Classifier, an FAQ service, and Small Talk.
The Dialog Manager in the Elafris AI platform is a software simulation of the communication between a customer and a live agent. Just like a live customer service agent, the Dialog Manager guides the user through the conversation scenario toward a useful goal.
To accomplish this, the service first finds out what the user wants, such as calculating the cost of car insurance. Next, it determines the information required from the user or from other sources, such as the address and other user data, and information about drivers and cars. Finally, it must deliver a useful answer: fill in the form and give the client the result of the task. One note of caution: the service must also avoid asking the user questions he has already answered.
The Dialog Manager makes it possible to describe such a scenario programmatically, building it brick by brick from specific questions or actions that should occur at certain moments. In effect, the script is a directed graph. Each node is a message, a question, or an action. Each edge determines the order of the nodes and, where there are multiple ways to move from one node to another, the conditions for choosing among them.
Once a node is closed, control is never transferred to it again, so the user never sees a question that has already been asked. A depth-first search of the graph therefore locates the first open node, which contains the question the user must answer next. This is repeated until all nodes are closed.
As the user answers the questions the Dialog Manager generates, the open nodes in the graph are gradually closed. Once all nodes are closed, the user has completed the prescribed script and is presented with the result, such as a list of available insurance policies and their premiums.
Suppose we start by asking the user for his name, and we also need his date of birth, gender, marital status, and address. We offer the user the option of sending a photo of his driver’s license in a single message instead of entering all of this information. The system extracts all the relevant data and closes the corresponding nodes, so questions about date of birth and gender will no longer be asked.
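The graph traversal described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions and not the Elafris implementation. In the sketch, every node is a question that fills one data field. A node counts as closed once its field has a value, so a license photo that supplies several fields closes several nodes at once. Edges carry optional conditions on the collected data, and the graph has no cycles. The names `Node`, `Edge`, `first_open_node`, `build_scenario`, and the field keys are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Data = Dict[str, str]  # answers collected so far, keyed by (hypothetical) field name


@dataclass
class Node:
    """A scenario step; in this sketch every node asks one question for one field."""
    key: str
    question: str
    edges: List["Edge"] = field(default_factory=list)


@dataclass
class Edge:
    target: Node
    # The edge is followed only when this condition on the collected data holds.
    condition: Callable[[Data], bool] = lambda data: True


def is_closed(node: Node, data: Data) -> bool:
    # Closed as soon as the field has a value, whichever message supplied it.
    return node.key in data


def first_open_node(node: Node, data: Data) -> Optional[Node]:
    """Depth-first search for the first open node reachable under the current data."""
    if not is_closed(node, data):
        return node
    for edge in node.edges:
        if edge.condition(data):
            found = first_open_node(edge.target, data)
            if found is not None:
                return found
    return None  # everything reachable is closed: the scenario is complete


def build_scenario() -> Node:
    """Hypothetical quote scenario; the spouse question applies only to married users."""
    address = Node("address", "What is your address?")
    spouse = Node("spouse_name", "What is your spouse's name?")
    marital = Node("marital_status", "Are you married?", [
        Edge(spouse, lambda d: d.get("marital_status") == "married"),
        Edge(address),
    ])
    gender = Node("gender", "What is your gender?", [Edge(marital)])
    dob = Node("dob", "What is your date of birth?", [Edge(gender)])
    return Node("name", "What is your name?", [Edge(dob)])


root = build_scenario()
data: Data = {}
print(first_open_node(root, data).question)  # What is your name?
# One driver's license photo fills several fields, closing several nodes at once.
data.update({"name": "John Smith", "dob": "1985-04-12", "gender": "male"})
print(first_open_node(root, data).question)  # Are you married?
data["marital_status"] = "married"
print(first_open_node(root, data).question)  # What is your spouse's name?
```

Deriving "closed" from the collected data, rather than storing a flag per node, is one way to satisfy the rule that the user is never asked something he has already answered.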
The Dialog Manager can also handle several topics at once. For example, the user says, “I want to get insurance,” and the conversation proceeds with questions and answers. Then, before completing this dialogue, the user states, “I want to make a payment on one of my other policies.” In such cases, the Dialog Manager saves the context of the first topic and switches to the new scenario. Once the second scenario is completed, the Dialog Manager offers to resume the previous dialogue, returning to the point where the first scenario was interrupted.
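One simple way to model this save-and-resume behavior is a stack of topics. The sketch below assumes such a stack; the source does not say how Elafris stores interrupted contexts. `Topic`, `TopicStack`, and the scenario ids are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Topic:
    scenario: str  # hypothetical scenario id, e.g. "get_quote"
    data: Dict[str, str] = field(default_factory=dict)  # answers collected so far


class TopicStack:
    """Keeps interrupted scenarios so the Dialog Manager can offer to resume them."""

    def __init__(self) -> None:
        self._stack: List[Topic] = []

    @property
    def current(self) -> Optional[Topic]:
        return self._stack[-1] if self._stack else None

    def switch_to(self, scenario: str) -> Topic:
        # The interrupted topic stays on the stack with its collected data intact.
        topic = Topic(scenario)
        self._stack.append(topic)
        return topic

    def finish_current(self) -> Optional[Topic]:
        """Drop the completed topic; return the interrupted one to offer resuming, if any."""
        if self._stack:
            self._stack.pop()
        return self.current


stack = TopicStack()
quote = stack.switch_to("get_quote")
quote.data["name"] = "John Smith"
stack.switch_to("make_payment")        # "I want to make a payment on one of my other policies."
resumed = stack.finish_current()       # payment scenario completed
print(resumed.scenario, resumed.data)  # get_quote {'name': 'John Smith'}
```

Because the first topic keeps its own data, resuming it picks up exactly where the conversation was interrupted.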
Closing a node doesn’t mean its answer can’t be changed: the user can return to questions answered earlier. The system saves a snapshot of the graph each time it receives a message from the client, which makes it possible to return to a prior point in the conversation.
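The snapshot idea can be illustrated with a deep copy of the dialog state taken before each message is applied. In this minimal sketch, a plain dictionary stands in for the graph state. That is an assumption made for brevity. `ConversationHistory`, `on_message`, and `rewind` are hypothetical names.

```python
import copy
from typing import Any, Dict, List


class ConversationHistory:
    """Keeps one snapshot of the dialog state per incoming message."""

    def __init__(self, initial_state: Dict[str, Any]) -> None:
        self.state = initial_state
        self._snapshots: List[Dict[str, Any]] = []

    def on_message(self, updates: Dict[str, Any]) -> None:
        # Deep-copy the state before applying the message so later changes cannot alter the snapshot.
        self._snapshots.append(copy.deepcopy(self.state))
        self.state.update(updates)

    def rewind(self, steps: int = 1) -> None:
        """Restore the state as it was before the last `steps` messages."""
        if not 1 <= steps <= len(self._snapshots):
            raise ValueError("cannot rewind that far")
        self.state = self._snapshots[-steps]
        del self._snapshots[-steps:]


history = ConversationHistory({})
history.on_message({"name": "John Smith"})
history.on_message({"dob": "1985-04-12"})
history.rewind()                           # the user wants to change the date of birth
history.on_message({"dob": "1986-04-12"})
print(history.state)  # {'name': 'John Smith', 'dob': '1986-04-12'}
```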
Besides our current approach, we considered another AI strategy for the Dialog Manager. In this strategy, the user’s intention and parameters are fed into a neural network, and the system itself generates the corresponding conditions and the next question to ask. In practice, however, this method still has to be supplemented with rule-based logic. Such an implementation may suit trivial scenarios, such as ordering food, where only three parameters are needed: what the user wants to order, when he wants to receive it, and where to deliver it. In complex scenarios like those in our subject area, however, this approach is unworkable. At the moment, machine learning technology cannot reliably lead the user to the goal in complex scenarios.
Our Dialog Manager is written in Python using the Tornado framework. Because our AI modules were initially written as a single service, we kept them in one language, so everything could be implemented without spending unnecessary resources on communication between components.
The Elafris platform communicates via a variety of channels, but all of the AI components are fully client-independent: all communication arrives only as proxied text. The Dialog Manager passes the context, the user’s response, and the collected data to the Recognition Service, which is responsible for recognizing the user’s intentions and extracting the necessary data.
Currently, the Recognition Service comprises two logical parts: the Recognition Manager, which manages the recognition pipeline, and the Extractors.
The Recognition Manager is responsible for all the basic stages of understanding a message: tokenization, lemmatization, and so on. It also determines the order in which extractors (objects that recognize entities and features in text) run, which messages are skipped, and when to stop the recognition process and return the finished result. This makes it possible to run only the necessary extractors, in the most typical order.
For example, if we have asked for the user’s name, it is logical to check first whether the response contains a name. If it does and the response contains no other useful text, recognition can stop at this point. If the response includes additional useful entities, recognition must continue. Most likely, the person has included additional personal data, so the personal data extractor needs to run on the response.
Depending on the context, the order of extractor execution can be varied. This flexible approach allows Elafris to significantly reduce the load on the whole service.
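A context-dependent pipeline with early stopping might look like the sketch below. This is not the Elafris code. We assume that each extractor returns the entities it found plus the text it did not consume. We also assume that recognition stops once the leftover text contains only filler words. The regexes, the `EXTRACTOR_ORDER` table, the `FILLER` word list, and the context ids are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical extractor interface: take the not-yet-consumed text and return
# (entities found, text still left over).
Extractor = Callable[[str], Tuple[Dict[str, str], str]]


def regex_extractor(name: str, pattern: str) -> Extractor:
    regex = re.compile(pattern, re.IGNORECASE)

    def extract(text: str) -> Tuple[Dict[str, str], str]:
        match = regex.search(text)
        if match is None:
            return {}, text
        # Cut out the matched span so later extractors only see unconsumed text.
        return {name: match.group(1)}, text[:match.start()] + text[match.end():]

    return extract


EXTRACTORS: Dict[str, Extractor] = {
    "name": regex_extractor("name", r"\bmy name is\s+([a-z]+)"),
    "phone": regex_extractor("phone", r"(\+?\d[\d\- ]{6,}\d)"),
    "zip": regex_extractor("zip", r"\bzip(?: code)?(?: is)?\s+(\d{5})\b"),
}

# Which extractors to try first, depending on the question that was just asked.
EXTRACTOR_ORDER: Dict[str, List[str]] = {
    "ask_name": ["name", "phone", "zip"],
    "ask_phone": ["phone", "name", "zip"],
}

# Words that carry no information once the entities around them are extracted.
FILLER = {"and", "my", "is", "it", "the", "hi", "hello", "thanks", "phone", "number"}


def has_useful_text(text: str) -> bool:
    return any(w not in FILLER for w in re.findall(r"[a-z0-9]+", text.lower()))


def recognize(context: str, text: str) -> Tuple[Dict[str, str], List[str]]:
    """Run extractors in context order; stop early once nothing useful is left."""
    entities: Dict[str, str] = {}
    ran: List[str] = []
    for name in EXTRACTOR_ORDER[context]:
        found, text = EXTRACTORS[name](text)
        ran.append(name)
        entities.update(found)
        if not has_useful_text(text):
            break
    return entities, ran
```

For “My name is John” asked in the `ask_name` context, only the name extractor runs. A reply that also contains a phone number keeps the pipeline going until the phone number has been consumed.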
As mentioned above, extractors recognize particular entities and characteristics in text. For example, one extractor recognizes phone numbers, and another determines whether the person answered a question affirmatively or negatively. A third recognizes and verifies a geographic address in the message, and a fourth handles the user’s vehicle data. Recognition consists of passing a message through a set of such extractors.
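As an illustration of a single extractor, here is a naive affirmative/negative detector. Its word lists and the rule of returning `None` when cues conflict are our assumptions for the sketch. A production extractor would need far richer linguistic handling.

```python
import re
from typing import Optional

# Hypothetical word lists; a real extractor would need much broader coverage.
AFFIRMATIVE = {"yes", "yeah", "yep", "sure", "correct", "right", "ok", "okay", "absolutely"}
NEGATIVE = {"no", "nope", "nah", "not", "never", "incorrect"}


def yes_no_extractor(text: str) -> Optional[bool]:
    """True for an affirmative reply, False for a negative one, None if unclear."""
    words = re.findall(r"[a-z']+", text.lower())
    has_yes = any(w in AFFIRMATIVE for w in words)
    has_no = any(w in NEGATIVE or w.endswith("n't") for w in words)
    if has_yes == has_no:
        # Neither cue or both cues ("not sure"): let other extractors or a follow-up decide.
        return None
    return has_yes
```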
For optimal performance of any complex system, it is necessary to combine approaches. Elafris followed this principle when developing and implementing extractors. Below are highlights of the principles we used in the Elafris extractors.
The extractors also rely on ready-made natural language processing (NLP) libraries.
We examined the NLTK, Stanford CoreNLP, and SpaCy libraries. NLTK’s popularity is immediately apparent: it is the first result in a Google search for natural language processing. It is appealing for prototyping, has extensive functionality, and is quite simple. However, for production-ready solutions, its performance falls short of what enterprise-level products demand.
Stanford CoreNLP is another well-known NLP offering, but it has a serious shortcoming for any non-trivial project: it pulls in the Java Virtual Machine (JVM). SpaCy, by contrast, offers lightness and speed. It performs at ten times the speed of NLTK and offers much better dictionaries. Compared with Stanford CoreNLP, SpaCy is also much easier to use.
Elafris currently uses SpaCy for tokenization, for vectorizing messages with its built-in pretrained neural network, and for primary recognition of parameters in the text. Even so, the library covers only about 5% of our recognition needs, so Elafris has developed many additional features to provide complete recognition functionality.
The Elafris Recognition Service was not always a two-part structure. The initial version was as simple as possible: we ran different extractors in turn to check whether the text contained particular parameters or intentions. These early iterations bore no resemblance to AI; the approach was purely rule-based. The difficulty was that the same intention can be expressed in myriad ways, each of which must be identified and described in the rules. At the same time, context must be taken into account, since the same phrase from the user may require different actions depending on the question that prompted it. For example, to the question “Are you married?” the user may respond, “For two years.” From this response we can infer that the user is married (marital status is a boolean value). Yet when the same response follows “How long have you driven this car?”, the system must extract the value “2 years” and make no inference about marital status.
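The “For two years” example can be made concrete with a tiny rule-based interpreter that looks at the question id before deciding what the answer means. The question ids, the duration regex, and the output field names are hypothetical. This sketch reflects the rule-based style described above, not the later ML approach.

```python
import re
from typing import Dict, Optional, Union

WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
DURATION = re.compile(r"\b(\d+|one|two|three|four|five)\s+years?\b", re.IGNORECASE)


def parse_years(text: str) -> Optional[int]:
    match = DURATION.search(text)
    if match is None:
        return None
    token = match.group(1).lower()
    return int(token) if token.isdigit() else WORD_NUMBERS[token]


def interpret(question_id: str, answer: str) -> Dict[str, Union[bool, int]]:
    """Map the same answer to different data depending on the question asked."""
    years = parse_years(answer)
    if years is None:
        return {}  # e.g. a plain "yes" is left to a yes/no extractor
    if question_id == "are_you_married":
        # A duration in reply to this question implies the user is married.
        return {"married": True}
    if question_id == "years_driving_car":
        # Here the duration itself is the answer; marital status is left untouched.
        return {"years_driving_car": years}
    return {}
```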
From the outset, our team understood that supporting a rule-based solution would require extreme effort and investment: as the number of supported intentions increases, the number of rules grows much faster than the effort a comparable machine learning system would need. From a business perspective, however, we first needed to launch a minimum viable product (MVP) to show its utility and business value, and the rule-based approach let Elafris demonstrate that value quickly. The team kept using the rule-based MVP while working in parallel on a machine learning model for intent recognition. Once launched, the ML approach soon gave satisfactory results, and Elafris transitioned away from the rule-based approach.
For most information extraction tasks, Elafris used ChatScript. This technology provides its own declarative language for writing natural language data extraction templates. With WordNet under the hood, it is very powerful: for example, a recognition pattern can specify “color”, and WordNet will match any narrower concept, such as “red”. At the time, we saw nothing comparable to ChatScript. However, Elafris soon discovered that ChatScript is poorly written and full of bugs, and implementing complex logic with it was nearly impossible. The team decided that ChatScript’s disadvantages outweighed its benefits and abandoned it in favor of NLP libraries in Python.
The first version of the Recognition Service hit a hard ceiling in terms of flexibility: each new feature greatly slowed down the whole system. To resolve this, we rewrote the Recognition Service completely, dividing it into two logical parts: a set of small, lightweight extractors and the Recognition Manager, which manages the whole process.
For the AI platform to communicate effectively, give the necessary information on request, and record the user’s data, it must determine the users’ intentions (intents) from their messages. The list of intents the Elafris platform supports is limited to each Elafris client’s business scope. This scope may include intents such as finding out the terms of insurance policies, filling in personal data, getting answers to frequently asked questions, and similar requests.
There are many approaches to intent classification based on neural networks, in particular recurrent networks built on long short-term memory (LSTM) or gated recurrent unit (GRU) cells. They have proven themselves in recent studies, but they share a common deficiency: they require a very large training sample. On small data sets, such networks are either difficult to train or produce unsatisfactory results. The same applies to Facebook’s fastText framework, which Elafris explored because it is a state-of-the-art solution for handling short and medium-length phrases.
The Elafris training samples are of very high quality: a dedicated team of linguists, native-level English speakers with insurance-specific knowledge, creates the phrase data sets. However, our samples are still relatively small. The team tried to augment our data sets with publicly available ones, but the results did not meet our requirements. Elafris also explored gathering phrase data sets through freelance services such as Amazon Mechanical Turk, but this method also proved ineffective: the results were of particularly poor quality, and all of the work had to be completely rechecked.
Elafris continued to search for a solution that would work on small samples. A Random Forest classifier, trained on messages converted into vectors by our bag-of-words model, showed high quality, and our team selected its optimal parameters with cross-validation. The model’s advantages include higher speed and a smaller required data set, as well as relative ease of deployment and additional training.
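The source does not name a library. This sketch therefore uses scikit-learn: `CountVectorizer` for the bag-of-words model, `RandomForestClassifier` for the classifier, and `GridSearchCV` for the cross-validated parameter search. The phrases, intent labels, and parameter grid are hypothetical and far smaller than a real data set.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical training phrases, four per intent.
phrases = [
    "how much does car insurance cost", "get me a quote for my car",
    "what would my premium be", "price for auto coverage",
    "i want to pay my bill", "make a payment",
    "pay my premium now", "can i pay online",
    "change the name on my policy", "update my address",
    "fix my vehicle vin", "correct my date of birth",
]
labels = ["quote"] * 4 + ["payment"] * 4 + ["update_data"] * 4

pipeline = Pipeline([
    ("bow", CountVectorizer()),                          # bag-of-words vectors
    ("forest", RandomForestClassifier(random_state=0)),  # fixed seed for repeatability
])
search = GridSearchCV(
    pipeline,
    param_grid={"forest__n_estimators": [50, 100], "forest__max_depth": [None, 5]},
    cv=4,  # stratified 4-fold cross-validation: one example of each intent per fold
)
search.fit(phrases, labels)
print(search.best_params_)
print(search.predict(["how much is a quote for my car"]))
```

Retraining here just means calling `fit` again on the extended phrase list, which is part of what makes this kind of model cheap to maintain.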
While working on the Intent Classifier, we realized that it is not the best tool for some tasks. Suppose, for example, the user wants to change the name on the insurance policy or correct the vehicle identification number. For the classifier to identify this intention correctly, every phrase that could be used in this case would have to be added to the data set manually, which is impractical. Instead, we developed a small extractor for the Recognition Service that identifies such intents via keywords and NLP methods. Phrases that the keyword method does not identify are passed to the Intent Classifier.
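The keyword-first, classifier-second routing can be sketched as follows. The keyword groups and intent names are hypothetical, and plain word matching stands in for the unspecified “NLP methods”. The ML classifier is passed in as any callable.

```python
import re
from typing import Callable, Dict, List, Optional, Set

# Hypothetical rules: an intent fires when every keyword group has at least one hit.
KEYWORD_RULES: Dict[str, List[Set[str]]] = {
    "change_policy_name": [{"change", "update", "correct", "fix"}, {"name"}],
    "correct_vin": [{"change", "update", "correct", "fix", "wrong"}, {"vin"}],
}


def keyword_intent(text: str) -> Optional[str]:
    words = set(re.findall(r"[a-z]+", text.lower()))
    for intent, groups in KEYWORD_RULES.items():
        if all(words & group for group in groups):
            return intent
    return None


def classify_intent(text: str, classifier: Callable[[str], str]) -> str:
    """Try the keyword rules first; send anything they miss to the ML Intent Classifier."""
    return keyword_intent(text) or classifier(text)
```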
Many of our clients have FAQ sections, and the goal is to provide users with these answers directly from the AI platform. To do so, we need a solution that (a) recognizes an FAQ query, (b) finds the most relevant answer in the database, and (c) delivers that answer to the user.
There are a number of models trained on the Stanford Question Answering Dataset (SQuAD). These models work well when the FAQ answer text contains the words from the user’s question. For example, suppose the FAQ says: “Frodo said he would take the Ring to Mordor, but he did not know the way there.” If the user asks, “Where does Frodo take the Ring?”, the system will respond, “To Mordor.”
Our scenario was different. For example, the AI platform must respond differently to two similar requests, “Can I pay?” and “Can I pay online?” In the first case, the person is offered a payment form or enters a payment flow. In the second, the system responds, “Yes, you can pay online. Here’s a link where you can pay.”
Another class of text-similarity solutions focuses on longer answers of at least a few sentences that contain the information the user is interested in. Unfortunately, these solutions are very unstable on short questions and answers.
Another solution is the Doc2vec approach. Doc2vec distills a text into a single vector, which is compared with other documents represented the same way to compute a similarity coefficient. For our product’s use case, this approach also had to be set aside: it is designed for long texts, whereas our users’ questions and the answers mostly consist of one or sometimes two sentences.
Our strategy consists of two steps. First, we convert each word in the sentence into a vector using Google’s Word2vec embeddings and average these vectors, representing the whole sentence as a single vector. Second, we take the question vector and find the answer vector in the FAQ database that is closest to it according to a similarity measure, in our case cosine similarity.
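The two steps can be shown end to end with toy data. The 3-dimensional vectors below are made up and stand in for pretrained 300-dimensional Word2vec embeddings. With gensim, real vectors could be loaded via `KeyedVectors.load_word2vec_format`. In this sketch, each FAQ entry is represented by the vector of its question text. That is our assumption about how the entries are indexed. `EMBEDDINGS`, `FAQ`, and `best_faq_answer` are hypothetical.

```python
import math
import re
from typing import Dict, List, Optional, Sequence

Vector = List[float]

# Made-up vectors standing in for pretrained Word2vec embeddings.
EMBEDDINGS: Dict[str, Vector] = {
    "pay": [0.9, 0.1, 0.0], "bill": [0.8, 0.2, 0.1],
    "online": [0.1, 0.9, 0.0], "website": [0.15, 0.85, 0.05],
    "cancel": [0.0, 0.1, 0.9], "policy": [0.2, 0.2, 0.6],
}

# Hypothetical FAQ entries: question text -> answer.
FAQ: Dict[str, str] = {
    "how do i pay my bill": "Let's start your payment.",
    "can i pay online on the website": "Yes, you can pay online. Here's a link where you can pay.",
    "how do i cancel my policy": "To cancel your policy, please contact your agent.",
}


def sentence_vector(sentence: str) -> Optional[Vector]:
    """Step 1: average the vectors of in-vocabulary words; None if no word is known."""
    vectors = [EMBEDDINGS[w] for w in re.findall(r"[a-z]+", sentence.lower()) if w in EMBEDDINGS]
    if not vectors:
        return None
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_faq_answer(question: str) -> Optional[str]:
    """Step 2: return the answer whose FAQ entry vector is most cosine-similar to the question."""
    q = sentence_vector(question)
    if q is None:
        return None
    best_answer, best_score = None, -1.0
    for faq_question, answer in FAQ.items():
        v = sentence_vector(faq_question)
        if v is None:
            continue
        score = cosine(q, v)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```

With these toy vectors, “Can I pay?” lands on the general payment entry, while “Can I pay on the website?” lands on the online-payment entry. This mirrors the distinction the FAQ service has to draw.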
The advantages of this model include ease of implementation, easy extensibility, and simple interpretability. Its main disadvantage is limited room for optimization: the model is difficult to refine, so it either works well in most use cases or must be abandoned.
Sometimes the user writes something absolutely irrelevant, such as “The weather is good today”. This type of phrase is not included in the domain we are focused on for our insurance company clients. However, to feel natural to the end user, the system needs to respond judiciously, demonstrating the intelligence of our system.
For such messages, the Elafris platform combines the approaches described above: responses come either from simple rule-based solutions or from generative neural networks. Because Elafris wanted to prototype rapidly, the team took a public data set from the Internet and used an approach very similar to the one used for the FAQ. For example, if a user has written something about the weather, our system compares the vector representation of the message with the vector representations of sentences in the public data set using cosine similarity. In this way, it finds the entry in the data set closest to the subject of the weather.
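A small-talk responder that tries rules first and then falls back to retrieval might look like the sketch below. To stay short and dependency-free, it substitutes simple bag-of-words count vectors for the embedding vectors the text describes. The rules, the three-pair corpus, the similarity threshold, and the fallback reply are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical rule-based replies, checked before any retrieval.
RULES: Dict[str, str] = {"hello": "Hi there! How can I help with your insurance today?"}

# Stand-in for a public small-talk data set of (message, reply) pairs.
CORPUS: List[Tuple[str, str]] = [
    ("the weather is nice today", "It really is! A good day for a drive."),
    ("it is raining again", "Rainy days call for careful driving."),
    ("i love my dog", "Dogs are the best companions."),
]
FALLBACK = "Sorry, I didn't quite get that."


def bow(text: str) -> Counter:
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def small_talk_reply(message: str, threshold: float = 0.3) -> str:
    words = bow(message)
    for keyword, reply in RULES.items():
        if keyword in words:
            return reply
    # Retrieval: reuse the reply paired with the most similar corpus message.
    score, reply = max((cosine(words, bow(m)), r) for m, r in CORPUS)
    return reply if score >= threshold else FALLBACK
```

The threshold is a design choice in this sketch. It lets the bot admit it did not understand instead of answering with an unrelated reply.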
We do not currently aim to create an AI platform that learns from every message it receives from customers. First, experience shows that this is the road to a quick death: remember how IBM Watson had its database erased after it began using bad words in its diagnoses, and how Microsoft’s Twitter chatbot became racist in a single day. Second, we strive to solve insurance companies’ tasks as well as we can, and a self-learning AI platform is not our business goal. Instead, we have written a number of tools that our linguists and QA team use to train the AI platform manually, reviewing dialogues and correspondence with users during post-moderation.
Still, our AI platform may be ready to take a shot at the Turing test. Some users start a serious conversation with it, believing they are talking to an insurance agent, and one user even threatened to complain to a supervisor when the AI platform misunderstood him.
At present, Elafris is working on a visual representation of the system: displaying the entire scenario graph and composing it through a graphical user interface (GUI).
In the Recognition Service, Elafris is implementing linguistic analysis to recognize and understand the meaning of each word in a message. This will help improve the accuracy of responses and extract additional data. For example, suppose a person filling in a car insurance request mentions that he doesn’t have homeowners insurance. The AI platform can remember this and relay it to an operator, who can then contact the customer with a homeowners insurance offer.
Another feature in the works is feedback processing. After the dialogue with the AI platform is complete, we ask users whether the virtual agent service was satisfactory. If Sentiment Analysis classifies the feedback as positive, the system suggests that the user share his opinion on the linked social networks. If the analysis shows a negative reaction, the AI platform asks what was unsatisfactory and processes the feedback. It then tells the user, “Thank you. We will work to improve based on your feedback,” without offering to share the feedback on social media.
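The routing logic of this feedback flow is simple enough to sketch. The toy word-count scorer below is a hypothetical stand-in for a real Sentiment Analysis model. Treating neutral feedback like negative feedback is our assumption. The reply strings other than the thank-you message quoted above are illustrative.

```python
import re
from typing import Callable

# Toy lexicon standing in for a real Sentiment Analysis model.
POSITIVE = {"great", "good", "helpful", "easy", "fast", "love", "thanks"}
NEGATIVE = {"bad", "slow", "confusing", "useless", "hate", "annoying"}

SHARE = "Glad we could help! Would you like to share your opinion on social media?"
ASK_WHY = "Sorry to hear that. What was unsatisfactory?"
THANKS = "Thank you. We will work to improve based on your feedback."


def toy_sentiment(text: str) -> int:
    """Positive-word count minus negative-word count."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def feedback_reply(text: str, clarifying: bool = False,
                   sentiment: Callable[[str], int] = toy_sentiment) -> str:
    if clarifying:
        # The user is explaining what went wrong; storing that text is out of scope here.
        return THANKS
    if sentiment(text) > 0:
        return SHARE
    # Neutral or negative feedback: ask what went wrong and never suggest sharing.
    return ASK_WHY
```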
One of the keys to making communication with the AI platform as natural as possible is to keep the platform modular and expand the set of reactions available to it, and we are working on these improvements. Perhaps thanks to this drive, users are ready to sincerely accept our AI platform as an insurance agent. The next step is to make people grateful to the AI platform itself.