How many times have you spoken with an AI today? Think Apple’s Siri, Amazon Echo’s Alexa, Google Assistant, or Google Duplex. Nowadays, communication with artificial intelligence has become so seamless that we don’t even give it much thought. Call centers, banks, clinics, hair salons, restaurants - they all use conversational AI to run their operations more efficiently.
As time passes, AI sounds more human than ever. But why is it necessary to make artificial intelligence sound less robotic and more human-like? And what does the process of making your conversational AI app sound indistinguishable from a human, with a natural flow of conversation, look like? Let's dig into this topic.
What constitutes human-like artificial intelligence?
According to Business Wire, the speech and voice recognition market is forecast to grow at a compound annual growth rate of 19.5% between 2021 and 2026. Even today, phones, smart watches, smart speakers, and refrigerators can understand their owners’ commands.
Statista shows that in 2020 there were a bit over 4 billion virtual assistants in use all over the world, and by 2024 that number is expected to double. There is no doubt that not only conversational AI assistants but conversational AI technology as a whole will keep growing rapidly each year. Conversational artificial intelligence strives to be as indistinguishable from a human voice as possible, and there are good reasons for that. Let’s take a look at some features that make conversational AI human-like.
Anthropomorphic features are human characteristics attributed to nonhuman objects, conversational AI included. The human characteristics we’re referring to are, for instance, expressing empathy and showing emotions, such as sadness and happiness.
According to Mengjun Li and Ayoung Suh, “scholars have highlighted that anthropomorphism mitigates individuals’ anxiety and stress when interacting with unfamiliar virtual agents and satisfies their social needs”. In other words, people who converse with an AI feel more at ease if it sounds less robotic.
Now, what makes conversational AI apps sound less robotic? The first thing to take note of is speech disfluencies - filler words, say. Whereas some people strive to rid their speech of “uhh”s, “umm”s, and “like”s, those are exactly what make an AI sound more human. This makes sense given that speech disfluencies can account for up to 20% of spontaneous spoken language.
Conversational AI that is indistinguishable from a human, such as Dasha AI, incorporates these alongside other speech disfluencies, such as prolongations (“sooo”, “aaand”) and hesitations (“I scheduled your dentist appointment for, umm… 3pm tomorrow” - the pause that happens when people try to remember some information before voicing it). When building apps on the Dasha AI conversational AI platform, you won’t have to worry about any of this, as it has already been accounted for.
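To give a feel for how this might work, here is a toy sketch that sprinkles filler words into a reply before it is handed off to speech synthesis. The filler list and insertion probability are made up for illustration; this is not Dasha AI’s actual implementation.

```python
import random

# Illustrative filler inventory; a real system would tune this per voice.
FILLERS = ["umm", "uhh", "like"]

def add_disfluencies(text, probability=0.3, seed=None):
    """Occasionally insert a filler word before a mid-sentence word."""
    rng = random.Random(seed)
    words = text.split()
    out = []
    for i, word in enumerate(words):
        # Never hesitate before the very first word.
        if i > 0 and rng.random() < probability:
            out.append(rng.choice(FILLERS) + ",")
        out.append(word)
    return " ".join(out)

print(add_disfluencies("I scheduled your dentist appointment for 3pm tomorrow", seed=1))
```

A real pipeline would also vary where hesitations land (for example, right before a recalled detail such as a time), but even this naive version shows how little text manipulation is needed to break up the robotic rhythm.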
As we discussed, anthropomorphic characteristics include empathy. AI can pick up on voice inflections, such as whether the person on the other end of the phone sounds happy or angry, and reply in an empathetic way, choosing the conversational path that corresponds to the emotion being projected. The way a person raises or lowers their voice adds to the meaning of what they say, which can be crucial for conversational-AI-as-a-service apps to understand.
Interruptions and pauses
There is nothing more annoying than the moment you realize an AI is speaking to you because it keeps reciting its scripted lines without paying attention to anything you say. Dasha AI, for example, keeps in mind that humans can understand a phrase before it has been fully pronounced. That is why there is a parameter you can activate so that when, say, a customer starts speaking while the AI is trying to convey its message, it pauses and listens to what the customer has to say. Undoubtedly, such pauses are what make conversational AI apps sound more human.
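The idea behind such “barge-in” handling can be sketched in a few lines. The class and event names below are hypothetical, chosen only to illustrate the behavior; they are not the actual Dasha AI API.

```python
# Sketch of barge-in handling: if the caller starts speaking while
# the bot is talking, playback stops and the bot listens instead.
class VoiceBot:
    def __init__(self, interruptible=True):
        self.interruptible = interruptible  # the activatable parameter
        self.speaking = False
        self.transcript = []

    def say(self, phrase):
        # Begin "playing" a synthesized phrase.
        self.speaking = True
        self.current_phrase = phrase

    def on_user_speech(self, utterance):
        # The caller started talking: pause our own speech and listen.
        if self.speaking and self.interruptible:
            self.speaking = False
        self.transcript.append(utterance)

bot = VoiceBot(interruptible=True)
bot.say("Your appointment is confirmed for...")
bot.on_user_speech("Wait, can we reschedule?")
print(bot.speaking)  # False: the bot yielded the floor to the caller
```

With `interruptible=False` the bot would keep talking over the caller, which is exactly the robotic behavior the parameter exists to avoid.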
Now, why is it important for conversational AI to be human-like and use all of the above? Two things:
“In normal speech, we convey emotions through a range of tricks - pauses, the timing of syllables, tone. Even in the lab, the best attempts at putting emotions like anger and fear in synthesized speech successfully convey these feelings only about 60% of the time, and the numbers are even worse for joy,” as per Technology Review. As Alan Black, a CMU speech synthesis professor, says, the key to making AI conversations more natural is the pauses, fillers, laughs, and anticipation that help build rapport and trust: “You need mm-hmm, back channels, hesitations, and fillers, and so far our speech synthesizers can't do that. [...] It's all about building this thing that's close to what humans expect and makes it easier to have this conversation.”
How do you design a conversational app to make your AI sound more human?
Let’s take a look at the way Dasha AI can make it happen.
Dasha AI makes use of speech recognition, which, simply put, is the way the AI understands what a person is saying and translates it into a format it can process. In speech recognition, AI and machine learning are critical, as they help the machine filter out background noise, comprehend different accents and contexts, and so on.
To train our AI models to have human-like voices, we stick to the following procedures.
We use large datasets of audio pre-recorded by real people who read texts aloud while following a set of rules: there should be no background noise in the recording, the reading must respect the punctuation of the text, the emotions and tone should correlate with the text, the voice has to sound as natural as possible, and so on. Once the audios are recorded, we go through a process of audio verification to make sure the rules have been followed, as this is what makes speech synthesis as accurate as possible.
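One of those rules, the no-background-noise requirement, could in principle be screened automatically. The sketch below flags clips whose quietest segment is still too loud, a rough proxy for constant background hum; the frame size and threshold are illustrative, not Dasha AI’s actual verification logic.

```python
# Toy noise-floor check for audio verification. `samples` is a list of
# amplitude values in [-1, 1], e.g. decoded from a WAV file.
def noise_floor_ok(samples, frame=400, threshold=0.01):
    """Return True if the quietest frame's mean absolute amplitude
    stays below `threshold` (i.e., true silence exists in the clip)."""
    frames = [samples[i:i + frame] for i in range(0, len(samples), frame)]
    quietest = min(sum(abs(x) for x in f) / len(f) for f in frames)
    return quietest < threshold

clean = [0.0] * 400 + [0.5] * 400  # silence, then speech
noisy = [0.05] * 800               # a constant hum throughout
print(noise_floor_ok(clean), noise_floor_ok(noisy))  # True False
```

Checks like this only catch the obvious failures; human review is still what catches mismatched tone or unnatural delivery.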
Additionally, we modify our models and continuously add new features such as voice speed control. We ask the person who records the audio materials to speak at three different speeds: slow, normal, and fast. This is done to help the conversational AI understand the speaker regardless of the speed they are used to speaking at, which in turn makes the AI app more effective. Another crucial feature is emotion control. This goes both ways: it accounts for the emotional connotation the AI uses when conversing, and for the emotional connotations it can derive from what the person on the other end of the phone says. To succeed at this tricky task, we train our model to derive those subtleties from the text it generates from the received spoken input. After all, what can be more human than understanding emotions?
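As a rough illustration, the speed and emotion labels described above could be attached to each training clip as explicit conditioning metadata that the synthesis model learns from. The field names and file-naming scheme here are hypothetical, not Dasha AI’s actual data format.

```python
from dataclasses import dataclass

@dataclass
class TrainingClip:
    text: str        # the sentence the speaker read aloud
    audio_path: str  # path to the recorded audio
    speed: str       # "slow" | "normal" | "fast"
    emotion: str     # e.g. "neutral", "happy", "sad"

def expand_speeds(text, base_path, emotion="neutral"):
    """Each sentence is recorded at three speeds, yielding three clips."""
    return [TrainingClip(text, f"{base_path}_{s}.wav", s, emotion)
            for s in ("slow", "normal", "fast")]

clips = expand_speeds("Your order has shipped.", "clip001")
print([c.speed for c in clips])  # ['slow', 'normal', 'fast']
```

Labeling the data this way is what lets the trained model be conditioned on a requested speed or emotion at synthesis time, rather than producing a single fixed delivery.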
As discussed above, speech disfluencies make AI human-like. At Dasha AI, we account for “umm”s and “ahh”s in the very first stage. Our speaker is provided with lines of text containing nearly all possible disfluencies, which she then records. We’ve also been working on training our AI to laugh and cough - it sounds a bit extra, yet such things are natural to all of us, so they should be natural to human-like conversational AI as well.
Additionally, to achieve the not-so-simple task of making conversational AI sound human, we use state-of-the-art models such as GANs for speech synthesis, which help us achieve the most natural-sounding voice for the AI.
To judge just how close to a human voice the models sound, and to check their quality, it’s critical to have the models evaluated by third parties. MTurk is how we do it. We provide our third-party evaluators with 70 different audios synthesized by our model (to evaluate it and compare it with previous versions), spanning various categories such as dates, times, and interjections, and get 1400 evaluations per audio. However, it doesn’t stop there. After the MTurk work is over, we do internal testing and improve the models continuously before sending them to production. This process helps us find what’s missing - for instance, we recently improved our model to pronounce abbreviations correctly and made the question intonation in the most recent model much better.
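Listening evaluations like these are commonly aggregated into a mean opinion score (MOS) on the standard 1-to-5 naturalness scale: each audio’s ratings are averaged, then averaged across audios, so two model versions can be compared with a single number. The scores below are made up for illustration; this is not Dasha AI’s actual evaluation data.

```python
from statistics import mean

def mos(scores_per_audio):
    """Mean opinion score: average rating per audio, then across audios.
    `scores_per_audio` is a list of per-audio rating lists (1-5 scale)."""
    return mean(mean(scores) for scores in scores_per_audio)

# Hypothetical ratings for two model versions on the same test audios.
current_model = [[4, 5, 4], [5, 5, 4]]
previous_model = [[3, 4, 3], [4, 3, 3]]

print(round(mos(current_model), 2), round(mos(previous_model), 2))
```

Averaging per audio first keeps one heavily rated clip from dominating the comparison when the number of ratings varies between audios.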
You can also see the magic behind machine learning services that makes Dasha conversational AI apps human-like here.
Making human-like artificial intelligence is not an easy task. However, by keeping in mind which features make a robot sound unlike a machine, you can successfully build an AI conversation app that will make your business much more efficient and cut your costs.
The good news is that you don’t have to be a developer to automate your business processes with artificial intelligence: Dasha’s conversational AI API can be used by those with limited to no development experience. So why not start making your first app today?