"Do conversational technologies have a future" is something that an executive might ask herself when considering an initiative to automate call center conversations. It is also something that a developer might ask himself when considering whether to add a new conversational AI-as-a-Service tech to his stack of go-to tools.
I can't help but be biased when answering this question. The bias does not come from working for a conversational AI startup. Rather, I ended up at a conversational AI startup because of my strong conviction that I should not have to use my fingers to talk to my phone.
The past
Voice is the native interface for human communication.
150,000 BC, English Channel Islands. We have slain a few mammoths and are feasting. Calories are hitting the bloodstream and a desire to celebrate rises from within. You start humming gratitude, I beat two stones together in rhythm to your song.
We have a system of symbolic language. We started off using a language of simple sounds and gestures; we have progressed to a language of words.
Thousands of years later and we have our own culture, defined by our language. For culture is the product of language, to paraphrase Joseph Brodsky. The words that we speak follow us from our infancy to our death. The songs stick and stay with us for life. Our words are how we express love, fear, gratitude, anger and a myriad other states, thoughts and feelings.
If we encounter tribes, speaking in languages hitherto unknown, we use our hands to gesture, sticks or fingers to draw in sand and speak louder, clearer. We enunciate. We gesticulate. We use the language of gestures until we learn to speak their language or they learn to speak ours. From that point, we converse in words. Because voice is the native interface for human communication.
The present
We live in an amazing world. We live in the literal world I dreamed of as I hungrily devoured all the science fiction in my post-Soviet Russian childhood library. As a civilization, we have finally planted our feet firmly on the doorstep of space for humanity, not only for a select few. Our machines work tirelessly to make our lives better. When we encounter a stranger who does not know our language, we don't have to gesticulate; we don't even have to draw in the sand. We pull out an iPhone and launch Apple Translate (thanks, iOS 14).
Time and time again we see technology develop in ways that allow for more instantaneous and personal communication. And for humans, personal always equals voice. Visual is great but voice is non-negotiable. Want to push back on this one? Just give me an example of the last time you had a video call where you chose to communicate via text chat. I'll wait.
Most, if not all, technologies are technologies of communication. Telegraph, telephone, bank wire, the Internet - they are all ways for humans to communicate. And while we are constantly expanding the ways in which we communicate with each other, we are more or less content to communicate with machines in the same way we did 50 years ago. In 1968, Douglas Engelbart and his team showed us a mouse and a keyboard, and we've been happily using them ever since.
Well, I may be exaggerating. We do have some minor conversational technologies. Over 25% of US adults own a smart speaker. Yet they sit down to work behind a keyboard and a mouse. Supposedly half of US drivers have a voice assistant available in their car. Yet they still whip out the phone to check Facebook as they wait in the drive-through.
If we have ways to interact with machines using voice and given that voice is the native interface for human communication, why are we not always communicating with our machines using voice user interfaces?
The longer answer is that today's voice user interfaces fail on several fronts:
- They are command-response interfaces, not conversational. What I mean is that Alexa is great at turning on some music but you can't have a conversation about music with her, augmented with her playing selections of tunes to illustrate what she is talking about.
- Their voices are either too robotic (fail to pass for human) or too perfect (land in the uncanny valley).
- They are limited in the depth of their conversational structures and breadth of digressions.
- They have major issues with remembering the context of the conversation; many (if not most) treat each new reply as a new conversation.
- They are limited in data access.
All of this explains why today's voice UIs have failed to replace tactile and visual UIs.
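The gap between a command-response interface and a conversational one can be sketched in a few lines of code. The example below is a deliberately simplified toy (it is not any real assistant's API, and the class and method names are mine): one bot treats every utterance as an isolated command, while the other carries dialogue state between turns, so a follow-up like "play something similar" still resolves.

```python
class CommandBot:
    """Command-response: treats every utterance as an isolated command."""

    def reply(self, utterance: str) -> str:
        if "play" in utterance and "jazz" in utterance:
            return "Playing jazz."
        return "Sorry, I don't understand."


class ConversationalBot:
    """Conversational: keeps dialogue state so follow-ups can omit
    what was already established earlier in the conversation."""

    def __init__(self) -> None:
        self.topic = None  # context carried across turns

    def reply(self, utterance: str) -> str:
        if "play" in utterance and "jazz" in utterance:
            self.topic = "jazz"  # remember what we are talking about
            return "Playing jazz."
        if "similar" in utterance and self.topic:
            # resolve the follow-up against remembered context
            return f"Playing more {self.topic}."
        return "Sorry, I don't understand."


cmd, conv = CommandBot(), ConversationalBot()
cmd.reply("play some jazz")
conv.reply("play some jazz")
print(cmd.reply("play something similar"))   # "Sorry, I don't understand."
print(conv.reply("play something similar"))  # "Playing more jazz."
```

Real systems extract intents with machine-learned models rather than keyword checks, but the structural point is the same: without state that survives between replies, every follow-up turn starts from zero.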
The simpler answer is that the future still needs to be built. For now, we are still using the language of gestures to communicate with machines.
The future
The future of voice technologies is Iron Man's Jarvis, 2001: A Space Odyssey's HAL 9000 and Eddie the shipboard computer in The Hitchhiker's Guide to the Galaxy. Entire operating systems, run by voice communication. Communication, not command. These voice user interfaces succeed where today's fail because:
- They are fully conversational. They don't require commands; instead, they extract intents from ongoing conversation.
- They not only sound human, their voices have idiosyncrasies and inconsistencies that, on a subconscious level, identify the speaker as a human to our ears.
- They are able to generate new conversational pathways on the fly, leading to conversations of virtually unlimited depth.
- They hold context of the conversation and, just like a human, can jump between topics and hop back and forth along the conversational timeline.
- They have unlimited data access privileges to the systems they operate and to the wider world knowledge base.
Since language is a framework for intelligence, I believe that only through conversational AI technologies will we come closer to artificial general intelligence. If you don't think we'll be there within the decade, we live in different worlds.
We no longer need iPhones. We have a small earpiece, connected to our cloud instance of a virtual assistant, with whom we carry on as we would with an old friend - one who understands our needs better and better with every day and can offload audio information directly to our ears and visual information to a tiny contact lens. Powered by nano processors, the technology gets enough of a charge out of the electricity naturally coursing through our bodies. There is the question of connectivity, but I have no doubt human genius will solve that as well. When we meet a person who speaks a language we do not know, the assistant replaces the audio channel of their speech with the translated version. The app that handles this is obviously called Babel Fish.
I, for one, have a lot more faith in the ability of the human race to persevere and excel, than in our capacity to destroy ourselves. We'll see how well this post ages in 2030. (Yeah, I created a Google Calendar event).
We are still in closed Beta for Dasha Conversational AI Studio. If you want to be part of the select cohort of developers building on what we hope will be the future of voice conversational technologies, go ahead and join our dev community to grab your API key.
See you in the future.