An overview of the past decade in conversational AI
To identify the technical enablers driving the next generation of conversational AI, it helps to understand how they connect to key trends in previous generations. Below, we offer an overview of the past and present of conversational AI.
Voice-based interaction systems trace their roots back to the telephony era. Interactive voice response (IVR) customer service offered over the phone allowed companies to reduce their operating costs.
However, qualifying IVR systems as conversational AI would be inaccurate. The use of AI component technologies in IVR systems, such as automatic speech recognition (ASR) and natural language understanding (NLU), was largely confined to research labs. The practical difficulty of connecting these systems to the internet limited their skill set to accessing and updating information in local databases. The poor, and often jarring, IVR user experience paved the way for web- and mobile-based self-service applications.
In parallel, throughout the 1990s and 2000s, academic research in dialog systems thrived and created prototypes that showed glimpses of the first generation of conversational AI.
Siri represents the first generation of conversational AI. Many factors contributed to Apple's launch of Siri: the growing use of smartphones; greater network connectivity; a decade of impressive research in dialog systems, funded by the US government as well as the prominent industry labs of the time; the ease of integrating third-party services through improved web-based APIs; and, not least, the product-focused Apple of the 2000s.
However, several technical limitations and product choices came to define this first generation, and they set the stage for what has changed since. First and foremost, speech recognition technology in the late 2000s was not ready for wide consumer use: it struggled with difficult words and accents, and the choice of ASR providers was limited. Compared to using an app, the push-to-talk voice input interface demanded considerable user effort to complete a task. Siri was completely closed to app developers and locked into iPhones, which severely limited its functionality. The functionality Siri did have is best described as command and control. Conversational capabilities, typically characterized by multi-turn dialog, were largely limited to a follow-on confirmation.
Despite these limitations, Siri paved the way: an extensive consumer awareness campaign showcased all the things Siri could do, and users became willing to talk to machines again.
The second generation of conversational AI took its initial form as the smart speaker. Amazon’s Echo/Alexa line of products was defined by one key difference: the wakeword, a fresh take on the voice input interface rooted in the technical constraints of the time. Inspired by popular cultural depictions of voice-based human-machine interaction, this feature was promoted as a “natural” interface that appeared to reduce the friction of starting a voice interaction. By the late 2010s, smart speakers had found a place in over a third of American homes, helped by a favorable consumer disposition created by the first wave and reinforced by the second, and furthered by a series of bundling and discount-based promotions.
The technology and product choices of the second generation were a marked improvement. ASR accuracy rose as a result of a dramatic increase in the scale of model training data, as well as the success of neural-network-based input feature transforms. These improvements also made ASR more adept at recognizing accented speech.
The computational demands of ASR required speech recognition to be performed server-side; wakeword detection, on the other hand, could be ported to smaller devices. Like Siri, second-generation voice assistants rely on streaming user speech to their servers. Instead of a button press, wakeword detection triggers the stream of speech to the backend, where it is processed by a pipeline of recognition, interpretation, skill invocation and response generation components.
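The division of labor described above can be sketched in a few lines: a lightweight wakeword detector gates the heavier server-side pipeline. This is a minimal toy sketch, not any vendor's API; every function name and the stand-in logic (keyword matching for the wakeword, a single hard-coded intent) are illustrative assumptions.

```python
# Toy sketch of the second-generation voice assistant pipeline:
# on-device wakeword detection gates a server-side pipeline of
# recognition, interpretation, skill invocation and response generation.
# All names and logic are illustrative placeholders.

def detect_wakeword(frame: str) -> bool:
    """Stand-in for a small on-device model: fires on a keyword."""
    return "alexa" in frame.lower()

def recognize(audio: str) -> str:
    """Stand-in for server-side ASR; the 'audio' here is already text."""
    return audio.strip().lower()

def interpret(text: str) -> dict:
    """Toy NLU: map the utterance to an intent with slots."""
    if text.startswith("set a timer for"):
        duration = text.rsplit("for", 1)[1].strip()
        return {"intent": "SetTimer", "slots": {"duration": duration}}
    return {"intent": "Unknown", "slots": {}}

def invoke_skill(interp: dict) -> str:
    """Route the interpreted request to a skill and build a response."""
    if interp["intent"] == "SetTimer":
        return f"Timer set for {interp['slots']['duration']}."
    return "Sorry, I can't help with that."

def handle(frame: str, utterance: str):
    """The wakeword gates the pipeline: no wakeword, nothing is streamed."""
    if not detect_wakeword(frame):
        return None
    return invoke_skill(interpret(recognize(utterance)))
```

The key design point the sketch captures is that only `detect_wakeword` needs to run continuously on the device; everything after the gate can live on powerful servers.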
Unlike the first generation, Amazon Alexa and Google Home opened their platforms to voice skill developers, and supported third-party hardware manufacturers embedding the assistants in their devices. Voice skill developers quickly learned and embraced concepts such as invocation, intents, slots, responses and dialog state. With open platforms, however, came the challenge of sharing control of the voice assistant with developers. Platform providers chose to retain control of routing user requests to voice skills by centralizing natural language understanding (NLU). This approach created a bottleneck in the democratization of voice skill development, as desirable invocation phrases were reserved by platform providers and early developers. Furthermore, the platforms’ wakeword-based voice input created a greater degree of friction for conversational AI: users found themselves yelling the voice assistant’s name over and over, even when the topic of conversation had not changed. This user interface limitation kept voice assistants confined to command-and-control skills.
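The concepts developers embraced — invocation, intents, slots — are typically expressed as a declarative interaction model. The following is a hypothetical example showing the general shape such models take; it is not any platform's exact schema, and the names (`coffee finder`, `FindCafeIntent`, `CITY`) are invented for illustration.

```python
# Hypothetical interaction model for a voice skill, illustrating the
# invocation/intent/slot concepts. The shape is representative only;
# real platforms define their own schemas.
interaction_model = {
    "invocationName": "coffee finder",   # users say "ask coffee finder ..."
    "intents": [
        {
            "name": "FindCafeIntent",
            "samples": [                 # example utterances for NLU training
                "find a cafe near {location}",
                "where can I get coffee in {location}",
            ],
            "slots": [                   # typed values extracted at runtime
                {"name": "location", "type": "CITY"},
            ],
        },
    ],
}
```

Because the platform's centralized NLU matches user speech against invocation names like this one, a memorable phrase such as "coffee finder" becomes a scarce resource once claimed — the bottleneck noted above.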
Alongside the smart speaker, the second generation of conversational AI included two additional developments. Anticipating a potential paradigm shift, companies voice-enabled their apps and added chatbots to customer support channels. Voice enablement offered a differentiated experience to users and created strategically valuable resources (expertise, partnerships and data) that position these organizations well for when the UX paradigm shifts. The initial chatbot hype was ushered in by Facebook Messenger and Slack integrations. Most of these chatbots shared an outdated vision, and correspondingly the shortcomings, of IVRs. As the hype settled, norms emerged, and we now understand what chatbots can offer: instant, anytime fulfillment of some of the most common user requests, with consistency and compliance ensured. Chatbots increased operational efficiency within organizations and augmented their customer support.
The threads tying together all second-generation conversational AI are its technical enablers: the commoditization of state-of-the-art component technologies (ASR, NLU, TTS); the platformization that welcomed voice skill developers; the ease of integration through APIs; and the ability to deliver conversational AI experiences over a variety of channels (app, web, text, chat, embedded).
The technological landscape continues to shift. Access to the best AI components, combined with techniques for managing conversational complexity and addressing user privacy concerns, is empowering voice skill developers, and many more voice-enabled products are now viable at different stages of maturity. To learn more about what has changed, and how it affects your AI strategy, check out our outlook on the third generation of conversational AI.
How does conversational AI fit into your product strategy? Reach out to start an obligation-free, exploratory discussion.
You can email us directly: firstname.lastname@example.org or use this contact form.