VoiceThesis

The Third Generation of Conversational AI

Voice-enabled products are reducing friction and maximizing utility, all while being more conscious of end-user data privacy concerns.

Highlights

Overcome technical barriers by using publicly available components
Improve user experience and offer greater data privacy with on-device AI
Empower teams to take advantage of the open conversational AI ecosystem
Manage the increase in complexity as your voice skills grow
Consider voice UI designs that reduce user effort

After a decade that produced two generations of voice assistants, the technological enablers that will power the next wave of conversational AI are now evident. Speech recognition accuracy has improved, and it can increasingly be delivered in ways that respect the concerns of privacy-conscious users. The ecosystem of voice AI components and tools is more open, which enables conversational interfaces suited to many more situations. Developers are no longer gated by the limitations of routing based on invocation phrases. We can now build voice-enabled applications that not only delight users with one-off novel experiences, but also create conveniences that keep users coming back.

Commoditization

AI components for conversational AI are commoditized. Specifically, there are plenty of good choices for “speech-to-text”, i.e., automatic speech recognition (ASR), and for text-to-speech (TTS). These components are publicly available from cloud AI services, technology vendors and open-source developers.

Out-of-the-box ASR accuracy has improved to the point that developers can start building most applications without worrying about the intricacies of the models running under the hood, and there is a growing number of good options. All major cloud AI providers host speech recognition services, and many offer serverless streaming APIs. These providers support a choice of spoken languages, and their features are comparable and come with pay-per-use pricing.

For those who need a greater degree of control over their data and costs at scale, multiple off-the-shelf solutions for self-hosted ASR are available. Models pre-trained on public datasets are freely available and offer highly competitive performance in many cases. Customizations, for those who need them, are also possible. The same applies to TTS: voices that closely mimic human speech under normal speaking conditions are available in many languages, from both cloud-based services and self-hosted solutions.

Another necessary component for conversational AI is natural language understanding (NLU). Today’s ASR and TTS components have web-scale vocabularies, which makes their output specification practically universal. In contrast, NLU intents and slots are still heavily application dependent, so NLU needs to be customized for every application. Fortunately, tools for creating custom NLUs are plentiful. Cloud AI providers offer web-based interfaces to build and host custom NLU services, and both open-source and commercial NLU tools are available. Using pre-trained neural language embeddings, these tools can typically achieve high performance with only a small amount of training data annotated with intents and slots.

Developers should follow these best practices to take advantage of the trend towards the commoditization of conversational AI components:

State-of-the-art speech and language processing components are accessible. Take your pick. Mix and match.
The return on investment for developing ASR and TTS from scratch will be elusive. Build products using ready-to-use components, at least at first.
Avoid getting locked into a single AI component choice. Systematically evaluate the components and switch if another offers better performance.
Commoditized components give you greater control over conversational AI architecture. Build for your application, not for parity with platform providers.
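
As an illustration of systematically evaluating interchangeable components, the following minimal sketch ranks candidate ASR engines by word error rate (WER) on a small labeled test set. It assumes each engine can be wrapped as a function from an audio clip identifier to a transcript; the `Transcriber` type, the `rank_components` helper and the engine names in the usage note are hypothetical and do not refer to any vendor API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical wrapper type: takes an audio clip identifier, returns a transcript.
Transcriber = Callable[[str], str]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Row-by-row dynamic programming over the edit-distance table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def rank_components(
    candidates: Dict[str, Transcriber],
    test_set: List[Tuple[str, str]],
) -> List[Tuple[str, float]]:
    """Return (name, mean WER) pairs, best (lowest WER) first."""
    scores = []
    for name, transcribe in candidates.items():
        wers = [word_error_rate(ref, transcribe(clip)) for clip, ref in test_set]
        scores.append((name, sum(wers) / len(wers)))
    return sorted(scores, key=lambda item: item[1])
```

In use, each cloud or self-hosted engine would be wrapped in a `Transcriber`, and `rank_components({"engine_a": ..., "engine_b": ...}, test_set)` re-run whenever a new option appears. Keeping the application code dependent only on the wrapper type is what makes switching cheap.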

On-device AI

Consumer hardware platforms and operating systems are increasingly AI-enabled. Both Android and iOS devices are starting to offer high-performing, on-device speech recognition (i.e., no need to stream speech to the cloud). Not only does this allow developers to eliminate recurring ASR costs, but it also offers the end user greater speech data privacy, as speech no longer needs to leave the device.

While this trend is still in early stages, once rooted, it will dramatically change the architecture and interaction design of conversational systems. This change will be a key characteristic of the third generation of conversational AI. Implications of this trend include:

Extremely low latency - Conversational AI will be capable of reacting noticeably faster. Once users are exposed to this reduced latency, the response delay of today’s smart speakers will become unacceptable. Interaction design will evolve to react often, but in subtle ways that minimize the user’s cognitive load and maximize utility.

Continuous listening - This is the next step in the progression from push-to-talk voice input to wakeword detection. Once platforms support it, continuous listening will further reduce friction in voice interaction and practically eliminate unnecessary activation effort for users.

Environment sensing - A side effect of hardware and operating systems optimized for continuous listening will be audio-based environment awareness. As audio is processed on-device, users’ privacy is preserved. Environment sensing will offer contextual cues to conversational AI about where the user is, what the user is doing and who/what else is around (e.g., if the TV is on).

Ongoing conversation - Interaction design will shift from performing momentary tasks to maintaining long-term relationships between the AI-enabled system and the user. The system will be capable of taking recurring cues from past interactions to inform future responses.

The trend to bring AI capabilities closer to users will be further propelled by edge computing, which 5G infrastructure will operationalize. New software frameworks will aim to guarantee the availability of high-performing perceptual intelligence (i.e., audio and video understanding) in one hop or less, using a combination of on-device and at-edge computation. Hardware and operating systems optimized for sensing and perceptual intelligence will disrupt app development.

Takeaways

Application architectures that can use on-device ASR, NLU and TTS should be considered when available.
Employ interaction design that allows users to jump right to expressing their intent and that minimizes or eliminates voice AI activation effort.
On-device implementation of cognition that requires minimal connectivity should be explored.
Be aware that on-device AI, while ensuring greater privacy, also reduces developer access to end-user data. Establish quality control and monitoring processes that do not rely on logging large amounts of end-user data.
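
One way to approach the last takeaway is to monitor aggregate counters instead of raw transcripts or audio. The sketch below is a minimal illustration under that assumption; the `AggregateMonitor` class and its fields are hypothetical and are not part of any platform SDK.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AggregateMonitor:
    """Tracks quality signals without storing utterance text or audio."""
    intent_counts: Counter = field(default_factory=Counter)
    fallback_count: int = 0
    total_turns: int = 0
    confidence_sum: float = 0.0

    def record_turn(self, intent: str, confidence: float, fell_back: bool) -> None:
        # Only the predicted intent label, a confidence score and a fallback
        # flag are recorded; the user's words never reach this method.
        self.total_turns += 1
        self.intent_counts[intent] += 1
        self.confidence_sum += confidence
        if fell_back:
            self.fallback_count += 1

    def summary(self) -> dict:
        """Aggregate report that could be uploaded in place of raw logs."""
        if self.total_turns == 0:
            return {"total_turns": 0}
        return {
            "total_turns": self.total_turns,
            "mean_confidence": self.confidence_sum / self.total_turns,
            "fallback_rate": self.fallback_count / self.total_turns,
            "intent_counts": dict(self.intent_counts),
        }
```

A rising fallback rate or a falling mean confidence can flag a quality regression without any individual utterance being logged. Whether intent labels alone are sufficiently privacy-preserving depends on the application and is left as an open design question here.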

Dev-ocratization

From opening up platforms to sharing tools, developers are being empowered with greater control over the capabilities of conversational AI. The primary indicator of dev-ocratization is the increase in the choice of tools and platforms available to developers. Platform providers offer immediate access to a potentially large user base. Tools and frameworks serve developers who want to add conversational agents to mobile apps, customer service channels, web and custom hardware. Because of this pull towards tools and frameworks, platforms will continue to open up, allow custom integration and give developers more choice.

In fact, developers already have more control over three aspects of conversational AI:

Initiation - Platforms unify the way voice interactions are initiated. Whether it is push-button or wakeword, platform-controlled initiation gates the funnel of users engaging with skills. While it’s good to have a common way of starting interactions, the friction of prevailing initiation modes is limiting engagement. To address this, custom conversational AI systems are exploring alternate modes of initiation, including speaking anytime with continuous listening, directing gaze, gesturing and system-initiated pull. Conversational systems can sense their environment and infer whom the user is addressing. These modes reduce initiation friction and will gain wide adoption in third-generation conversational AI systems and platforms.

Fulfillment - The flow of control to skills has been strictly dictated by platforms. Preferred partners are favored for the high-frequency use cases, such as weather and music, and the discoverability of skills is limited. As the set of available skills grows, the challenge of routing user requests to the appropriate fulfillment becomes unwieldy. This same challenge shows up when using voice AI development tools that require fulfillment modules to follow specific frameworks (e.g., slot filling). Addressing this challenge requires shifting from routing-based to ranking-based request fulfillment, a shift that has been evident in recent developments. Of particular interest is the multi-expert dialog management approach, which reduces the burden of routing for the developer. As a result, users can effortlessly switch between skills related to the current conversational topic. Organizations seeking voice enablement of their products can empower multiple teams to create feature-specific skills, and unify them all through the multi-expert approach without an exponential increase in complexity.

Integration - In order to increase system awareness and offer wider delivery, conversational AI must be integrated with additional sensory channels, as well as multiple delivery channels. Users should be able to continue conversations across channels, switching between app, web and chat. This type of ubiquitous conversational experience allows your products to reach users wherever they are without being locked into a device. Custom conversational AI products that use additional signals to increase the system’s visual, location, environmental and network awareness are able to reduce the communicative and cognitive burdens on the user.

While there are many choices empowering conversational AI developers, at a high level there are two different types of dev-ocratization trends in play: authoring and programming. The former aims to simplify the most frequent use cases by providing tools and templates, and the latter takes the training wheels off, allowing developers to create limitless variations in conversational AI capabilities. We have seen analogous approaches in previous interaction paradigms, such as web and app development. For those choosing between these two approaches, the defined scope of the product should be the guide. The choice will often come down to prioritizing speed of delivery or ability to innovate.

Conversations: Not an after-thought

Even though voice-enabled interactions are popularly referred to as conversational AI, most mainstream platforms have offered hardly any conversational capabilities. Where we do see glimpses of multi-turn interaction, they usually take the form of confirmation prompts. This is because the traditional architectures of these systems have been rooted in the metaphor of controlling devices by voice command (i.e., voice command control).

In these systems, NLU intents have a one-to-one mapping to a fulfillment behavior (i.e., an action or a response), which simplifies their control logic by reducing it to lookup-based routing. This works reasonably well for voice command control systems. As an after-thought, the ability to retain information from recent conversational turns is added to such systems (e.g., by passing dialog state through headers). However, soon after, the lookup-based intent routing approach starts to limit the ability to conduct multi-turn interaction, as the complexity of specifying and maintaining such routes for a large number of states becomes unmanageable.
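
To make the scaling problem concrete, here is a minimal sketch of lookup-based routing. The intent names, dialog states and handler functions are hypothetical and chosen only for illustration. The first table maps each intent to exactly one handler; once dialog state is added as an after-thought, the route key becomes a (state, intent) pair and the number of routes to specify can grow with the product of the two.

```python
from typing import Callable, Dict, Tuple

Slots = Dict[str, str]
Handler = Callable[[Slots], str]

FALLBACK = "Sorry, I can't help with that."


def report_weather(slots: Slots) -> str:
    return f"Here is the weather for {slots.get('city', 'your location')}."


def play_music(slots: Slots) -> str:
    return f"Playing {slots.get('artist', 'your mix')}."


def confirm_order(slots: Slots) -> str:
    return "Order confirmed."


def cancel_order(slots: Slots) -> str:
    return "Order cancelled."


# Voice command control: one intent maps to exactly one fulfillment behavior.
INTENT_ROUTES: Dict[str, Handler] = {
    "get_weather": report_weather,
    "play_music": play_music,
}


def route(intent: str, slots: Slots) -> str:
    handler = INTENT_ROUTES.get(intent)
    return handler(slots) if handler else FALLBACK


# Multi-turn bolted on: every (dialog state, intent) pair needs its own route.
STATEFUL_ROUTES: Dict[Tuple[str, str], Handler] = {
    ("idle", "get_weather"): report_weather,
    ("idle", "play_music"): play_music,
    ("awaiting_confirmation", "affirm"): confirm_order,
    ("awaiting_confirmation", "deny"): cancel_order,
}


def route_with_state(state: str, intent: str, slots: Slots) -> str:
    handler = STATEFUL_ROUTES.get((state, intent))
    return handler(slots) if handler else FALLBACK


def worst_case_route_count(num_states: int, num_intents: int) -> int:
    """Upper bound on routes a developer may need to specify and maintain."""
    return num_states * num_intents
```

Note that in this sketch a user who asks about the weather while the system is awaiting a confirmation falls through to the fallback unless that pair is also added to the table, which is exactly the kind of maintenance burden described above.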

Third-generation conversational AI approaches the fundamental requirements of multi-turn interaction from the ground up. Many of the efforts to build truly conversational capabilities are underway now and are publicly known. These capabilities allow users to have ongoing conversations that span topics and time. Within a topic-specific skill, the function of multi-turn interaction will be elevated from confirming with the user to assisting the user in achieving their task goals. Information shared across skills increases the system’s awareness of the user’s goals and allows users to perform complex tasks. Conversational UI enabled with social interaction capabilities fills in the gaps and helps create an ongoing conversation with many task-specific threads.

Common to all of these capabilities are the principles of multi-expert conversation management. Complex system behaviors are decomposed into a collection of precision-tuned experts, and the conversation management machinery that previously relied on intent-based routing is replaced by expert response selection. The switch to multi-expert approaches will not only upgrade existing conversational skills, but also encourage future skill designers and developers to focus on creating skills that serve tasks precisely (i.e., no false positives). Consequently, skills will automatically work in cooperation, and cases where the collection of skills is unable to serve the user’s request will be delegated to conversation management.
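
The idea of expert response selection can be sketched as follows. This is an illustrative simplification, not a description of any particular framework: the `Candidate` type, the keyword-based experts, the confidence values and the threshold are all assumptions made for the example. Each expert either proposes a response with a confidence score or abstains, and the conversation manager ranks the proposals instead of routing on a single intent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    expert: str
    response: str
    confidence: float


# An expert inspects the utterance and either proposes a response or abstains.
Expert = Callable[[str], Optional[Candidate]]


def weather_expert(utterance: str) -> Optional[Candidate]:
    if "weather" in utterance.lower():
        return Candidate("weather", "It is sunny today.", 0.9)
    return None  # Abstain rather than risk a false positive.


def music_expert(utterance: str) -> Optional[Candidate]:
    if "play" in utterance.lower():
        return Candidate("music", "Playing your mix.", 0.8)
    return None


def select_response(
    utterance: str,
    experts: List[Expert],
    threshold: float = 0.5,
) -> Candidate:
    """Rank all expert proposals and return the most confident viable one."""
    proposals = [expert(utterance) for expert in experts]
    viable = [c for c in proposals if c is not None and c.confidence >= threshold]
    if not viable:
        # No expert can serve the request: delegate to conversation management.
        return Candidate("conversation_manager", "Sorry, I can't help with that yet.", 0.0)
    return max(viable, key=lambda c: c.confidence)
```

Adding a new skill in this sketch means appending one more expert to the list; no route table has to be edited, which is where the reduced routing burden comes from. In practice the confidence scores would come from trained models rather than keyword checks.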

Shape & form

Web-based apps maximize reach and mobile-based apps are tuned to increase retention. As a UI paradigm, voice hasn’t yet been able to do better than either web or mobile-based apps. In theory, smart speakers have the potential for great reach and extraordinary retention. The platforms (Alexa, Google, HomePod) have invested in solving the reach problem by promoting the product and incentivizing users to place one of these devices in every space; however, the skill discovery challenges limit the reach for all but the most popular skills. On the other hand, the retention problem is primarily due to friction (i.e., user effort to get tasks done).

The voice thesis states that the use of voice UI will increase as user effort is reduced. Each of the past waves of conversational AI has increased the adoption of voice-enabled systems by delivering greater convenience to users. Third-generation conversational AI will further the shift towards the voice UI paradigm. In this article, we have outlined the technical enablers of this paradigm shift. Organizations seeking to strategically position themselves for the voice UI paradigm shift should include these technical capabilities in their innovation roadmap.

As was the case with Siri (the first generation) and Alexa (the second generation), third-generation conversational AI will lead to innovative products and features. In verticals where customers expect conversational interaction (e.g., healthcare, retail banking), businesses are offering services based on AI. These services are motivated not by the goal of reducing costs, but by the goal of offering a more consistent, always-available and, in some cases, safer user experience. At the same time, the third wave will create new quantifiable efficiencies. We will also see novel uses of conversational AI, some of which will become mainstream. Beyond interactive systems, offline uses of conversational AI for voice data analytics and content processing (e.g., understanding, organizing and generating media) will power numerous end-user-facing capabilities as well as business operations.

The great beyond

Glimpses of capabilities that will become synonymous with conversational AI are present already. As the third wave plays out over the next decade, we will see additional technical enablers for the fourth and fifth generations of conversational AI systems. Consumer hardware that is ready to run AI models and advances in end-to-end dialog modeling approaches will elevate the user experience by increasing reactiveness and variability. A shift towards sensory augmentation of our environments (home, workplace, automobiles) will not only increase the contextual awareness of voice skills, but also bring multi-modality and multi-user applications into focus.

Contact

Understanding the trends that are creating the next generation of conversational AI can help you make strategic decisions and prepare your products to make the most of the next growth phase. VoiceThesis helps organizations accelerate their AI strategy. We create client ROI by demystifying technology and solving the technical challenges that stand between clients and market-relevant products and services.

Reach out to start an obligation-free, exploratory discussion. You can email us directly at contact@voicethesis.com or use the contact form.
