Guidance on Developing APIs Supporting Voice Assistance Applications
This document provides recommendations and guidance on the design and implementation of Application Programming Interfaces (APIs) that enable voice assistant technology such as Amazon Alexa and Google Assistant. The intended audience is technical practitioners such as developers and architects. This is not a guidance or policy document for the development of the voice assistance applications themselves; development teams looking to build these types of applications should confirm against applicable legislation, policies and guidelines (e.g., accessibility, privacy, official languages, security).
- What is voice assistant technology? Voice assistants are commonly thought of as dedicated devices such as the Amazon Echo Dot or Google Home speaker. The technology extends beyond this hardware into multiple form factors, including mobile apps available for Android and iOS phones. Given the rate of innovation in the space, it is best to think of voice assistants in terms of their function rather than the form of the device.
- Why is it important? Smart speaker purchases are on the rise, and these dedicated devices are becoming more ubiquitous every day amongst average users. From an accessibility standpoint, voice assistants are seeing rapid adoption to assist not only with information gathering but also with smart-home functionality (e.g., light and window-shade controls). Canadians are exploring this technology, and they expect Government of Canada (GC) services to do the same.
- When should I use it? Voice technology is well suited to targeted inquiries, such as checking the weather or playing music. In a GC service context, web information should already follow proper formatting conventions so that voice technology can read web pages aloud. Informational searches (e.g., recalls and safety alerts) as well as process guides (e.g., how to renew a passport) are prime opportunities for voice assistant technology. At this time, the GC is not prepared to engage at the level of identity-bound accounts.
- Should I replace my existing service delivery with voice assistants? Voice assistant technology is still being adopted. As such, it should not be used as a replacement for existing service delivery methods. Instead, voice assistants should be leveraged as a complement to an overall multi-channel delivery approach, meeting Canadians where they are and helping to reduce congestion at existing delivery points.
1. Key Components
Building applications for voice assistants can seem intimidating, but the Google and Amazon clients are well suited to iterative development and collaborative design. While there are other voice assistant technologies, at the time of writing, the research conducted for this paper covered only the two major platforms with mature development ecosystems.
- Devices and the Product Client - Whether working with Amazon Alexa or Google Assistant, the process begins in the same way: with device-specific requests. These requests are passed via the internet to the appropriate product client, hosted in Google's GCP Speech API or Amazon's Alexa Skill Platform. The product clients consume this information and shape requests out to the specified Voice Assistant Service. When the Voice Assistant Service replies, its responses are passed back to the devices via the same product clients. Because the product clients are built and maintained by the manufacturers (i.e., Google and Amazon), developers do not need to worry about connectivity to individual devices.
- The Voice Assistant Service - A custom webhook API should sit between the product client and the source data API. This webhook API is referred to here as the Voice Assistant Service. It acts as a translator and shapes the user experience to be more natural and less mechanical (e.g., pauses, speech cadence, inflection). As a result, the following components should be included in any webhook service design (a minimal sketch of such a service follows this list):
- Multilingual Support - The language parameter should be extracted from the product client request headers to perform the correct backend requests to the Source API (e.g., a French-language request should receive a response with French-language data).
- SSML Support - The spoken language of voice assistants is Speech Synthesis Markup Language (SSML). The service should therefore package responses in SSML format for a smoother user experience. See section 2 for more information about SSML support.
- Pagination Support - If a result is a collection of items, the recommended standard is that the user hears one result at a time and can then decide to hear more items or move on to ask something else. This support is not always part of data APIs and needs to be built into the service.
- Caching Support - Caching information when possible is recommended to improve the responsiveness of the service. This support is especially important when data results include large collections.
- The Source API - Voice assistant applications are ultimately data driven, and the source API is where they retrieve that data. Your source API should be designed using a contract-first methodology: focus on creating an atomic and clear data structure that can be unpacked by any consumer to suit its needs, whether that consumer is a mobile app, a website, or a voice assistant service. The API should adhere to the Government of Canada Standards on APIs and have a clearly defined data model. Reusable API construction demands clean data in a clear structure that can be consumed by a diverse array of clients.
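The sketch below illustrates how the components above might fit together in a single webhook endpoint. It is a minimal, platform-neutral example only: the use of Flask, the source API URL, and the request and response shapes are all illustrative assumptions, not a prescribed implementation or any platform's actual schema.

```python
# A minimal sketch of a Voice Assistant Service webhook showing multilingual,
# pagination, and caching support. Flask, the source API URL, and the payload
# shapes are hypothetical; real product clients define their own schemas.
from functools import lru_cache

import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
SOURCE_API = "https://api.example.gc.ca/recalls"  # hypothetical source API


@lru_cache(maxsize=128)
def fetch_results(query: str, lang: str) -> tuple:
    """Caching support: query the source API once per (query, lang) pair."""
    resp = requests.get(SOURCE_API, params={"q": query, "lang": lang}, timeout=5)
    resp.raise_for_status()
    return tuple(item["title"] for item in resp.json()["results"])


@app.route("/webhook", methods=["POST"])
def webhook():
    body = request.get_json()
    # Multilingual support: derive the language from the product client request.
    lang = request.headers.get("Accept-Language", "en")[:2]
    query = body.get("query", "")
    page = int(body.get("page", 0))

    results = fetch_results(query, lang)
    # Pagination support: read back one result at a time.
    if page < len(results):
        more = ("Would you like to hear another?" if lang == "en"
                else "En voulez-vous entendre un autre?")
        text = f"{results[page]}. {more}"
    else:
        text = ("That is every result I have." if lang == "en"
                else "C'est tout ce que j'ai trouvé.")

    # SSML support: wrap the response with pacing cues before returning it.
    ssml = f'<speak>{text}<break time="300ms"/></speak>'
    return jsonify({"ssml": ssml, "next_page": page + 1})
```

In a real deployment, the thin platform layer described in section 5 would translate each product client's native request format into the simple shape this core handler expects.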
2. Using SSML
Text-to-speech technology has existed for a long time; voice assistants leverage it, and more, to mimic actual conversational dynamics. To simulate human speech patterns, SSML defines a syntax for conversation flow, such as pauses and rate of speech. To deliver results in a clear and conversational manner, the webhook service should return responses using SSML, particularly for data such as times, dates, currencies, and numbers. SSML implementation should be thoroughly tested against a wide variety of sample data sets.
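As a small illustration, the helper below packages a number and a date in SSML. The `<speak>`, `<break>`, and `<say-as>` elements are standard SSML supported by both major platforms; the helper function itself and its output shape are hypothetical examples, not a required pattern.

```python
# Hypothetical helper that wraps structured data in SSML so the assistant
# pronounces numbers and dates naturally rather than reading raw strings.
def to_ssml(count: int, date_ymd: str) -> str:
    return (
        "<speak>"
        f'There are <say-as interpret-as="cardinal">{count}</say-as> recalls. '
        '<break time="200ms"/>'
        "The most recent was issued on "
        f'<say-as interpret-as="date" format="ymd">{date_ymd}</say-as>.'
        "</speak>"
    )

print(to_ssml(3, "2024-03-31"))
```

Without the `<say-as>` markup, "2024-03-31" may be read as a string of digits; with it, the assistant renders a spoken date, which is exactly the kind of behaviour that should be verified across many sample data sets.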
3. Translation
Normally, translation is performed in isolation, with the translator extrapolating context from the document alone. Translation is more complicated for voice assistants: conversations are non-linear, so much of the context essential to translation is lost. The best results are achieved by having a translator sit with a developer and run through scenarios. After the changes are made, the translator should be consulted for another iteration through a variety of scenarios to confirm that all the text still makes sense. Trade-offs may be needed where conversation flows differ between English and French: adopting a common flow to simplify the solution may make interactions in one language more awkward than in the other, while creating the absolute best experience may result in effectively separate solutions (i.e., flows) for English and French.
4. Invocation Management
Testing will quickly reveal invocation phrases that users consider common but the application does not handle, and this gap grows significantly once a voice application goes live. Application owners should plan to have someone regularly check the product clients for invocation phrases that are not handled by the app. Both Google and Amazon make it easy to track this anonymous data and quickly integrate it into the invocation flow. These updates are essential not just for convenience, but also to ensure that applications remain accessible to users who might not use standard speech patterns to interact with voice assistants. Accessibility testers should be brought in during beta testing to make sure the voice assistant can be invoked by as many users as possible.
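Beyond the reporting built into the product clients, the webhook itself can record unmatched utterances for this regular review. The sketch below is a hypothetical fallback handler; the function name and fields are illustrative, not a platform-defined interface.

```python
# Hypothetical fallback handler: record unmatched utterances so the team can
# review them and extend the invocation model. Field names are illustrative.
import logging

logger = logging.getLogger("unhandled_invocations")


def handle_fallback(utterance: str, lang: str) -> str:
    # Log anonymously: keep only the phrase and language, never a user identifier.
    logger.info("unmatched utterance lang=%s text=%r", lang, utterance)
    if lang == "fr":
        return "Désolé, je n'ai pas compris. Pouvez-vous reformuler?"
    return "Sorry, I did not catch that. Could you rephrase?"
```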
5. Building for both Amazon and Google
Typically, users should have the same experience across different voice assistant devices. Both Google and Amazon voice assistants also have the same requirements in terms of multilingual support and SSML response format. The main difference between the various voice assistants is the Software Development Kits (SDKs) that make the integration between the product clients and the voice-enabled APIs possible. This commonality means most of the voice-enabled API service can be written once for all targeted voice assistant technologies. A thin-layer service contract in the webhook service should handle communication with the product clients; this thin layer can be implemented for any supported product while the core of the webhook remains consistent across all platforms (a sketch of this pattern follows the list below).
- Google Developer Speech Tools and Hosting - For Google, the conversation requests and phrases (called intents) are managed in Dialogflow. Google requires that this component be deployed within its product client and hosted on Google Cloud Platform (GCP), for which the owner must pay the related hosting fees. This GCP Speech API connects to the Voice Assistant Service, which can be hosted anywhere.
- Amazon Developer Speech Tools and Hosting - Amazon wraps all of its development within the Alexa Developer Portal. Note that this portal is separate from the Amazon Web Services (AWS) cloud service and requires a different account. The resulting conversation and phrases are deployed to the Alexa Skill Platform, which does not require the owner to pay any hosting fees. The Alexa Skill Platform connects to the Voice Assistant Service, which can also be hosted anywhere.
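The sketch below shows one way the thin-layer service contract might be structured: a platform-agnostic core is written once, and each product client gets a small adapter that maps its request and response format onto that core. The class names are assumptions, and the payload field paths are simplified for illustration rather than copied from the actual SDK schemas.

```python
# A sketch of the thin platform layer: shared core logic written once, with a
# small adapter per product client. Payload field paths are simplified
# assumptions, not the exact Alexa or Dialogflow schemas.
from abc import ABC, abstractmethod


def core_handler(query: str, lang: str) -> str:
    """Platform-agnostic core: returns an SSML response for any client."""
    return f"<speak>Here is the result for {query} in {lang}.</speak>"


class ProductClientAdapter(ABC):
    @abstractmethod
    def extract(self, payload: dict) -> tuple[str, str]:
        """Pull (query, language) out of the client's native request."""

    @abstractmethod
    def package(self, ssml: str) -> dict:
        """Wrap SSML in the client's native response envelope."""

    def handle(self, payload: dict) -> dict:
        query, lang = self.extract(payload)
        return self.package(core_handler(query, lang))


class AlexaAdapter(ProductClientAdapter):
    def extract(self, payload):
        # Simplified: real Alexa requests carry intents and slots.
        return payload["request"]["query"], payload["request"]["locale"][:2]

    def package(self, ssml):
        return {"response": {"outputSpeech": {"type": "SSML", "ssml": ssml}}}


class DialogflowAdapter(ProductClientAdapter):
    def extract(self, payload):
        qr = payload["queryResult"]
        return qr["queryText"], qr["languageCode"][:2]

    def package(self, ssml):
        return {"fulfillmentText": ssml}
```

With this shape, supporting an additional voice assistant platform means writing one more adapter, while the core webhook logic, and its multilingual, SSML, pagination, and caching behaviour, stays identical across products.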