Guidance on Developing APIs Supporting Voice Assistance Applications

On this page

  1. Key Components
  2. Using SSML
  3. Translation
  4. Invocation Management
  5. Building for both Amazon and Google

This document provides recommendations and guidance on the design and implementation of Application Programming Interfaces (APIs) that enable voice assistant technology such as Amazon Alexa and Google Assistant. The intended audience is technical practitioners such as developers and architects. It is by no means a guidance or policy document for the development of the voice assistance applications themselves; development teams looking to build these types of applications should confirm against applicable legislation, policies and guidelines (e.g., accessibility, privacy, official languages, security).

1. Key Components

Building applications for voice assistants can seem intimidating, but the Google and Amazon clients are well suited to iterative development and collaborative design. While other voice assistant technologies exist, at the time of writing the research conducted to produce this paper covered only the two major platforms with mature development ecosystems.

2. Using SSML

Text-to-speech technology has existed for some time. Voice assistants build on this technology and more to mimic actual conversational dynamics. To simulate human speech patterns, the Speech Synthesis Markup Language (SSML) was developed to define a syntax for conversation flow, such as pauses and rate of speech. To deliver results in a clear and conversational manner, the webhook service should return responses using SSML, particularly for data such as times, dates, currencies, and numbers. The SSML implementation should be thoroughly tested against a wide variety of sample data sets.
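As a minimal sketch, a webhook helper might wrap raw data in SSML markup so a date is read out naturally rather than digit by digit. The function name and response shape below are illustrative, not part of any real SDK; the `<say-as>` and `<break>` elements are standard SSML supported by both Alexa and Google.

```python
def ssml_response(date_iso: str, amount: float) -> str:
    """Return an SSML string that reads a date and a number conversationally."""
    return (
        "<speak>"
        "Your next payment of "
        f'<say-as interpret-as="cardinal">{amount:.2f}</say-as> dollars '
        'is due on <say-as interpret-as="date" format="ymd">'
        f"{date_iso}</say-as>."
        '<break time="300ms"/>'  # short pause before the follow-up question
        "Would you like a reminder?"
        "</speak>"
    )
```

Without the `interpret-as` hints, a synthesizer may read "2024-07-01" as a string of digits; with them, it is spoken as a date.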

3. Translation

Normally, translation is performed in isolation, with the translator extrapolating context from the document alone. Translation is more complicated for voice assistants: conversations are non-linear, so much of the context essential to translation is lost. The best results are achieved by having a translator sit with a developer and run through scenarios. Then, after the changes are made, the translator should be consulted for another iteration through a variety of scenarios to confirm that all the text still makes sense. Trade-offs may be needed where the conversation flows vary between English and French. Adopting a common flow to simplify the solution may make interactions in one language more awkward than in the other, while creating the absolute best experience may result in effectively separate solutions (i.e. flows) for English and French.
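One common way to keep translated responses maintainable is a locale-keyed message catalogue that the webhook consults using the locale reported in each request. The structure below is a simplified sketch with illustrative keys and messages, not a prescribed format.

```python
# Locale-keyed response catalogue with a fallback locale. The keys and
# messages are illustrative examples only.
MESSAGES = {
    "en-CA": {
        "welcome": "Welcome. What would you like to check?",
        "reprompt": "You can ask about wait times or office hours.",
    },
    "fr-CA": {
        "welcome": "Bienvenue. Que souhaitez-vous vérifier?",
        "reprompt": "Vous pouvez demander les temps d'attente ou les heures d'ouverture.",
    },
}

def message_for(locale: str, key: str) -> str:
    """Look up a message, falling back to en-CA for unknown locales or keys."""
    catalogue = MESSAGES.get(locale, MESSAGES["en-CA"])
    return catalogue.get(key, MESSAGES["en-CA"][key])
```

Keeping each locale's messages in one place also makes the translator-and-developer iteration described above easier: the translator can review a complete flow per language rather than scattered strings.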

4. Invocation Management

Testing will very quickly reveal invocation phrases that users consider common but the application does not handle, and this gap grows rapidly once a voice application goes live. Application owners should plan for someone to regularly check the product clients for invocation phrases that are not handled by the app. Both Google and Amazon have made it easy to keep track of this anonymous data and quickly integrate it into the invocation flow. These updates are essential not just for convenience, but also to ensure that applications are accessible to users who might not use standard speech patterns to interact with voice assistants. Accessibility testers should be brought in during beta testing to make sure the voice assistant can be invoked by as many users as possible.
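The review loop above can be supported in the webhook itself by routing unmatched utterances through a fallback handler that records what users actually said. The sketch below is a hypothetical in-memory version; a real service would persist these counts to a data store for the team to review.

```python
from collections import Counter

# Tally of utterances the application could not match (illustrative,
# in-memory only; a production webhook would persist this).
unmatched_phrases: Counter = Counter()

def handle_fallback(utterance: str) -> str:
    """Record the unmatched phrase, then reprompt the user."""
    unmatched_phrases[utterance.strip().lower()] += 1
    return "Sorry, I didn't catch that. You can ask about wait times or office hours."

def top_gaps(n: int = 10):
    """Most frequently missed phrases -- candidates for new invocation phrases."""
    return unmatched_phrases.most_common(n)
```

Reviewing `top_gaps()` regularly gives the team a prioritized list of phrases to fold back into the invocation model.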

5. Building for both Amazon and Google

Typically, users should have the same experience across different voice assistant devices. In addition, both the Google and Amazon voice assistants have the same requirements in terms of multilingual support and SSML response format. The main difference between the various voice assistants is the Software Development Kits (SDKs) that make the integration between the product clients and the voice-enabled APIs possible. This commonality means most of the voice-enabled API service should be written once for all targeted voice assistant technologies. There should be a thin-layer service contract in the webhook service to enable communication with the product clients. This thin layer can be implemented for any supported product, while the core of the webhook remains consistent across all platforms.
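The thin-layer pattern can be sketched as per-platform adapters that normalize each client's request into one internal shape, with a single platform-agnostic core behind them. The payload field names below are simplified stand-ins, not the real Alexa or Google request schemas, and the intent name is illustrative.

```python
from dataclasses import dataclass

@dataclass
class VoiceRequest:
    """Internal, platform-neutral request model."""
    intent: str
    locale: str

def from_alexa(payload: dict) -> VoiceRequest:
    # Thin adapter: field paths are simplified stand-ins for the real schema.
    return VoiceRequest(
        intent=payload["request"]["intent"]["name"],
        locale=payload["request"]["locale"],
    )

def from_google(payload: dict) -> VoiceRequest:
    # Thin adapter for the other platform; same output model.
    return VoiceRequest(
        intent=payload["intent"]["name"],
        locale=payload["session"]["languageCode"],
    )

def handle(req: VoiceRequest) -> str:
    """Platform-agnostic core: identical behaviour for every client."""
    if req.intent == "OfficeHours":
        return "We are open from nine to five, Monday to Friday."
    return "Sorry, I can't help with that yet."
```

Only the two adapter functions change per platform; the core `handle` logic, and any SSML and translation layers behind it, are written once.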
