December 13, 2022

Explained: How HiJiffy’s voice assistant works

Get a better understanding of the technology behind our voice assistant

HiJiffy’s conversational AI excels at processing text messages and responding to all kinds of guest queries in the same form. As we always keep an eye on trends in communications, we have been developing our Aplysia OS further to bring powerful new features to guests and hoteliers.

Communicating through voice notes is gaining popularity at an astounding pace: as many as seven billion of them are sent daily on WhatsApp alone. In line with that, the expectation that businesses provide voice assistance is also growing. With the latest technological advances at HiJiffy, AI-powered voice assistants are now also available in hospitality.

What are voice assistants?

Voice assistants are already a reality in various industries, and people incorporate them into their everyday lives for convenience, comfort, and time-saving. For example, people use them to:

Make a call, send a message, and receive, open, and read messages;
Search for news, weather predictions, currency, and definitions;
Make notes and reminders;
Schedule and reschedule events;
Set an alarm, make a screen brighter, turn on/off Wi-Fi, or play music, among other standard screen functions;
Display the route from point A to point B in navigation searches;
Navigate through leisure: find fun things to do in the city, movies to watch, and weekend getaway destinations.

Voice assistants or voicebots are a subset of conversational agents powered by multimodal AI that can interpret natural human speech and answer with an artificial (yet human-sounding) voice. Voice assistants can hold conversations and provide answers using voice recognition, artificial intelligence, and natural language processing (NLP).

Say hello to the new voice in your hotel

The mission of HiJiffy is to better connect hotels with their guests by developing the most advanced conversational AI for hospitality. Our voice assistant understands guest requests made in audio format, such as questions about check-in and check-out times or the opening hours of the hotel spa and restaurant. The voice assistant will be able to use its existing knowledge to deliver answers not only by text but also through voice messages.

How does a voice assistant system work?

In the following sections, we will take a closer look at the essential components of the voice assistant system (also known as an interactive voice response system):

Speech-to-Text (STT)
Text-to-Speech (TTS)
Decision making
Architecture

These are the four key elements a functional voice assistant system needs, although other, smaller processes enhance it further.

Our voice assistant is based on the architecture of HiJiffy and Aplysia OS, creating a solid foundation that will allow users to access the tools and features already existing in our Guest Communications Hub.


Our voice assistant handles a request in the following steps:

Receiving audio from the user.
Transforming the audio into text (STT).
Predicting the best response to the text (Decision making).
Transforming the response into audio (TTS).
Returning the answer to the user.
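
The five steps above can be sketched as a small pipeline. This is only an illustrative sketch, not HiJiffy's implementation: the `VoicePipeline` class and the three stand-in functions are hypothetical names, and the stand-ins simply pass bytes and strings around so the example runs without any speech models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains the three processing stages between receiving and returning audio."""
    speech_to_text: Callable[[bytes], str]   # STT: audio -> text
    decide: Callable[[str], str]             # Decision making: text -> reply text
    text_to_speech: Callable[[str], bytes]   # TTS: reply text -> audio

    def handle(self, audio_in: bytes) -> bytes:
        text = self.speech_to_text(audio_in)   # step 2
        reply = self.decide(text)              # step 3
        return self.text_to_speech(reply)      # step 4


# Stand-in stages (hypothetical): real systems would call trained models here.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_decide(text: str) -> str:
    if "check-in" in text.lower():
        return "Check-in opens at <time set by the hotel>."
    return "Let me connect you with the front desk."


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_stt, fake_decide, fake_tts)
reply_audio = pipeline.handle(b"What time is check-in?")  # steps 1 and 5
```

Keeping each stage behind a plain callable means any one of them could be swapped for a different model without touching the other two.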

Speech-to-Text (STT)

Turning audio files or spoken input from a microphone into text is known as speech-to-text. An ideal STT system should be able to “perceive” the given input (audio), “recognise” the spoken words, and then return the recognised words as output (final text).

Many models and variants are available; here we describe a generic, widely used one. It follows the classical statistical approach and consists of three key components.


Extraction of features
- Obtaining different features, such as power, pitch, and vocal tract configuration, from the audio. In this way, it is possible to identify the essential parts of the audio, separating speech from background noise and other irrelevant information.
Acoustic model
- Turning the extracted features into a statistical parametric speech model, predicting which sound unit each segment of the waveform corresponds to, typically at the phoneme or character level.
Language model
- Determining whether word combinations are feasible. The language model relies on grammar principles and on the probabilities that specific words appear together in sentences.
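
To make the language model component more concrete, here is a minimal sketch of how word-combination probabilities can be used to choose between two candidate transcriptions that sound alike. It assumes a simple bigram model with add-one smoothing, which is one classical option and not necessarily what any production system uses; the tiny corpus, the function names, and the candidates are all made up for illustration.

```python
import math
from collections import Counter

# Tiny made-up corpus; a real language model is trained on far more text.
CORPUS = [
    "what time is check in",
    "what time is check out",
    "is the spa open",
    "what time is breakfast",
]


def train_bigrams(sentences):
    """Count how often each word follows another ("<s>" marks sentence start)."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        vocab.update(words[1:])
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, contexts, bigrams, vocab_size):
    """Add-one smoothed bigram log-probability of a candidate transcription."""
    words = ["<s>"] + sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size))
    return total


contexts, bigrams, vocab_size = train_bigrams(CORPUS)

# Two candidates an acoustic model might consider equally plausible.
candidates = ["what time is check in", "what thyme is cheque in"]
best = max(candidates, key=lambda c: log_prob(c, contexts, bigrams, vocab_size))
```

Here `best` is the first candidate, because its word pairs were seen in the corpus while pairs such as "what thyme" were not.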

There are other approaches available; this is just an example to demonstrate how to get a text from audio.

Text-to-Speech (TTS)

The inverse of speech-to-text conversion is text-to-speech, a process that models natural language and converts text into speech for audio presentation. The most recent TTS models have the following structure:


Text preprocessing and normalisation
- The preparatory step for the input text, which is converted into linguistic features of the target language in the form of a vector that is fed into the acoustic model.
Acoustic model
- Conversion of the preprocessed/normalised text into a sequence of waveform blocks, which together form the voice of the voice assistant.

In this way, a computer can reproduce voice through text. Technology has advanced so much that it is possible to clone voices; for instance, to generate a voice that sounds exactly like yours so that a voice assistant can use it.
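
As a rough illustration of the text preprocessing and normalisation step, the sketch below expands a few abbreviations and spells out small numbers so that every token is something that can be pronounced. It covers only the normalisation part, not the conversion into feature vectors, and the lookup tables and function names are assumptions made for this example rather than a real lexicon.

```python
import re

# Deliberately tiny, illustrative lookup tables (assumptions, not a real lexicon).
ABBREVIATIONS = {"st.": "street", "dr.": "doctor", "approx.": "approximately"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out 0..99; larger numbers are out of scope for this sketch."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalise(text: str) -> str:
    """Lowercase, expand known abbreviations, spell out one- and two-digit numbers."""
    tokens = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[token])
            continue
        word = token.strip(".,!?;:")  # drop surrounding punctuation
        if re.fullmatch(r"\d{1,2}", word):
            tokens.append(number_to_words(int(word)))
        elif word:
            tokens.append(word)
    return " ".join(tokens)


normalised = normalise("Breakfast is served until 10 on floor 2.")
# normalised == "breakfast is served until ten on floor two"
```

A production system would also need to handle dates, currencies, times, and language-specific rules, which is why this step is usually specific to the target language.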

Decision Making

To decide the most appropriate answer to the user’s message, one must first grasp the substance of the message that the user has sent. Like HiJiffy’s chatbot, the voice assistant will employ Aplysia’s NLP and our hospitality-optimised models to determine the best answer to the guest’s request.
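
The idea of first understanding the message and then choosing an answer can be shown with a toy example. The sketch below is not how Aplysia's NLP works; it swaps in a very simple technique (word overlap between the message and a few example phrases per intent) purely to illustrate the flow. The intents, answers, threshold, and function names are all hypothetical.

```python
import re

# Hypothetical intents with placeholder answers; not real hotel data.
INTENTS = {
    "check_in_time": {
        "examples": ["what time is check in", "when can i check in"],
        "answer": "Check-in time is <configured by the hotel>.",
    },
    "spa_hours": {
        "examples": ["when is the spa open", "spa opening hours"],
        "answer": "Spa opening hours are <configured by the hotel>.",
    },
}
FALLBACK = "Sorry, I did not understand. Let me connect you with the hotel staff."


def tokenise(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def best_intent(message, threshold=0.3):
    """Return the best-matching intent name, or None if nothing is close enough."""
    words = tokenise(message)
    scored = [
        (max(jaccard(words, tokenise(ex)) for ex in intent["examples"]), name)
        for name, intent in INTENTS.items()
    ]
    score, name = max(scored)
    return name if score >= threshold else None


def answer(message):
    name = best_intent(message)
    return INTENTS[name]["answer"] if name else FALLBACK
```

For instance, `best_intent("Is the spa open today?")` returns `"spa_hours"`, while an unrelated question such as "Can I bring my dog?" falls below the threshold and gets the fallback reply.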

Architecture

The voice assistant is built on the HiJiffy architecture, so all the processes and functionality that our conversational AI currently provides in the chatbot will also be available in the voice assistant. One of them is the ability to determine the language spoken by the user.

In addition to the main components described above, some smaller operations improve our voice assistant’s functionality; for example, choosing a specific voice for your hotel’s voice assistant to best match the brand.

In conclusion, the HiJiffy architecture, distinguished for its reliability and quality, and Aplysia OS, which delivers all the underlying innovation, will serve as the foundation for our voice assistant.

Senior AI Engineer
