Machine Learning, Programming

A Rudimentary Voice Authentication System with Mobile Deployment

Speaker Verification, Android Deployment, Deep Learning, Web Service

OngKoonHan
Published in Towards AI
7 min read · May 4, 2020


Our Android App

For the group project component of my Android development course in university, our team built and deployed an authentication system that authenticates via a speaker’s voice profile.

With face masks now the norm during this Covid-19 season, an authentication system that relies on a person’s voice profile might be more useful than one that relies on facial recognition.

Overcoming facial recognition systems by covering half my face (Photo by Arisa Chattasa on Unsplash)

In this short article, I will describe the different parts of the voice authentication system and some design choices we made along the way.

Here is an overview of the article:

  • Voice-Auth Service Overview
  • User Registration Overview
  • User Authentication Overview
  • Challenges and Design Decisions
  • Demo Video

Most of the details will be about the high-level architecture and mobile app deployment.

Details about the Deep Learning model can be found in my other article here (Training A Rudimentary Speaker Verification Model With Contrastive Learning).

Voice-Auth Service Overview

The voice authentication system consists of a few main components:

Mobile App / Client — A mobile app that provides an authentication service. Think of this authentication service as something similar to the “password lock” or “pattern lock” on your Android phone, except that unlocking is done by speaking into the phone’s mic. It could theoretically be added on top of any other mobile application that needs an authentication function.

Voice Authentication Server — A web server that provides voice-based authentication. The web server hosts the Deep Learning (DL) model that gives the system its voice verification abilities. The DL model works by determining whether or not two input voice recordings are from the same person.

Voice Authentication Deep Learning Model — As with many other classification problems involving complex inputs (like a raw voice audio signal), the heavy lifting is done by Deep Learning. The Deep Learning (DL) model is trained offline and then deployed to the web server, which means it can be re-trained and pushed to the server at any time. More details of the DL model can be found in my other article here.
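To make the verification idea concrete, here is a minimal sketch of one common formulation: embed both recordings with the same encoder and threshold their similarity. The encoder, input shapes, similarity measure, and threshold below are illustrative placeholders, not the exact scoring our model uses (see the linked article for that).

```python
import torch
import torch.nn.functional as F

def same_speaker(encoder: torch.nn.Module,
                 ref_wave: torch.Tensor,
                 live_wave: torch.Tensor,
                 threshold: float = 0.7) -> bool:
    """Illustrative only: embed both utterances and compare the embeddings."""
    encoder.eval()
    with torch.no_grad():
        ref_emb = encoder(ref_wave.unsqueeze(0))    # (1, embedding_dim)
        live_emb = encoder(live_wave.unsqueeze(0))  # (1, embedding_dim)
        score = F.cosine_similarity(ref_emb, live_emb).item()
    return score > threshold
```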

User Registration Overview

As with all authentication services, a “password” for a given user needs to be registered with the system first.

For our system, the user first registers a profile and then provides a voice sample to be used as a reference during authentication later on.

  1. User profile registration (black)
  2. User voice reference capture (red)

The user registers a new profile on the Android app and provides some basic personal information (username, etc.), and the profile is saved to a Firebase database. The Android app then prompts the user to submit a voice sample (the reference sample), which is saved to Firebase Storage (Firebase’s file storage).
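For illustration, the registration flow boils down to two writes against Firebase. The app does this with the Android Firebase SDK; the Python sketch below uses the firebase_admin SDK as a rough server-side analogue, assumes Firestore as the database, and uses hypothetical collection, field, and path names.

```python
import firebase_admin
from firebase_admin import credentials, firestore, storage

# Hypothetical project configuration
cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(cred, {"storageBucket": "my-project.appspot.com"})

db = firestore.client()
bucket = storage.bucket()

def register_user(username: str, display_name: str, reference_wav_path: str) -> None:
    # 1. Save the basic profile to the database (step 1, black)
    db.collection("users").document(username).set({"displayName": display_name})
    # 2. Upload the reference voice sample to Firebase Storage (step 2, red)
    blob = bucket.blob(f"reference_samples/{username}.wav")
    blob.upload_from_filename(reference_wav_path)
```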

User Authentication Overview

As with all authentication services, a “password” is provided during authentication and the service checks whether the given password matches the stored reference password previously set by the user.

For our system, the user “logs in” to his registered profile and provides a live voice sample for authentication. The system compares this live voice sample against the previously provided reference voice sample and determines whether or not these two voice samples come from the same person.

  1. User profile and voice reference retrieval (black)
  2. User live voice capture and authentication (red)

The user “logs in” to his registered profile by providing his username on the Android app, and the app checks that the user exists in the Firebase database. The profile’s reference voice sample is then downloaded from Firebase Storage, and the user is prompted to provide a live voice sample. The Android app then sends both the reference and the live voice samples to the web server, where the DL model compares the two voice samples and determines whether or not they came from the same person. The positive or negative result from the DL model is then returned to the Android app.
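Conceptually, the call to the web server is just a two-file POST that returns a yes/no answer. The Android app makes this request natively; the endpoint path and JSON response fields in the sketch below are assumptions made for illustration.

```python
import requests

def verify_speaker(server_url: str, reference_wav: str, live_wav: str) -> bool:
    """Send the reference and live samples to the verification endpoint."""
    with open(reference_wav, "rb") as ref, open(live_wav, "rb") as live:
        files = {"reference": ref, "live": live}
        resp = requests.post(f"{server_url}/verify", files=files, timeout=30)
    resp.raise_for_status()
    return resp.json()["same_speaker"]  # hypothetical response field
```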

Challenges and Design Decisions

No software engineering project is free from challenges, and compromises are always made to balance competing objectives.

Choice of PyTorch over TensorFlow Lite

In the initial stages of the project, I actually started out building the DL model in Keras (TensorFlow). We soon discovered how difficult it would be to deploy our model with TensorFlow Lite in the Android environment. All the tutorials we found online seemed to use the pre-trained TensorFlow Lite models provided by Google, and we did not see any tutorials deploying a custom-built model. I also feared the dreaded situation of getting stuck on ops that were unsupported in the TensorFlow Lite runtime.

PyTorch, on the other hand, shows how to trace a model right off the bat on its website. Granted, tracing has its limitations, but it works when the data flow in the model is simple (in the sense that there is no data-dependent control flow) and you stick to PyTorch tensors and modules.
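As a rough sketch of why this felt approachable, exporting a traced model takes only a few lines. The tiny placeholder model and the example input shape below stand in for our actual speaker verifier.

```python
import torch
import torch.nn as nn

# Placeholder standing in for the trained speaker verification model
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1),
)
model.eval()

# Tracing records the ops executed on one example input, so it captures a
# single code path; that is fine when forward() has no data-dependent control flow.
example_input = torch.randn(1, 1, 128, 256)   # e.g. one mel-spectrogram "image" (shape is illustrative)
traced = torch.jit.trace(model, example_input)
traced.save("speaker_verifier_traced.pt")     # loadable from the PyTorch Android runtime
```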

I focused on conceptualizing the high-level architecture of my DL model and quickly trained one to test it on an Android environment. The fact that the basic model worked in the Android environment gave me the confidence to proceed with investing more time and mental energy to improve the model performance (while adhering to the high-level architecture).

As long as the trained model could run on Android, I could focus on the following, since the impact on the PyTorch tracing and export process was minimal (or none at all):

  • Playing with the learning rate
  • Stacking more layers in the classifier
  • Playing with the activation functions
  • Tweaking the data sampling method
  • Using different base models for my encoder (transfer learning)
  • etc.

Why a Web Service?

In our original design, the team wanted to build a fully native Android app to perform voice authentication. A few practical gaps pushed us away from that plan.

Lack of Audio Signal Facilities in Android — Android can read and play a multitude of media files and file formats, and it can record media input from the phone into a variety of formats as well.

The one critical thing I needed, which Android did not provide, was a way to decode an audio file into a raw audio signal (an array of samples) or byte stream. It didn’t help that the Javax Sound audio processing library is not available in Android’s Java subset.

After scrolling through endless websites on how to parse .wav files and how to manage sampling rates, we decided that this was not worth our time with the project deadline looming.

Lack of Signal Processing Libraries in Java — While building the data preprocessing pipeline for the DL model, I relied heavily on the Python LibROSA (Librosa) library. Librosa handles many audio processing tasks automatically, like downsampling or upsampling to the target sampling rate (critical, as the DL model analyzes the audio spectrogram) and the creation of the mel filter banks and mel spectrograms.
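To show what Librosa was doing for us, here is a minimal sketch of that kind of preprocessing. The sampling rate and mel-spectrogram parameters are illustrative, not the exact values from our pipeline.

```python
import librosa
import numpy as np

def wav_to_logmel(path: str, target_sr: int = 16000) -> np.ndarray:
    # librosa.load decodes the file AND resamples it to target_sr in one call
    signal, sr = librosa.load(path, sr=target_sr)
    # Mel filter banks and the mel spectrogram come from a single helper
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=128)
    return librosa.power_to_db(mel, ref=np.max)   # log-mel spectrogram
```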

We wanted to use the Chaquopy SDK to run our Librosa-based Python code inside the Android app, but Librosa was not properly supported by Chaquopy (NumPy is supported, but SciPy did not appear to be fully supported at the time).

While we did find GitHub projects that manually recreate “Librosa-like” functions in pure Java, the lack of good signal processing libraries in Java would still have forced us to handle many of the signal processing steps manually.

Web Service in Python — Ultimately, we abandoned the plan to deploy our model in the Android environment altogether.

Instead, we changed our approach and hosted the DL model on a web server powered by Flask. Since we could work in a Python environment, wrapping the DL model in a web service was very straightforward, and we focused on making our Android app interface with this web service instead. Managing files on Firebase Storage and in the local Android file storage is another challenge in and of itself, but a more manageable one.
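A minimal sketch of such a Flask wrapper is below. The route name, form-field names, preprocessing parameters, decision threshold, and the assumption that the traced model takes two spectrograms are all placeholders for illustration, not our exact service.

```python
import os
import tempfile

import librosa
import numpy as np
import torch
from flask import Flask, request, jsonify

app = Flask(__name__)
model = torch.jit.load("speaker_verifier_traced.pt")  # traced model from training
model.eval()

def preprocess(uploaded_file) -> torch.Tensor:
    # Persist the upload to a temp file so librosa can decode it
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        uploaded_file.save(tmp)          # werkzeug FileStorage.save accepts a file object
        tmp_path = tmp.name
    signal, sr = librosa.load(tmp_path, sr=16000)      # decode + resample
    os.remove(tmp_path)
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128)
    logmel = librosa.power_to_db(mel, ref=np.max)
    return torch.from_numpy(logmel).unsqueeze(0).unsqueeze(0).float()  # (1, 1, mels, frames)

@app.route("/verify", methods=["POST"])
def verify():
    ref_spec = preprocess(request.files["reference"])
    live_spec = preprocess(request.files["live"])
    with torch.no_grad():
        score = model(ref_spec, live_spec).item()      # assumes a two-input model
    return jsonify({"same_speaker": score > 0.5, "score": score})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```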

Because of these challenges, we were forced to decouple our voice authentication service and our Android authentication app, resulting in this architecture.

Model Size Limitations

The DL model was trained and hosted on my local machine which has a GPU with 3GB of VRAM. While this was enough to get the model trained and hosted for prediction, the size of the base model that I could use was limited.

Since we initially wanted to deploy the voice authentication DL model on the mobile phone itself, we started out with MobileNetV2 as the base model, a compact image model created by Google for use in resource-limited environments.

When we decided to host the DL model as a web service instead, I changed the base model to DenseNet121, the largest one that could fit on my (small) GPU. The more powerful base model improved the classification performance of the voice authentication DL model significantly, but I was ultimately limited by the GPU’s VRAM. Even larger models such as the bigger ResNet or ResNeXt variants could not be used, unfortunately.
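Swapping the image backbone used as the spectrogram encoder is largely a one-line change with torchvision. The sketch below shows the idea; the embedding size and the way the backbone was actually wired into our model are assumptions.

```python
import torch.nn as nn
from torchvision import models

def build_encoder(name: str = "densenet121", embedding_dim: int = 256) -> nn.Module:
    """Replace the ImageNet classifier head with an embedding layer."""
    if name == "mobilenet_v2":
        backbone = models.mobilenet_v2(pretrained=True)
        feat_dim = backbone.classifier[1].in_features   # 1280
        backbone.classifier = nn.Linear(feat_dim, embedding_dim)
    else:  # densenet121
        backbone = models.densenet121(pretrained=True)
        feat_dim = backbone.classifier.in_features      # 1024
        backbone.classifier = nn.Linear(feat_dim, embedding_dim)
    # Note: these ImageNet backbones expect 3-channel input, so a 1-channel
    # spectrogram needs its channel repeated or the first conv layer adapted.
    return backbone
```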

Demo Video

Here is a live demo that we recorded (pardon our Singaporean accents😄). Enjoy!

Credits to my team who put in an incredible amount of work and made the seemingly impossible possible: Ng Qing Hui, Gabriel Sim, He Yicheng
