Speaker Verification

This guide demonstrates how to perform Speaker Verification with Phonexia Speech Platform 4 Virtual Appliance with Voiceprint Extraction and Voiceprint Comparison.

The objective of Speaker Verification is to confirm or deny whether a speaker in a media file is the same person as they claim to be. We already have a media file with the voice of John Doe. Now, we want to verify that it's John Doe speaking in another media file. The process of verification consists of Voiceprint Extraction and Voiceprint Comparison. Both steps are explained in the following sections.

Attached, you will find a ZIP file named recordings.zip containing two mono-channel media files – john_doe.wav and unknown.wav, which will be used as examples throughout the guide.

At the end of this guide, you'll find the full Python code example that combines all the steps, which are first discussed separately. This guide should give you a comprehensive understanding of how to perform Speaker Verification in your own projects.

Prerequisites

Follow the prerequisites for setup of Virtual Appliance and Python environment as described in the Task lifecycle code examples.

Run Voiceprint Extraction

To run Voiceprint Extraction for a single media file, start by sending a POST request to the /api/technology/speaker-identification-voiceprint-extraction endpoint. The file field is the only mandatory parameter. In Python, you can do this as follows:

import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speaker-identification-voiceprint-extraction"

media_file = "john_doe.wav"

with open(media_file, mode="rb") as file:
    files = {"file": file}
    start_task_response = requests.post(
        url=MEDIA_FILE_BASED_ENDPOINT_URL,
        files=files,
    )
print(start_task_response.status_code)  # Should print '202'

If the task was successfully accepted, a 202 code is returned together with a unique task ID in the response body. The task isn't processed immediately; it is only scheduled for processing. You can check the current task status while polling for the result.
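For example, you can inspect the response body for the task ID and read the polling URL from the Location header. The Location header is the same one the full code at the end of this guide relies on:

start_task_response_json = start_task_response.json()
print(start_task_response_json)  # Contains the unique task ID

polling_url = start_task_response.headers["Location"]
print(polling_url)  # URL to poll for the task status and result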

Polling

To obtain the final result, periodically query the task status until the task state changes to done, failed or rejected. The general polling procedure is described in detail in the Task lifecycle code examples.
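As a minimal sketch of that procedure (using the polling_url obtained above), the loop below mirrors the poll_result helper from the full code at the end of this guide:

import time

# Query the task status until it reaches a final state.
while True:
    polling_task_response = requests.get(polling_url)
    polling_task_response.raise_for_status()
    polling_task_response_json = polling_task_response.json()
    task_state = polling_task_response_json["task"]["state"]
    if task_state in {"done", "failed", "rejected"}:
        break
    time.sleep(5)  # Polling interval in seconds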

Result for Voiceprint Extraction

The result field of the task contains a channels list with an independent result for each channel, identified by its channel_number. Each channel entry contains the following fields, which you can access as shown in the sketch after this list:

  • voiceprint: A Base64-encoded string of the extracted voiceprint.
  • speech_length: Length of the speech in seconds used for extraction.
  • model: A string representing the model used for extraction.
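As a small sketch, assuming the polling_task_response_json variable from the polling step above, you can inspect these fields per channel:

# Print the metadata of each channel in the extraction result.
for channel in polling_task_response_json["result"]["channels"]:
    print(channel["channel_number"], channel["speech_length"], channel["model"])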

Example task result of a successful Voiceprint Extraction from a stereo file (shortened for readability):

{
  "task": {"task_id": "fb9de4e5-a768-4069-aff3-c74c826f3ddf", "state": "done"},
  "result": {
    "channels": [
      {
        "channel_number": 0,
        "voiceprint": "e2kDY3JjbDAWiyhpCWVtYmVkZGluZ1tkO/QWvmS8JkuGZDyv+F5kvJQzJ...",
        "speech_length": 49.08,
        "model": "sid-xl5"
      },
      {
        "channel_number": 1,
        "voiceprint": "e2kDY3JjbFL6NSxpCWVtYmVkZGluZ1tkO2ygcGS8CduAZDxa6ZBkPHrYf...",
        "speech_length": 116.35,
        "model": "sid-xl5"
      }
    ]
  }
}

Let's get back to our example with the mono-channel media files. In our case, the target voiceprint can be accessed as follows (from the polling step):

known_audio_voiceprint = polling_task_response_json["result"]["channels"][0]["voiceprint"]

Now, you can repeat the same process for the second file, just change john_doe.wav to unknown.wav and store the resulting voiceprint in another variable:

unknown_audio_voiceprint = polling_task_response_json["result"]["channels"][0]["voiceprint"]

Run Voiceprint Comparison

When both voiceprints are extracted, we can move on to the actual comparison. Two non-empty lists of voiceprints are the mandatory fields in the request body. Each voiceprint is expected to be a Base64-encoded string, but you don't have to worry about it -- the voiceprints are already returned in this format from the Voiceprint Extraction.

Doing the Voiceprint Comparison is analogous to Voiceprint Extraction -- we start by requesting the Voiceprint Comparison task to be scheduled for our two voiceprint lists:

import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address
VOICEPRINT_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speaker-identification-voiceprint-comparison"

body = {
    "voiceprints_a": [known_audio_voiceprint],
    "voiceprints_b": [unknown_audio_voiceprint],
}

start_task_response = requests.post(
    url=VOICEPRINT_BASED_ENDPOINT_URL,
    json=body,
)
print(start_task_response.status_code)  # Should print '202'

Follow the polling process as described in Task lifecycle code examples to get the result of the Voiceprint Comparison.

Result for Voiceprint Comparison

The result field contains a flat, row-major list of comparison scores, representing a matrix whose shape is given by the sizes of the two input voiceprint lists. In the case of our Speaker Verification, the resulting list has length 1, representing a 1x1 matrix.

For our sample voiceprints, the task result should look as follows:

{
  "task": {
    "task_id": "e44557e1-94ba-4272-929a-8a5ec32f6e96",
    "state": "done"
  },
  "result": {
    "scores": {
      "rows_count": 1,
      "columns_count": 1,
      "values": [2.1514739990234375]
    }
  }
}
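To read individual scores back out of this structure, a small helper works. This is a sketch assuming rows correspond to voiceprints_a and columns to voiceprints_b; verify against rows_count and columns_count for larger comparisons:

scores = polling_task_response_json["result"]["scores"]

def get_score(scores, i, j):
    # Row-major layout: the score for row i, column j is at index i * columns_count + j.
    return scores["values"][i * scores["columns_count"] + j]

print(get_score(scores, 0, 0))  # The only score in our 1x1 case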

In this case, we can see that the resulting score is very high, so we can assume that it is very likely John Doe speaking in the unknown.wav file as well. See scoring explained here.
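Purely as an illustration, a verification decision could be derived by comparing the score against a threshold. The threshold below is hypothetical; choose a real value based on the scoring documentation linked above:

score = polling_task_response_json["result"]["scores"]["values"][0]

THRESHOLD = 0.0  # Hypothetical value for illustration; calibrate using the scoring documentation

if score >= THRESHOLD:
    print(f"Score {score}: likely the same speaker.")
else:
    print(f"Score {score}: likely different speakers.")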

Full Python Code

Here is the full code for this example, slightly adjusted and wrapped into functions for better readability. Refer to the Task lifecycle code examples for a generic code template, applicable to all technologies.

import json
import requests
import time

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address

MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speaker-identification-voiceprint-extraction"
VOICEPRINT_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speaker-identification-voiceprint-comparison"


def poll_result(polling_url, polling_interval=5):
    """Poll the task endpoint until processing completes."""
    while True:
        polling_task_response = requests.get(polling_url)
        polling_task_response.raise_for_status()
        polling_task_response_json = polling_task_response.json()
        task_state = polling_task_response_json["task"]["state"]
        if task_state in {"done", "failed", "rejected"}:
            break
        time.sleep(polling_interval)
    return polling_task_response


def run_media_based_task(media_file, params=None, config=None):
    """Create a media-based task and wait for results."""
    if params is None:
        params = {}
    if config is None:
        config = {}

    with open(media_file, mode="rb") as file:
        files = {"file": file}
        start_task_response = requests.post(
            url=MEDIA_FILE_BASED_ENDPOINT_URL,
            files=files,
            params=params,
            data={"config": json.dumps(config)},
        )
    start_task_response.raise_for_status()
    polling_url = start_task_response.headers["Location"]
    task_result = poll_result(polling_url)
    return task_result.json()


def run_voiceprint_based_task(json_payload):
    """Create a voiceprint-based task and wait for results."""
    start_task_response = requests.post(
        url=VOICEPRINT_BASED_ENDPOINT_URL,
        json=json_payload,
    )
    start_task_response.raise_for_status()
    polling_url = start_task_response.headers["Location"]
    task_result = poll_result(polling_url)
    return task_result.json()


known_audio = "john_doe.wav"
unknown_audio = "unknown.wav"

# Extract voiceprint for the known audio
known_audio_response = run_media_based_task(known_audio)
known_audio_voiceprint = known_audio_response["result"]["channels"][0]["voiceprint"]

# Extract voiceprint for the unknown audio
unknown_audio_response = run_media_based_task(unknown_audio)
unknown_audio_voiceprint = unknown_audio_response["result"]["channels"][0]["voiceprint"]

# Compare voiceprints
voiceprint_comparison_response = run_voiceprint_based_task(
    json_payload={
        "voiceprints_a": [known_audio_voiceprint],
        "voiceprints_b": [unknown_audio_voiceprint],
    }
)
print(voiceprint_comparison_response)