Speaker Verification
This guide demonstrates how to perform Speaker Verification with Phonexia Speech Platform 4 Virtual Appliance with Voiceprint Extraction and Voiceprint Comparison.
The objective of Speaker Verification is to confirm or deny whether a speaker in a media file is the same person as they claim to be. We already have a media file with the voice of John Doe. Now, we want to verify that it's John Doe speaking in another media file. The process of verification consists of Voiceprint Extraction and Voiceprint Comparison. Both steps are explained in the following sections.
Attached, you will find a ZIP file named recordings.zip
containing two mono-channel media files – john_doe.wav and unknown.wav,
which will be used as examples throughout the guide.
At the end of this guide, you'll find the full Python code example combining all the steps, each of which is first discussed separately. This guide should give you a comprehensive understanding of how to perform Speaker Verification in your own projects.
Prerequisites
Follow the prerequisites for setup of Virtual Appliance and Python environment as described in the Task lifecycle code examples.
Run Voiceprint Extraction
To run Voiceprint Extraction for a single media file, you should start by
sending a POST request to the
/api/technology/speaker-identification-voiceprint-extraction
endpoint. file is the only mandatory parameter. In Python, you can do this as
follows:
import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speaker-identification-voiceprint-extraction"

media_file = "john_doe.wav"

with open(media_file, mode="rb") as file:
    files = {"file": file}
    start_task_response = requests.post(
        url=MEDIA_FILE_BASED_ENDPOINT_URL,
        files=files,
    )

print(start_task_response.status_code)  # Should print '202'
If the task was successfully accepted, a 202 status code is returned together
with a unique task ID in the response body. The task isn't processed
immediately; it is only scheduled for processing. You can check the current
task status while polling for the result.
Polling
To obtain the final result, periodically query the task status until the task
state changes to done, failed or rejected. The general polling procedure
is described in detail in the
Task lifecycle code examples.
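As a minimal sketch of that procedure (assuming the task status URL is the one returned in the `Location` header of the 202 response; the function name is illustrative), polling might look like:

```python
import time

import requests


def poll_until_finished(polling_url, polling_interval=5):
    """Query the task status endpoint until the task reaches a final state."""
    while True:
        response = requests.get(polling_url)
        response.raise_for_status()
        result = response.json()
        # "done", "failed" and "rejected" are the terminal task states.
        if result["task"]["state"] in {"done", "failed", "rejected"}:
            return result
        time.sleep(polling_interval)
```
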
Result for Voiceprint Extraction
The result field of the task contains a channels list of independent results
for each channel, identified by its channel_number. Each channel contains:

- voiceprint: A Base64-encoded string of the extracted voiceprint.
- speech_length: Length of the speech in seconds used for extraction.
- model: A string representing the model used for extraction.
Example task result of a successful Voiceprint Extraction from a stereo file (shortened for readability):
{
  "task": {"task_id": "fb9de4e5-a768-4069-aff3-c74c826f3ddf", "state": "done"},
  "result": {
    "channels": [
      {
        "channel_number": 0,
        "voiceprint": "e2kDY3JjbDAWiyhpCWVtYmVkZGluZ1tkO/QWvmS8JkuGZDyv+F5kvJQzJ...",
        "speech_length": 49.08,
        "model": "sid-xl5"
      },
      {
        "channel_number": 1,
        "voiceprint": "e2kDY3JjbFL6NSxpCWVtYmVkZGluZ1tkO2ygcGS8CduAZDxa6ZBkPHrYf...",
        "speech_length": 116.35,
        "model": "sid-xl5"
      }
    ]
  }
}
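For a multi-channel result like the one above, it can be handy to collect the voiceprints per channel. A small helper for that (the function name is illustrative):

```python
def voiceprints_by_channel(task_result):
    """Map each channel_number to its extracted voiceprint string."""
    return {
        channel["channel_number"]: channel["voiceprint"]
        for channel in task_result["result"]["channels"]
    }
```
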
Let's get back to our example with the mono-channel media files. In our case, the target voiceprint can be accessed as follows (from the polling step):
known_audio_voiceprint = polling_task_response_json["result"]["channels"][0]["voiceprint"]
Now, you can repeat the same process for the second file, just change
john_doe.wav to unknown.wav and store the resulting voiceprint in another
variable:
unknown_audio_voiceprint = polling_task_response_json["result"]["channels"][0]["voiceprint"]
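Because each voiceprint is a plain Base64 string, you can also persist it and reuse it in later comparisons without re-running the extraction. A minimal sketch (file paths and function names are illustrative):

```python
from pathlib import Path


def save_voiceprint(voiceprint, path):
    """Store a Base64-encoded voiceprint as a text file for later reuse."""
    Path(path).write_text(voiceprint)


def load_voiceprint(path):
    """Read a previously stored voiceprint back as a Base64 string."""
    return Path(path).read_text()
```
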
Run Voiceprint Comparison
When both voiceprints are extracted, we can move on to the actual comparison. Two non-empty lists of voiceprints are the mandatory fields in the request body. Each voiceprint is expected to be a Base64-encoded string, but you don't have to worry about that: the voiceprints are already returned in this format by the Voiceprint Extraction.
Doing the Voiceprint Comparison is analogous to Voiceprint Extraction: we start by requesting the Voiceprint Comparison task to be scheduled for our two voiceprint lists:
import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address
VOICEPRINT_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speaker-identification-voiceprint-comparison"

body = {
    "voiceprints_a": [known_audio_voiceprint],
    "voiceprints_b": [unknown_audio_voiceprint],
}

start_task_response = requests.post(
    url=VOICEPRINT_BASED_ENDPOINT_URL,
    json=body,
)

print(start_task_response.status_code)  # Should print '202'
Follow the polling process as described in Task lifecycle code examples to get the result of the Voiceprint Comparison.
Result for Voiceprint Comparison
The result field contains a row-major flat list of comparison scores,
representing a matrix whose shape is determined by the sizes of the two input
voiceprint lists. In the case of our Speaker Verification example, the
resulting list has a length of 1, representing a matrix with shape 1x1.
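In general, for lists of M and N voiceprints, the score comparing voiceprints_a[i] with voiceprints_b[j] sits at index i * columns_count + j of the flat values list. A small helper illustrating the row-major lookup (the function name is illustrative):

```python
def score_at(scores, row, column):
    """Return the score of voiceprints_a[row] vs voiceprints_b[column].

    `scores` is the `result["scores"]` object: a row-major flat `values`
    list together with `rows_count` and `columns_count`.
    """
    return scores["values"][row * scores["columns_count"] + column]
```
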
For our sample voiceprints, the task result should look as follows:
{
  "task": {
    "task_id": "e44557e1-94ba-4272-929a-8a5ec32f6e96",
    "state": "done"
  },
  "result": {
    "scores": {
      "rows_count": 1,
      "columns_count": 1,
      "values": [2.1514739990234375]
    }
  }
}
In this case, the resulting score is very high, so we can assume that it is
very likely John Doe who is also speaking in the unknown.wav file.
See scoring explained here.
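In an application, you would typically turn the score into an accept/reject decision by comparing it against a threshold. The threshold value below is purely illustrative and must be calibrated on your own data according to the scoring documentation:

```python
ACCEPT_THRESHOLD = 0.0  # Illustrative value only; calibrate on your own data


def is_same_speaker(score, threshold=ACCEPT_THRESHOLD):
    """Decide the verification outcome by thresholding the comparison score."""
    return score >= threshold
```
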
Full Python Code
Here is the full code for this example, slightly adjusted and wrapped into functions for better readability. Refer to the Task lifecycle code examples for a generic code template, applicable to all technologies.
import json
import time

import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speaker-identification-voiceprint-extraction"
VOICEPRINT_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speaker-identification-voiceprint-comparison"


def poll_result(polling_url, polling_interval=5):
    """Poll the task endpoint until processing completes."""
    while True:
        polling_task_response = requests.get(polling_url)
        polling_task_response.raise_for_status()
        polling_task_response_json = polling_task_response.json()
        task_state = polling_task_response_json["task"]["state"]
        if task_state in {"done", "failed", "rejected"}:
            break
        time.sleep(polling_interval)
    return polling_task_response


def run_media_based_task(media_file, params=None, config=None):
    """Create a media-based task and wait for results."""
    if params is None:
        params = {}
    if config is None:
        config = {}
    with open(media_file, mode="rb") as file:
        files = {"file": file}
        start_task_response = requests.post(
            url=MEDIA_FILE_BASED_ENDPOINT_URL,
            files=files,
            params=params,
            data={"config": json.dumps(config)},
        )
    start_task_response.raise_for_status()
    polling_url = start_task_response.headers["Location"]
    task_result = poll_result(polling_url)
    return task_result.json()


def run_voiceprint_based_task(json_payload):
    """Create a voiceprint-based task and wait for results."""
    start_task_response = requests.post(
        url=VOICEPRINT_BASED_ENDPOINT_URL,
        json=json_payload,
    )
    start_task_response.raise_for_status()
    polling_url = start_task_response.headers["Location"]
    task_result = poll_result(polling_url)
    return task_result.json()


known_audio = "john_doe.wav"
unknown_audio = "unknown.wav"

# Extract voiceprint for the known audio
known_audio_response = run_media_based_task(known_audio)
known_audio_voiceprint = known_audio_response["result"]["channels"][0]["voiceprint"]

# Extract voiceprint for the unknown audio
unknown_audio_response = run_media_based_task(unknown_audio)
unknown_audio_voiceprint = unknown_audio_response["result"]["channels"][0]["voiceprint"]

# Compare voiceprints
voiceprint_comparison_response = run_voiceprint_based_task(
    json_payload={
        "voiceprints_a": [known_audio_voiceprint],
        "voiceprints_b": [unknown_audio_voiceprint],
    }
)
print(voiceprint_comparison_response)