Speaker Diarization

This guide demonstrates how to perform Speaker Diarization with the Phonexia Speech Platform 4 Virtual Appliance. You can find a high-level description in the Speaker Diarization article. The technology splits audio into segments according to who is speaking.

For testing, we'll be using the following media files. You can download them all together in the audio_files.zip archive.

filename	channels	number of speakers per channel (comma-separated)
Kathryn_Paula.wav	mono	2
Laura_Harry_Veronika.wav	stereo	1, 2
Laura_Ivy_Kathryn.wav	mono	3
Veronika_Harry.wav	mono	2

At the end of this guide, you'll find a full Python code example that combines all the steps, which are first discussed separately. This guide should give you a comprehensive understanding of how to integrate Speaker Diarization into your own projects.

Prerequisites

Follow the prerequisites for setup of Virtual Appliance and Python environment as described in the Task lifecycle code examples.

Run Speaker Diarization

To run Speaker Diarization for a single media file, start by sending a POST request to the /api/technology/speaker-diarization endpoint. The only mandatory parameter is file. In Python, you can do this as follows:

import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speaker-diarization"

media_file = "Kathryn_Paula.wav"

with open(media_file, mode="rb") as file:
    files = {"file": file}
    start_task_response = requests.post(
        url=MEDIA_FILE_BASED_ENDPOINT_URL,
        files=files,
    )
print(start_task_response.status_code)  # Should print '202'

If the task has been successfully accepted, the 202 code will be returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result.
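
The full example at the end of this guide reads the polling URL from the Location header of the 202 response. A minimal sketch of that step follows; the helper name get_polling_url is hypothetical, and the sketch assumes the header is present whenever the task is accepted.

```python
def get_polling_url(status_code, headers):
    """Return the polling URL of an accepted task.

    Hypothetical helper: it assumes the 202 response carries the polling
    URL in the 'Location' header, as the full example in this guide does.
    """
    if status_code != 202:
        raise RuntimeError(f"Task was not accepted (HTTP {status_code}).")
    try:
        return headers["Location"]
    except KeyError:
        raise RuntimeError("Response has no 'Location' header.") from None
```

With the request above, it could be called as get_polling_url(start_task_response.status_code, start_task_response.headers).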

Polling

To obtain the final result, periodically query the task status until the task state changes to done, failed or rejected. The general polling procedure is described in detail in the Task lifecycle code examples.
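
As a rough sketch of this procedure, the loop below keeps querying until one of the three final states is reached. The function name wait_for_final_state, the fetch_task callable, and the timeout are assumptions of this sketch and not part of the API; the callable is injected so the loop can be tried without a running Virtual Appliance.

```python
import time


def wait_for_final_state(fetch_task, polling_interval=5.0, timeout=600.0):
    """Call fetch_task() repeatedly until the task reaches a final state.

    fetch_task is any zero-argument callable returning the task JSON as a
    dict, for example: lambda: requests.get(polling_url).json()
    The timeout is an assumption of this sketch, not an API feature.
    """
    final_states = {"done", "failed", "rejected"}
    deadline = time.monotonic() + timeout
    while True:
        task_json = fetch_task()
        if task_json["task"]["state"] in final_states:
            return task_json
        if time.monotonic() >= deadline:
            raise TimeoutError("Task did not finish within the timeout.")
        time.sleep(polling_interval)
```

A client-side timeout like this is one way to avoid waiting forever if a task never leaves a non-final state.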

Result for Speaker Diarization

The result field of the task contains a channels list with an independent result for each channel, identified by its channel_number. Each channel contains a speakers_count field, indicating the number of speakers detected in the channel, and a segments list of diarized segments. Each segment has a speaker_id, a start_time and an end_time.

For our sample file, the task result should look as follows (shortened for readability):

{
  "task": {
    "task_id": "123e4567-e89b-12d3-a456-426614174000",
    "state": "done"
  },
  "result": {
    "channels": [
      {
        "channel_number": 0,
        "speakers_count": 2,
        "segments": [
          {
            "speaker_id": 0,
            "start_time": 1.51,
            "end_time": 13.99
          },
          {
            "speaker_id": 1,
            "start_time": 13.99,
            "end_time": 14.12
          },
          {
            "speaker_id": 1,
            "start_time": 14.63,
            "end_time": 25.51
          },
          {
            "speaker_id": 1,
            "start_time": 25.7,
            "end_time": 25.94
          },
          ...
        ]
      }
    ]
  }
}
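
The segments can be post-processed directly from this structure. As a small sketch, the function below sums the speaking time of each speaker in each channel. The helper name speaking_time_per_speaker is made up for illustration, and the sketch assumes the task JSON has the shape shown above.

```python
from collections import defaultdict


def speaking_time_per_speaker(task_json):
    """Sum segment durations per (channel_number, speaker_id) pair."""
    totals = defaultdict(float)
    for channel in task_json["result"]["channels"]:
        for segment in channel["segments"]:
            key = (channel["channel_number"], segment["speaker_id"])
            totals[key] += segment["end_time"] - segment["start_time"]
    return dict(totals)
```

The returned dictionary maps (channel_number, speaker_id) to the total duration of that speaker's segments, in the same unit as start_time and end_time.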

Run Speaker Diarization with Parameters

If you want more control over the result, you can use the config field of the request body, in which you can specify one of two mutually exclusive fields: max_speakers or total_speakers. The max_speakers field defines the upper bound on how many speakers may be identified in the media file, which can make the result more accurate. If max_speakers is not set, the technology uses the default value of 100.

On the other hand, the total_speakers field sets the exact value of how many speakers must be identified. Use this field only if you are sure about the number of speakers in the media file, because the technology always diarizes the media file accordingly, even if the actual number of speakers is different.

You can use the config payload in the POST request as follows:

import json
import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speaker-diarization"

config = {"config": json.dumps({"max_speakers": 5})}

# or
# config = {"config": json.dumps({"total_speakers": 2})}

media_file = "Kathryn_Paula.wav"

with open(media_file, mode="rb") as file:
    files = {"file": file}
    start_task_response = requests.post(
        url=MEDIA_FILE_BASED_ENDPOINT_URL,
        data=config,
        files=files,
    )
print(start_task_response.status_code)  # Should print '202'
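
Because max_speakers and total_speakers are mutually exclusive, it may help to validate the choice on the client before sending the request. The sketch below is one way to do that; build_config is a hypothetical helper, and it produces the same form data shape as the config variable above.

```python
import json


def build_config(max_speakers=None, total_speakers=None):
    """Build the 'config' form field; hypothetical client-side helper."""
    if max_speakers is not None and total_speakers is not None:
        raise ValueError("Set either max_speakers or total_speakers, not both.")
    config = {}
    if max_speakers is not None:
        config["max_speakers"] = max_speakers
    if total_speakers is not None:
        config["total_speakers"] = total_speakers
    # The config is sent as a JSON string inside the form data.
    return {"config": json.dumps(config)}
```

The result can be passed as the data argument of requests.post, for example data=build_config(max_speakers=5).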

The polling steps and the result format are the same as described in the Run Speaker Diarization section.

Full Python code

Here is a full example of how to run the Speaker Diarization technology. The code is slightly adjusted and wrapped into functions for better readability. Refer to the Task lifecycle code examples for a generic code template applicable to all technologies.

import json
import requests
import time

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address

MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speaker-diarization"


def poll_result(polling_url, polling_interval=5):
    """Poll the task endpoint until processing completes."""
    while True:
        polling_task_response = requests.get(polling_url)
        polling_task_response.raise_for_status()
        polling_task_response_json = polling_task_response.json()
        task_state = polling_task_response_json["task"]["state"]
        if task_state in {"done", "failed", "rejected"}:
            break
        time.sleep(polling_interval)
    return polling_task_response


def run_media_based_task(media_file, params=None, config=None):
    """Create a media-based task and wait for results."""
    if params is None:
        params = {}
    if config is None:
        config = {}

    with open(media_file, mode="rb") as file:
        files = {"file": file}
        start_task_response = requests.post(
            url=MEDIA_FILE_BASED_ENDPOINT_URL,
            files=files,
            params=params,
            data={"config": json.dumps(config)},
        )
        start_task_response.raise_for_status()
    polling_url = start_task_response.headers["Location"]
    task_result = poll_result(polling_url)
    return task_result.json()


# Run Speaker Diarization
media_files = [
    "Kathryn_Paula.wav",
    "Laura_Harry_Veronika.wav",
    "Laura_Ivy_Kathryn.wav",
    "Veronika_Harry.wav",
]

for media_file in media_files:
    print(f"Running Speaker Diarization for file {media_file}.")
    media_file_based_task = run_media_based_task(media_file)
    media_file_based_task_result = media_file_based_task["result"]
    print(json.dumps(media_file_based_task_result, indent=2))
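
To sanity-check the output against the table of test files at the start of this guide, the detected speaker counts can be compared with the expected ones. This is only a sketch: EXPECTED_SPEAKERS and detected_speaker_counts are hypothetical names, and it assumes the counts in the table are listed in channel_number order.

```python
# Expected speakers per channel, copied from the table of test files.
# Assumption: the values are listed in channel_number order.
EXPECTED_SPEAKERS = {
    "Kathryn_Paula.wav": [2],
    "Laura_Harry_Veronika.wav": [1, 2],
    "Laura_Ivy_Kathryn.wav": [3],
    "Veronika_Harry.wav": [2],
}


def detected_speaker_counts(task_result):
    """Return speakers_count for each channel, ordered by channel_number."""
    channels = sorted(task_result["channels"], key=lambda c: c["channel_number"])
    return [channel["speakers_count"] for channel in channels]
```

Inside the loop above, detected_speaker_counts(media_file_based_task_result) could then be compared with EXPECTED_SPEAKERS[media_file]. A mismatch does not necessarily mean an error in the setup, since the detected count depends on the audio and on the config used.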
