Speech Translation

This guide demonstrates how to perform Speech Translation with Phonexia Speech Platform 4. The technology is based on Enhanced Speech to Text Built on Whisper; a high-level description can be found in the Enhanced Speech to Text Built on Whisper article.

For testing, we'll be using the following media files. You can download them all together in the audio_files.zip archive:

| filename | language name |
| --- | --- |
| Lenka.wav | Czech |
| Tatiana.wav | Russian |
| Xiang.wav | Mandarin Chinese |
| Zoltan.wav | Hungarian |

At the end of this guide, you'll find a full Python code example that combines all the steps discussed individually below. This guide should give you a comprehensive understanding of how to integrate Speech Translation into your own projects.

Prerequisites

Follow the prerequisites for setup of Virtual Appliance and Python environment as described in the Task lifecycle code examples.

Run Speech Translation

By default, the source language is auto-detected once at the beginning of the file, and the content is translated into English. To run Speech Translation for a single media file, start by sending a POST request to the /api/technology/speech-translation-whisper-enhanced endpoint; file is the only mandatory parameter. In Python, you can do this as follows:

import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000" # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speech-translation-whisper-enhanced"

media_file = "Lenka.wav"

with open(media_file, mode="rb") as file:
    files = {"file": file}
    start_task_response = requests.post(
        url=MEDIA_FILE_BASED_ENDPOINT_URL,
        files=files,
    )
print(start_task_response.status_code)  # Should print '202'

If the task has been successfully accepted, the 202 code will be returned together with a unique task ID in the response body. The task isn't processed immediately, but only scheduled for processing. You can check the current task status by polling for the result.
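Reading the task ID out of the response body can be sketched as a small helper. The response shape below follows the result examples later in this guide; the "pending" state and the task ID value are illustrative, not taken from a real response:

```python
def extract_task_id(response_json):
    """Return the unique task ID from the 202 response body."""
    return response_json["task"]["task_id"]


# Illustrative response body; the task ID and "pending" state are made-up examples.
accepted = {"task": {"task_id": "b1850fed-3e4b-4f2a-94c8-9a57e73abb9f", "state": "pending"}}
print(extract_task_id(accepted))  # b1850fed-3e4b-4f2a-94c8-9a57e73abb9f
```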

Polling

To obtain the final result, periodically query the task status until the task state changes to done, failed or rejected. The general polling procedure is described in detail in the Task lifecycle code examples.
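The polling loop itself can be sketched independently of the HTTP details. Here fetch_status stands in for a GET request to the task endpoint, and the "running" state used in the simulation is illustrative; only done, failed, and rejected are treated as terminal:

```python
import time


def poll_until_finished(fetch_status, interval=1.0, max_attempts=60):
    """Call fetch_status() until the task reaches a terminal state."""
    for _ in range(max_attempts):
        body = fetch_status()
        if body["task"]["state"] in {"done", "failed", "rejected"}:
            return body
        time.sleep(interval)
    raise TimeoutError("Task did not finish in time.")


# Simulated status sequence standing in for repeated GET requests.
states = iter(["running", "running", "done"])
result = poll_until_finished(lambda: {"task": {"state": next(states)}}, interval=0)
print(result["task"]["state"])  # done
```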

Result for Speech Translation

The result field of the task contains the one_best result, with a list of segments from all channels. Each segment contains the channel_number, start_time, end_time, the language of the translation, the translated text, the source_language actually used as the basis for translation, and the detected_source_language, i.e. the language the system detected in the audio.

For our sample file, the task should look as follows (shortened for readability):

{
  "task": {
    "task_id": "b1850fed-3e4b-4f2a-94c8-9a57e73abb9f",
    "state": "done"
  },
  "result": {
    "one_best": {
      "segments": [
        {
          "channel_number": 0,
          "start_time": 2.17,
          "end_time": 5.3,
          "language": "en",
          "text": "Good day, I am very happy that you are calling.",
          "source_language": "cs",
          "detected_source_language": "cs"
        },
        {
          "channel_number": 0,
          "start_time": 5.3,
          "end_time": 8.3,
          "language": "en",
          "text": "We have just opened the swimming courses,",
          "source_language": "cs",
          "detected_source_language": "cs"
        },
        {
          "channel_number": 0,
          "start_time": 8.3,
          "end_time": 12.08,
          "language": "en",
          "text": "and we have planned them like this.",
          "source_language": "cs",
          "detected_source_language": "cs"
        },
        {
          "channel_number": 0,
          "start_time": 12.08,
          "end_time": 15.41,
          "language": "en",
          "text": "The course is for one semester,",
          "source_language": "cs",
          "detected_source_language": "cs"
        },
        {
          "channel_number": 0,
          "start_time": 15.41,
          "end_time": 18.41,
          "language": "en",
          "text": "and the training takes place once or twice a week,",
          "source_language": "cs",
          "detected_source_language": "cs"
        },
        ...
      ]
    }
  }
}

In the example result, both source_language and detected_source_language contain the same language code. However, it's possible that the values may differ in some cases -- the detailed explanation of detected_source_language can be found in the description of the result schema for the GET /api/technology/speech-translation-whisper-enhanced/:task_id request.
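To work with the result programmatically, you can, for example, join the translated segment texts into one transcript per channel. A minimal sketch over the segment structure shown above:

```python
def transcript_from_result(result):
    """Join translated segment texts per channel into one string each."""
    transcripts = {}
    for segment in result["one_best"]["segments"]:
        transcripts.setdefault(segment["channel_number"], []).append(segment["text"])
    return {channel: " ".join(texts) for channel, texts in transcripts.items()}


# Two segments shaped like the example result above:
result = {"one_best": {"segments": [
    {"channel_number": 0, "start_time": 2.17, "end_time": 5.3, "language": "en",
     "text": "Good day, I am very happy that you are calling.",
     "source_language": "cs", "detected_source_language": "cs"},
    {"channel_number": 0, "start_time": 5.3, "end_time": 8.3, "language": "en",
     "text": "We have just opened the swimming courses,",
     "source_language": "cs", "detected_source_language": "cs"},
]}}
print(transcript_from_result(result)[0])
```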

Run Speech Translation with Parameters

You can specify the language of the resulting translated text with the output_language query parameter. The technology further supports two mutually exclusive query parameters -- source_language and enable_language_switching. In case you know what language is used in the file, you can specify it with the source_language parameter and possibly make the translation more accurate. In case the file contains multiple languages, you can use the enable_language_switching parameter, and the source language will be re-detected every 30 seconds.

When specifying the source_language and output_language, the POST request can look as follows:

import requests

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000" # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speech-translation-whisper-enhanced"

params = {"source_language": "cs", "output_language": "es"}

# or enable language switching
# params = {"enable_language_switching": True, "output_language": "es"}

media_file = "Lenka.wav"

with open(media_file, mode="rb") as file:
    files = {"file": file}
    start_task_response = requests.post(
        url=MEDIA_FILE_BASED_ENDPOINT_URL,
        files=files,
        params=params,
    )
print(start_task_response.status_code)  # Should print '202'

You can follow the polling and result-parsing steps demonstrated in the Run Speech Translation section; the task result would then look as follows:

{
  "one_best": {
    "segments": [
      {
        "channel_number": 0,
        "start_time": 2.17,
        "end_time": 5.3,
        "language": "es",
        "text": "Buenos días, estoy muy feliz de que me llame.",
        "source_language": "cs",
        "detected_source_language": "cs"
      },
      ...
    ]
  }
}
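When you pass source_language explicitly, it can be useful to flag segments where the detected source language disagrees with the language actually used for translation. A small sketch; the segment values below, including the "sk" code, are made-up examples:

```python
def mismatched_segments(result):
    """Return segments whose detected language differs from the one used."""
    return [
        segment
        for segment in result["one_best"]["segments"]
        if segment["detected_source_language"] != segment["source_language"]
    ]


# Illustrative segments; the second one simulates a detection mismatch.
segments = [
    {"source_language": "cs", "detected_source_language": "cs", "text": "..."},
    {"source_language": "cs", "detected_source_language": "sk", "text": "..."},
]
print(len(mismatched_segments({"one_best": {"segments": segments}})))  # 1
```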

Full Python code

Here is the full example of how to run the Speech Translation technology. The code is slightly adjusted and wrapped into functions for better readability. Refer to the Task lifecycle code examples for a generic code template applicable to all technologies.

import json
import requests
import time

VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000"  # Replace with your address

MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/speech-translation-whisper-enhanced"


def poll_result(polling_url, polling_interval=5):
    """Poll the task endpoint until processing completes."""
    while True:
        polling_task_response = requests.get(polling_url)
        polling_task_response.raise_for_status()
        polling_task_response_json = polling_task_response.json()
        task_state = polling_task_response_json["task"]["state"]
        if task_state in {"done", "failed", "rejected"}:
            break
        time.sleep(polling_interval)
    return polling_task_response


def run_media_based_task(media_file, params=None, config=None):
    """Create a media-based task and wait for results."""
    if params is None:
        params = {}
    if config is None:
        config = {}

    with open(media_file, mode="rb") as file:
        files = {"file": file}
        start_task_response = requests.post(
            url=MEDIA_FILE_BASED_ENDPOINT_URL,
            files=files,
            params=params,
            data={"config": json.dumps(config)},
        )
    start_task_response.raise_for_status()
    polling_url = start_task_response.headers["Location"]
    task_result = poll_result(polling_url)
    return task_result.json()


# Run Speech Translation
media_files = [
    "Lenka.wav",
    "Tatiana.wav",
    "Xiang.wav",
    "Zoltan.wav",
]

for media_file in media_files:
    print(f"Running Enhanced Speech Translation Built on Whisper for file {media_file}.")
    media_file_based_task = run_media_based_task(media_file)
    media_file_based_task_result = media_file_based_task["result"]
    print(json.dumps(media_file_based_task_result, indent=2))