Enhanced Speech Translation Built on Whisper

Speech translation builds on Enhanced Speech to Text Built on Whisper to provide machine translation of the speech in an audio recording. If the speech is in a supported language, users can translate it into text in a chosen language.

Supported languages

To see the complete list of more than 60 supported audio (source) languages and 16 translation (target) languages, visit this documentation page.

Speech translation supports multi-channel audio files. The resulting translations include details about the processed channels, the detected source languages, and the timestamps of the utterances. The source languages can also be specified manually. Only licensed languages can be used for translation. A GPU is required to achieve reasonable translation speeds.
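
As a rough illustration of the result details described above, the sketch below models utterances that carry a channel, a source language, and timestamps, and groups them per channel. All class and field names are hypothetical and chosen for this example only; they do not reflect the actual response format of the product.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    # Hypothetical fields; the real response format may differ.
    channel: int          # index of the audio channel
    source_language: str  # detected or manually specified language code
    start_s: float        # utterance start, in seconds
    end_s: float          # utterance end, in seconds
    text: str             # translated text


def group_by_channel(utterances):
    """Return {channel: [utterances sorted by start time]}, channels ascending."""
    grouped = defaultdict(list)
    for utt in utterances:
        grouped[utt.channel].append(utt)
    return {
        channel: sorted(items, key=lambda u: u.start_s)
        for channel, items in sorted(grouped.items())
    }
```

Grouping by channel first and sorting by start time second keeps each speaker side of a multi-channel recording readable as its own timeline.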

Translation to other languages

Speech Translation built on Whisper can itself translate speech only into English text. Translating speech into other languages requires a third-party provider, so translation into a language other than English is performed in two phases:

  1. Source language → English (via Whisper-based Enhanced Speech to Text)
  2. English → Target language (via a third-party provider)

The third-party provider used for translation is Argos Translate. The 16 target languages include Chinese, Arabic, German, French, Spanish, and Russian. You can see the full list of translation languages here.
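
The two phases can be sketched as a small pipeline that pivots through English. This is a minimal illustration, not the product's implementation: `translate_speech`, `speech_to_english`, and `english_to_target` are hypothetical names, and the first phase is left as a pluggable callable because its interface is not described here. The second helper assumes the open-source `argostranslate` Python package with the relevant language package already installed.

```python
def translate_speech(audio_path, target_language, speech_to_english, english_to_target):
    """Translate speech into target_language, pivoting through English.

    speech_to_english(audio_path) -> English text            (phase 1, placeholder)
    english_to_target(text, target_language) -> target text  (phase 2, placeholder)
    """
    english_text = speech_to_english(audio_path)  # phase 1: source -> English
    if target_language == "en":
        return english_text  # English is the direct output, no second phase
    return english_to_target(english_text, target_language)  # phase 2


def argos_english_to_target(english_text, target_language):
    """Phase 2 using Argos Translate (assumes the en -> target package is installed)."""
    import argostranslate.translate  # imported lazily: optional dependency

    return argostranslate.translate.translate(english_text, "en", target_language)
```

Because every non-English target passes through the English text, any error or ambiguity introduced in the first phase is carried into the second.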

Note

Because this is a two-step process, there may be cases where nuances or context are partially lost between translations.

Language switching

In the default auto-detect mode, the first 30 seconds of the recording are used to detect the language, and that language is then assumed for the translation of the whole recording. This can degrade the resulting translation if parts of the recording after the 30-second mark are in a different language. With the optional language switching parameter, the source language is instead detected separately for each 30-second segment. More details about language switching, including its limitations, can be found in the Enhanced Speech to Text Built on Whisper article.
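
The difference between the two modes can be illustrated with a short sketch. It assumes fixed, consecutive 30-second windows starting at zero and a hypothetical `detect(start, end)` callable that returns a language code for a part of the recording; the actual segmentation and detection logic of the product may differ.

```python
import math

SEGMENT_SECONDS = 30.0  # window length stated in the text


def segment_bounds(duration_s, segment_s=SEGMENT_SECONDS):
    """Split [0, duration_s) into consecutive windows of at most segment_s seconds."""
    count = math.ceil(duration_s / segment_s)
    return [(i * segment_s, min((i + 1) * segment_s, duration_s)) for i in range(count)]


def assign_languages(duration_s, detect, language_switching=False):
    """Return [(start, end, language)] for each window of the recording."""
    bounds = segment_bounds(duration_s)
    if not bounds:
        return []
    if language_switching:
        # Language switching: detect the language separately per window.
        return [(start, end, detect(start, end)) for start, end in bounds]
    # Default mode: the language of the first window applies to everything.
    first_start, first_end = bounds[0]
    language = detect(first_start, first_end)
    return [(start, end, language) for start, end in bounds]
```

For a 70-second recording that switches language after 30 seconds, the default mode labels all three windows with the first language, while language switching labels the second and third windows with the language detected in each of them.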

FAQ

Why is the translation inaccurate?

  1. The audio was translated from a language other than the one actually spoken in the recording, because the spoken language is not part of Phonexia’s portfolio.
  2. The audio quality is very low, or the speech is not understandable from the recording.
  3. There is background noise or music that deteriorates recording quality.
  4. When translating into non-English languages, the system uses a two-step process (via English as an intermediate step). In such cases, some details or nuances may be lost in translation.

How can I improve the processing speed?

Make sure you’re running Speech Translation on a GPU to speed up the processing.