Speech to Text: start task

POST /api/technology/speech-to-text

Start a Speech to Text task for a media file.

Speech to Text features

  • Multi-channel audio files are supported.
  • The channel ID is included in each transcription segment.
  • The built-in vocabulary can be extended using the config field of the multipart/form-data request. The value of config is a string in JSON format.

Fine-tuning the transcription

It is possible to gain finer control over the Speech to Text transcription in two ways:

  1. Specify an array of preferred phrases, which will be prioritized in ambiguous cases.
  2. Extend the built-in vocabulary by providing an array of additional words.

These two options can be used either in tandem or independently.
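
As a rough illustration of how the two options might be combined into the config value, the sketch below serializes them into a JSON string. The key names preferred_phrases and additional_words, and the helper build_config itself, are assumptions made for this example; this section does not state the actual schema, so check it against the request definition before use.

```python
import json


def build_config(preferred_phrases=None, additional_words=None):
    """Return a JSON string suitable for a `config` form field.

    NOTE: the key names below are assumed for illustration only;
    the section above does not spell out the real schema.
    """
    config = {}
    # Each option is included only when provided, since the two
    # options can be used in tandem or independently.
    if preferred_phrases:
        config["preferred_phrases"] = list(preferred_phrases)
    if additional_words:
        config["additional_words"] = list(additional_words)
    return json.dumps(config)


config_value = build_config(preferred_phrases=["he is a minor"])
```

The resulting string would then be sent as the config field of the multipart/form-data request, alongside the media file.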

Preferred phrases

The Speech to Text technology is trained to prefer words that appear more frequently in a given context. For example, even if the words "sell" and "cell" sound exactly the same, the immediate context makes it clear that in the phrase "I'm going to sell my car" the speaker could hardly mean "I'm going to cell my car". Other cases may not be as clear, though. Consider the phrase "He is a miner". Depending on context, "miner" might very well be "minor".

Preferred phrases allow you to leverage your unique knowledge of the recording's context and prompt the technology with utterances that are expected to appear in the speech. This is especially helpful for transcribing predictable or domain-specific conversations.

Additional words

You can use this option to:

  1. Extend the technology's built-in vocabulary with completely new words.
  2. Add new (e.g., region- or country-specific) pronunciations to words already in the vocabulary.

This is especially useful for industry-specific terms, foreign words, slang, or neologisms, and it can strengthen the effect of preferred phrases. If a word used in preferred phrases doesn't have an explicitly specified pronunciation, one of two things happens:

  1. The word is included in the technology's vocabulary and a built-in pronunciation is used.
  2. The word is not included and the technology generates a pronunciation based on the word's written form. This may result in an incorrect pronunciation, especially for words with unusual spelling. Therefore, relying on auto-generated pronunciation is discouraged.
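
Because auto-generated pronunciations are discouraged, it can help to check on the client side which words in the preferred phrases have no explicit pronunciation before submitting a task. The sketch below is one possible way to do that. The entry shape (a dict with spelling and pronunciations keys) and the helper name are hypothetical, and the pronunciation string is a placeholder rather than real phoneme notation. The check cannot tell whether a word is already in the built-in vocabulary, so it only produces a list to review by hand.

```python
def words_without_pronunciation(preferred_phrases, additional_words):
    """List words from the phrases that lack an explicit pronunciation.

    NOTE: the `spelling` / `pronunciations` keys are assumed for
    illustration; they are not confirmed by the section above.
    """
    # Words for which at least one pronunciation was supplied.
    explicit = {
        entry["spelling"].lower()
        for entry in additional_words
        if entry.get("pronunciations")
    }
    missing = []
    for phrase in preferred_phrases:
        # Naive whitespace tokenization; punctuation is not stripped.
        for word in phrase.lower().split():
            if word not in explicit and word not in missing:
                missing.append(word)
    return missing


to_review = words_without_pronunciation(
    ["tesla cybertruck"],
    [{"spelling": "Cybertruck", "pronunciations": ["<placeholder>"]}],
)
```

Words in the returned list either fall back to a built-in pronunciation or get an auto-generated one, so unusual spellings among them are the ones worth adding explicitly.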

Request

Responses

Speech to Text task was accepted. Follow the Location header to poll for the task state.

Response Headers

  • X-Location
    ⚠️ Deprecated - use Location header instead.
    Example: /api/technology/speech-to-text/123e4567-e89b-12d3-a456-426614174000

  • Location
    A URL the client should poll for task state and result.
    Example: /api/technology/speech-to-text/123e4567-e89b-12d3-a456-426614174000
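
The example header values are relative paths, so a client has to resolve them against the server's base URL before polling. The sketch below shows one way to do that, preferring Location and falling back to the deprecated X-Location only when Location is absent. The function name and the host speech.example.com are hypothetical.

```python
from urllib.parse import urljoin


def polling_url(base_url, headers):
    """Resolve the task polling URL from the response headers."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    # Prefer Location; X-Location is deprecated and used only as a fallback.
    location = normalized.get("location") or normalized.get("x-location")
    if location is None:
        raise ValueError("response carries neither Location nor X-Location")
    # urljoin handles both relative paths and absolute URLs.
    return urljoin(base_url, location)


url = polling_url(
    "https://speech.example.com",  # hypothetical host
    {"Location": "/api/technology/speech-to-text/123e4567-e89b-12d3-a456-426614174000"},
)
```

The returned URL is the one to poll for the task state and result.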