Speech to Text: start task
POST /api/technology/speech-to-text
Start a Speech to Text task for a media file.
Speech to Text features
- Multi-channel audio files are supported.
- Channel id is included in individual transcription segments.
- The built-in vocabulary can be extended using the config field of multipart/form-data. The value of config is a string in JSON format.
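As a rough illustration of the last point, the sketch below encodes a multipart/form-data body in which config is sent as a JSON string next to the media file. It uses only the Python standard library. The helper build_multipart, the file field name "file", the file name "call.wav" and the config key "additional_words" are assumptions made for this example and are not confirmed by this page.

```python
import json
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields and one file as a multipart/form-data body.

    Returns a tuple (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        text_part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(text_part.encode("utf-8"))
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


# The value of config is a JSON *string*, not a nested form structure.
# "additional_words" and "file" are hypothetical names used for illustration.
config = json.dumps({"additional_words": []})
body, content_type = build_multipart(
    {"config": config}, "file", "call.wav", b"placeholder audio bytes"
)
```

The resulting body and content_type could then be passed to any HTTP client as the request payload and the Content-Type header of the POST request.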
Fine-tuning the transcription
It is possible to gain finer control over the Speech to Text transcription in two ways. Firstly, you may specify an array of preferred phrases, which will be prioritized in ambiguous cases. Secondly, you can extend the built-in vocabulary by providing an array of additional words. These two options can be used either in tandem or independently.
Preferred phrases
When speech is unclear, the Speech to Text technology usually prefers the word that makes more sense in the given context. For example, it might be impossible to determine acoustically whether a speaker said "I'm going to cell my car" or "I'm going to sell my car". However, the context suggests that the speaker is probably talking about selling their car. In other cases, though, the context might not be as clear. Consider the following sentence: "He bought flour in the shop". In the given context, the flour might very well be flower.
Preferred phrases allow you to leverage your unique knowledge of the recording's context and prompt the technology with utterances that are expected to appear in the speech. This is especially helpful for transcribing predictable or domain-specific conversations.
Additional words
You may use this configuration option to specify the pronunciation of words that occur in preferred phrases. If a word from the preferred phrases has no explicitly specified pronunciation, one of two things can happen. Either the word is in the technology's vocabulary and its built-in (and thus precise) pronunciation is used, or the word is unknown and the technology does its best to generate a default pronunciation based on the word's written form. Note that this may result in incorrect pronunciations, especially for words foreign to the recording's language. Therefore, relying on auto-generated pronunciation is discouraged.
Alternatively, this option can be used to extend the technology's built-in vocabulary with new words. This is especially useful for industry-specific terms, foreign words, slang or neologisms. Another use case is adding region- or country-specific pronunciations to otherwise known words. Again, it is not mandatory to provide a pronunciation for new words, but the same limitations as described above apply.
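The two options described above could be combined into a single config value along the following lines. This is a minimal sketch only: the key names "preferred_phrases", "additional_words", "spelling" and "pronunciations" and the helper build_config are assumptions for illustration, and the phoneme notation for pronunciations is not covered on this page, so the example leaves pronunciations out.

```python
import json


def build_config(preferred_phrases=None, additional_words=None):
    """Serialize fine-tuning options into a JSON string for the config field.

    additional_words maps a word's spelling to a list of pronunciation
    strings, or to None when no explicit pronunciation is supplied.
    All key names below are hypothetical.
    """
    config = {}
    if preferred_phrases:
        config["preferred_phrases"] = list(preferred_phrases)
    if additional_words:
        words = []
        for spelling, pronunciations in additional_words.items():
            entry = {"spelling": spelling}
            if pronunciations:
                # Only include pronunciations that were given explicitly.
                entry["pronunciations"] = list(pronunciations)
            words.append(entry)
        config["additional_words"] = words
    return json.dumps(config)


# Prefer "flower" over "flour" for a recording known to be about a florist.
config_json = build_config(
    preferred_phrases=["he bought flower in the shop"],
    additional_words={"flower": None},
)
```

Because either argument may be omitted, the same helper covers using the two options in tandem or independently.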
Request
Responses
- 202
- 400
- 403
- 413
- 422
- 429
- 507
Speech to Text task was accepted. Follow the X-Location header to poll for the task state.
Response Headers
X-Location: A URL the client should poll for task state and result.
Example: /api/technology/speech-to-text/123e4567-e89b-12d3-a456-426614174000
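Polling the URL from the X-Location header might look roughly like the sketch below. The HTTP call is injected as a callable so that the loop itself stays independent of any client library. The function poll_task, the "state" field and its values are hypothetical, since the shape of the task response is not described on this page.

```python
import time


def poll_task(fetch_state, location, is_finished,
              interval_s=2.0, max_attempts=30, sleep=time.sleep):
    """Poll `location` until `is_finished(state)` is true.

    fetch_state: callable taking the URL and returning the parsed task state.
    is_finished: callable deciding whether the returned state is final.
    Raises TimeoutError if the task is not finished after max_attempts polls.
    """
    for attempt in range(max_attempts):
        state = fetch_state(location)
        if is_finished(state):
            return state
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"task at {location} not finished after {max_attempts} polls")


# Demonstration with a stubbed fetcher; a real client would issue a GET request.
responses = iter([{"state": "pending"}, {"state": "running"}, {"state": "done"}])
result = poll_task(
    fetch_state=lambda url: next(responses),
    location="/api/technology/speech-to-text/123e4567-e89b-12d3-a456-426614174000",
    is_finished=lambda state: state["state"] == "done",
    sleep=lambda seconds: None,  # skip real waiting in the demonstration
)
```

A fixed interval is the simplest choice here; a client under load may prefer to lengthen the interval between polls.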
Request payload data was invalid and could not be parsed.
Request is forbidden.
The request entity (payload) size exceeds the allowed limit.
An error occurred during validation of the request payload data.
Request rate limit exceeded.
The request may be retried after a while. The following response headers may be checked for details: retry-after, x-ratelimit-limit, x-ratelimit-remaining, x-ratelimit-reset.
Response Headers
retry-after: Indicates how long the user agent should wait before making a follow-up request.
x-ratelimit-limit: Size of the current rate limiting window.
x-ratelimit-remaining: Remaining number of requests in the current rate limiting window.
x-ratelimit-reset: Time at which the current rate limiting window resets (in UTC epoch).
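A client could derive its wait time from these headers as sketched below. The sketch assumes that retry-after carries a number of seconds and that x-ratelimit-reset is expressed in epoch seconds; neither unit is stated explicitly on this page. The function name retry_delay_seconds and the fallback default are hypothetical.

```python
import time


def retry_delay_seconds(headers, now=None, default=1.0):
    """Return how many seconds to wait before retrying a rate-limited request.

    Prefers retry-after (assumed to be in seconds), then falls back to
    x-ratelimit-reset (assumed to be epoch seconds), then to `default`.
    """
    now = time.time() if now is None else now
    # Header names are case-insensitive, so normalize them first.
    lowered = {key.lower(): value for key, value in headers.items()}

    retry_after = lowered.get("retry-after")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # not a plain number of seconds; try the next header

    reset = lowered.get("x-ratelimit-reset")
    if reset is not None:
        try:
            return max(0.0, float(reset) - now)
        except ValueError:
            pass

    return default


delay = retry_delay_seconds({"Retry-After": "5", "X-RateLimit-Reset": "1700000000"})
```

Clamping to zero avoids a negative wait when the reset time has already passed by the time the response is processed.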
The storage is full and cannot accept any data.