Language Identification
This guide demonstrates how to perform Language Identification with Phonexia Speech Platform 4. You can find a high-level description in the Language Identification article.
For testing, we'll be using the following media files. You can download them all together in the audio_files.zip archive.
| filename | language code | language name |
|---|---|---|
| Adedewe.wav | yo | Yoruba |
| Dina.wav | arb | Arabic (MSA) |
| Fadimatu.wav | ha | Hausa |
| Harry.wav | en-GB | British English |
| Juan.wav | es-XA | Spanish (American) |
| Julia.wav | en-US | US English |
| Lenka.wav | cs-CZ | Czech |
| Lubica.wav | sk-SK | Slovak |
| Luka.wav | hbs | Serbo-Croatian |
| Nirav.wav | gu-IN | Gujarati |
| Noam.wav | he-IL | Hebrew |
| Obioma.wav | ig-NG | Igbo |
| Tatiana.wav | ru-RU | Russian |
| Thida.wav | km-KH | Khmer |
| Tuan.wav | vi-VN | Vietnamese |
| Xiang.wav | zh-CN | Mandarin Chinese |
| Zoltan.wav | hu-HU | Hungarian |
At the end of this guide, you'll find the full Python code example that combines all the steps, which are first discussed separately. This guide should give you a comprehensive understanding of how to integrate Language Identification into your own projects.
Prerequisites
Follow the prerequisites for setup of Virtual Appliance and Python environment as described in the Task lifecycle code examples.
Run Language Identification
To run Language Identification for a single media file, you should start by
sending a POST request to the
/api/technology/language-identification
endpoint. file is the only mandatory parameter. In Python, you can do this as
follows:
import requests
VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000" # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/language-identification"
media_file = "Harry.wav"
with open(media_file, mode="rb") as file:
    files = {"file": file}
    start_task_response = requests.post(
        url=MEDIA_FILE_BASED_ENDPOINT_URL,
        files=files,
    )
print(start_task_response.status_code) # Should print '202'
If the task has been successfully accepted, the 202 code will be returned
together with a unique task ID in the response body. The task isn't processed
immediately, but only scheduled for processing. You can check the current task
status by polling for the result.
Polling
To obtain the final result, periodically query the task status until the task
state changes to done, failed or rejected. The general polling procedure
is described in detail in the
Task lifecycle code examples.
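The polling loop can be sketched as follows. This is a minimal sketch under stated assumptions, not the platform's reference implementation: wait_for_task, fetch_task and the timeout parameter are hypothetical names introduced here for illustration, and the code relies only on the task.state field that appears in the result JSON in this guide.

```python
import time

# Final task states named in this guide.
FINAL_STATES = {"done", "failed", "rejected"}


def wait_for_task(fetch_task, polling_interval=5.0, timeout=600.0):
    """Call fetch_task() repeatedly until the task reaches a final state.

    fetch_task is a hypothetical zero-argument callable that returns the
    parsed JSON of the task, e.g. a thin wrapper around an HTTP GET.
    The timeout is an addition of this sketch, not a platform feature.
    """
    deadline = time.monotonic() + timeout
    while True:
        task_json = fetch_task()
        if task_json["task"]["state"] in FINAL_STATES:
            return task_json
        if time.monotonic() >= deadline:
            raise TimeoutError("Task did not finish within the timeout.")
        time.sleep(polling_interval)
```

With the requests library, fetch_task could be a lambda that calls requests.get on the polling URL and returns the .json() of the response. The full code at the end of this guide takes that URL from the Location header of the start response. Storing the return value in polling_task_response_json gives the variable used in the parsing example later in this guide.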
Result for Language Identification
The result field of the task contains information about individual input media
channels which can be identified by their channel_number.
By default, the result contains scores for more than a hundred languages. The
following JSON is a manually shortened result of a successful Language
Identification task for the Harry.wav file which shows that the language was
correctly identified as British English ("en-GB") with the probability close
to 1.0, and that Australian English ("en-AU") also received some "points" in
contrast to Greek ("el-GR"). You can find the meaning of individual language
tags in the list of
supported languages.
{
"task": {
"task_id": "cccd6bf9-9c8c-44a3-9373-c0182fc096b4",
"state": "done"
},
"result": {
"channels": [
{
"channel_number": 0,
"speech_length": 30.0,
"scores": [
...
{
"identifier": "el-gr",
"identifier_type": "language",
"probability": 0.0
},
{
"identifier": "en-au",
"identifier_type": "language",
"probability": 0.00212
},
{
"identifier": "en-gb",
"identifier_type": "language",
"probability": 0.99787
},
...
]
}
]
}
}
You can easily parse the result and select for example only the three
top-scoring languages in the first channel (those with the highest
probability), print them to the console and save them to a file like this:
import json
scores = polling_task_response_json["result"]["channels"][0]["scores"]
top_scores = sorted(scores, key=lambda x: x["probability"], reverse=True)[:3]
print(top_scores)
with open("output.json", "w") as output_file:
    json.dump(top_scores, output_file, indent=2)
This will produce the following JSON array:
[
{
"identifier": "en-gb",
"identifier_type": "language",
"probability": 0.99787
},
{
"identifier": "en-au",
"identifier_type": "language",
"probability": 0.00212
},
{
"identifier": "ab-ge",
"identifier_type": "language",
"probability": 0.0
}
]
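If a recording has more than one channel, the same idea can be applied to each channel separately. The following sketch assumes only the result layout shown in this guide (channels, channel_number, scores); best_language_per_channel is a hypothetical helper name, and the sample data reuses two score items from the shortened Harry.wav result.

```python
def best_language_per_channel(task_json):
    """Map each channel_number to its highest-probability score item."""
    best = {}
    for channel in task_json["result"]["channels"]:
        # max() raises ValueError on an empty scores list; this sketch
        # assumes every channel contains at least one score item.
        best[channel["channel_number"]] = max(
            channel["scores"], key=lambda item: item["probability"]
        )
    return best


# Shortened sample based on the Harry.wav result in this guide.
sample_task_json = {
    "result": {
        "channels": [
            {
                "channel_number": 0,
                "scores": [
                    {"identifier": "en-au", "identifier_type": "language", "probability": 0.00212},
                    {"identifier": "en-gb", "identifier_type": "language", "probability": 0.99787},
                ],
            }
        ]
    }
}

for number, item in best_language_per_channel(sample_task_json).items():
    print(f"Channel {number}: {item['identifier']} ({item['probability']:.5f})")
# Prints: Channel 0: en-gb (0.99787)
```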
Run Language Identification with Parameters
If you want to have more control over the output, you can use the config
request body field in which you can limit the list of languages that will be
shown in the output, and you can define language_groups that will make certain
languages be treated as a single result item. See the
endpoint documentation
for more details.
In the following example, we're using the config field to instruct the
Language Identification technology to limit the list of languages to German,
English, and Dutch (related Germanic languages), and to treat all available
dialects of English as one group:
import json
import requests
VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000" # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/language-identification"
media_file = "Harry.wav"
config = {
"config": json.dumps(
{
"languages": ["de", "en-AU", "en-GB", "en-IN", "en-US", "nl"],
"language_groups": [
{
"identifier": "English",
"languages": ["en-AU", "en-GB", "en-IN", "en-US"],
}
],
}
)
}
with open(media_file, mode="rb") as file:
    files = {"file": file}
    start_task_response = requests.post(
        url=MEDIA_FILE_BASED_ENDPOINT_URL,
        data=config,
        files=files,
    )
print(start_task_response.status_code) # Should print '202'
After polling for the result as in the previous example, we get the following
scores. Notice that the "English" group now received the maximum possible
probability of 1.0, and we can see how much the individual dialects contributed
to the overall score:
[
{
"identifier": "English",
"identifier_type": "group",
"probability": 1.0,
"languages": [
{
"identifier": "en-au",
"identifier_type": "language",
"probability": 0.00212
},
{
"identifier": "en-gb",
"identifier_type": "language",
"probability": 0.99788
},
{
"identifier": "en-in",
"identifier_type": "language",
"probability": 0.0
},
{
"identifier": "en-us",
"identifier_type": "language",
"probability": 0.0
}
]
},
{
"identifier": "de",
"identifier_type": "language",
"probability": 0.0
},
{
"identifier": "nl",
"identifier_type": "language",
"probability": 0.0
}
]
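To make the contribution of each dialect explicit, a group item can be unpacked in code. The sketch below assumes that group items carry a nested languages list, as in the result above; describe_item is a hypothetical helper name, and the sample item is copied from that result.

```python
def describe_item(item):
    """Return printable lines for one score item, expanding group members."""
    lines = [
        f"{item['identifier']} ({item['identifier_type']}): {item['probability']:.5f}"
    ]
    if item["identifier_type"] == "group":
        # Plain language items have no nested list, hence the default.
        for member in item.get("languages", []):
            lines.append(f"  {member['identifier']}: {member['probability']:.5f}")
    return lines


# The "English" group item from the result above.
english_group = {
    "identifier": "English",
    "identifier_type": "group",
    "probability": 1.0,
    "languages": [
        {"identifier": "en-au", "identifier_type": "language", "probability": 0.00212},
        {"identifier": "en-gb", "identifier_type": "language", "probability": 0.99788},
        {"identifier": "en-in", "identifier_type": "language", "probability": 0.0},
        {"identifier": "en-us", "identifier_type": "language", "probability": 0.0},
    ],
}

print("\n".join(describe_item(english_group)))
```

In this sample, the member probabilities add up to the probability of the group.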
Full Python Code
Here is a full example of how to run the Language Identification technology with parameters that limit the list of languages to those actually spoken in the sample dataset (plus a few more English dialects). The code is slightly adjusted and wrapped into functions for better readability. Refer to the Task lifecycle code examples for a generic code template applicable to all technologies.
⚠️ Warning: If you use both the languages and language_groups parameters, make sure that all individual languages in a group are also included in the global languages list. The example also shows that a language group can contain any language (e.g., "Czech" and "Slovak"), not just dialects of one language.
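The rule from the warning can be checked locally before a request is sent. This is a sketch that assumes the config uses the languages and language_groups keys shown in this guide; find_missing_group_languages is a hypothetical helper and not part of the platform API. It compares language codes exactly as written.

```python
def find_missing_group_languages(config):
    """Return {group identifier: [group languages absent from "languages"]}."""
    if "languages" not in config:
        # The rule only applies when both parameters are used together.
        return {}
    allowed = set(config["languages"])
    missing = {}
    for group in config.get("language_groups", []):
        absent = [lang for lang in group["languages"] if lang not in allowed]
        if absent:
            missing[group["identifier"]] = absent
    return missing


# Deliberately incomplete config: "sk-SK" is missing from "languages".
example_config = {
    "languages": ["cs-CZ", "en-GB", "en-US"],
    "language_groups": [
        {"identifier": "English", "languages": ["en-GB", "en-US"]},
        {"identifier": "Czecho-Slovak", "languages": ["cs-CZ", "sk-SK"]},
    ],
}

print(find_missing_group_languages(example_config))
# Prints: {'Czecho-Slovak': ['sk-SK']}
```

An empty dictionary means that every grouped language is also present in the global list.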
The following Python code writes its result to the top_scores.json file:
import json
import requests
import time
VIRTUAL_APPLIANCE_ADDRESS = "http://<virtual-appliance-address>:8000" # Replace with your address
MEDIA_FILE_BASED_ENDPOINT_URL = f"{VIRTUAL_APPLIANCE_ADDRESS}/api/technology/language-identification"
def poll_result(polling_url, polling_interval=5):
    """Poll the task endpoint until processing completes."""
    while True:
        polling_task_response = requests.get(polling_url)
        polling_task_response.raise_for_status()
        polling_task_response_json = polling_task_response.json()
        task_state = polling_task_response_json["task"]["state"]
        if task_state in {"done", "failed", "rejected"}:
            break
        time.sleep(polling_interval)
    return polling_task_response
def run_media_based_task(media_file, params=None, config=None):
    """Create a media-based task and wait for results."""
    if params is None:
        params = {}
    if config is None:
        config = {}
    with open(media_file, mode="rb") as file:
        files = {"file": file}
        start_task_response = requests.post(
            url=MEDIA_FILE_BASED_ENDPOINT_URL,
            files=files,
            params=params,
            data={"config": json.dumps(config)},
        )
    start_task_response.raise_for_status()
    polling_url = start_task_response.headers["Location"]
    task_result = poll_result(polling_url)
    return task_result.json()
# Run Language Identification
media_files = [
"Adedewe.wav",
"Dina.wav",
"Fadimatu.wav",
"Harry.wav",
"Juan.wav",
"Julia.wav",
"Lenka.wav",
"Lubica.wav",
"Luka.wav",
"Nirav.wav",
"Noam.wav",
"Obioma.wav",
"Tatiana.wav",
"Thida.wav",
"Tuan.wav",
"Xiang.wav",
"Zoltan.wav",
]
config = {
"languages":[
"arb",
"cs-CZ",
"en-AU",
"en-GB",
"en-IN",
"en-US",
"es-XA",
"gu-IN",
"ha",
"hbs",
"he-IL",
"hu-HU",
"ig-NG",
"km-KH",
"ru-RU",
"sk-SK",
"vi-VN",
"yo",
"zh-CN"
],
"language_groups":[
{
"identifier":"English",
"languages":[
"en-AU",
"en-GB",
"en-IN",
"en-US"
]
},
{
"identifier":"Czecho-Slovak",
"languages":[
"cs-CZ",
"sk-SK"
]
}
]
}
media_file_based_results = {}
for media_file in media_files:
    print(f"Running Language Identification for file {media_file}.")
    media_file_based_task = run_media_based_task(media_file, config=config)
    scores = media_file_based_task["result"]["channels"][0]["scores"]
    top_scores = sorted(scores, key=lambda x: x["probability"], reverse=True)[:3]
    media_file_based_results[media_file] = top_scores
    print(f"The top-scoring languages in {media_file} are: {top_scores}")
# Save the collected top scores of all files.
with open("top_scores.json", "w") as output_file:
    json.dump(media_file_based_results, output_file, indent=2)
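The saved top scores can then be compared with the expected languages from the table at the beginning of this guide. The sketch below is an illustration under stated assumptions: matches_expected is a hypothetical helper, it compares identifiers case-insensitively because the results shown in this guide use lowercase identifiers, and it counts a group as a match when the expected language is one of its members. The sample items are taken from the Harry.wav results in this guide.

```python
def matches_expected(top_item, expected_code):
    """Check whether the top-scoring item agrees with the expected language."""
    expected = expected_code.lower()
    if top_item["identifier_type"] == "group":
        # A group matches if the expected language is among its members.
        members = top_item.get("languages", [])
        return any(member["identifier"].lower() == expected for member in members)
    return top_item["identifier"].lower() == expected


# Top items for Harry.wav, without and with the "English" group.
plain_top_item = {"identifier": "en-gb", "identifier_type": "language", "probability": 0.99787}
group_top_item = {
    "identifier": "English",
    "identifier_type": "group",
    "probability": 1.0,
    "languages": [
        {"identifier": "en-au", "identifier_type": "language", "probability": 0.00212},
        {"identifier": "en-gb", "identifier_type": "language", "probability": 0.99788},
    ],
}

print(matches_expected(plain_top_item, "en-GB"))  # True
print(matches_expected(group_top_item, "en-GB"))  # True
print(matches_expected(plain_top_item, "cs-CZ"))  # False
```

Note that a group match is less specific than a language match: it only confirms that the expected language belongs to the winning group.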