We present the ML-SUPERB 2.0 Interspeech 2025 Challenge, which aims to develop state-of-the-art Automatic Speech Recognition (ASR) and Language Identification (LID) models that are more inclusive of the world's languages and language varieties. Building upon previous SUPERB challenges, which have traditionally focused on evaluating self-supervised speech representations, this year's challenge aims to improve the performance and robustness of multilingual ASR and LID systems. In particular, the original ML-SUPERB Benchmark and the ML-SUPERB 2023 Challenge encouraged participants to develop and benchmark state-of-the-art speech representations across various languages. In contrast, the ML-SUPERB 2.0 Challenge encourages the development of well-performing multilingual ASR and LID systems that can handle a diverse range of languages and dialects. The design of the challenge aligns with the objectives of the ML-SUPERB 2.0 benchmark, which encourages research on methods that leave no language "behind".
We also encourage participants to check out the Faetar Challenge, which focuses on a specific application of low-resource/dialectal ASR.
In summary, the challenge:
Please join our Slack workspace for further discussions!
We invite participants to develop systems that both identify the language of a spoken utterance (LID) and transcribe the content of the speech into text (ASR). The list of languages that need to be supported can be found here. We note that for some utterances, we will also provide the identity of the language being spoken. This is for the sake of running analyses on the effects of LID errors, and will not affect scoring. As such, participants will provide the following function as their submission.
def API(waveform, true_lid=None):
    """
    Args:
        waveform (np.array): 16kHz 1D speech waveform
        true_lid (str): ISO3 code of true language id, or None

    Returns:
        pred_lid (str): ISO3 code of predicted language id
        pred_asr (str): predicted transcript in all caps, no punctuation
    """

    # the case where we do not know the language being spoken
    if true_lid is None:
        pred_lid = ...  # participants: predict the language here
        pred_asr = ...  # participants: transcribe the speech here
    # we know the language being spoken
    else:
        pred_lid = true_lid
        pred_asr = ...  # participants: transcribe the speech here

    return pred_lid, pred_asr
Participants will fill in the API function to call their model(s) for LID prediction and/or ASR. The LID and ASR models can be completely independent. The LID prediction must be lowercase and formatted as the ISO 639-3 language code enclosed in square brackets (e.g., [eng] for English).
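As a sanity check before submitting, the bracketed lowercase format can be validated programmatically. The helper below is an illustrative sketch (the function name and regex are ours, not part of the challenge tooling); it checks only the surface format, not whether the three-letter code is actually assigned in ISO 639-3:

```python
import re

# Hypothetical helper: accepts "[xxx]" with exactly three lowercase
# letters; it does NOT verify the code exists in the ISO 639-3 registry.
LID_PATTERN = re.compile(r"^\[[a-z]{3}\]$")

def is_valid_lid_format(pred_lid: str) -> bool:
    return bool(LID_PATTERN.match(pred_lid))
```

For example, `is_valid_lid_format("[eng]")` is true, while `"[ENG]"` and `"eng"` are rejected.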
Participants are allowed to use any existing or collected dataset, as long as they report where/how the data was collected.
Participants are not allowed to download the test data from the evaluation server.
We provide the ML-SUPERB 2.0 public set as a baseline dataset for both training and development (recommended but not required).
Data links: Google Drive, Huggingface
This baseline dataset is sourced from a variety of multilingual speech corpora (e.g., FLEURS, Common Voice, Multilingual LibriSpeech, etc.), and it covers 141 of the required 153 languages:
We also release a development set for various language varieties, which covers 56 dialects and accents (recommended but not required). Data links: Google Drive and Huggingface
Please find detailed descriptions for the data, especially the development data, in the data description page.
Important notes:
Acknowledgement: we thank Tanel Alumäe for pointing out very important issues that led to the changes mentioned above.
In summary:
Participants are allowed to use any pre-trained model (such as LLMs) or modelling technique they desire. The only restriction for models is that they:
We provide a baseline model and training recipe to aid model development. The baseline model is an MMS model with 1 billion parameters, which is fine-tuned on the ML-SUPERB 2.0 training set using ESPnet. The MMS model is frozen during fine-tuning, and a weighted sum of its layers is fed into a simple 2-layer Transformer encoder, which is trained using the CTC loss. The total number of trainable parameters is 6.36M. Details about the training configuration can be found here. The total training time is around 48 hours using a single H100 80GB GPU and 96 hours using a single A100 40GB GPU. If you use a smaller pre-trained model, such as MMS 316M or XLS-R 316M, the training time is approximately 24 hours on a single A100. With CTC greedy search, inference on all of the dev/test sets required around 2.5-3 A100 hours and consumed at most 8GB of VRAM. Below are the scores of the MMS baseline on the dev/test sets.
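The weighted-sum step in the baseline can be sketched as follows. This is a minimal, pure-Python illustration of the idea (the function and variable names are ours, not from the ESPnet recipe): each frozen MMS layer output is combined using softmax-normalized scalar weights, which are learned jointly with the downstream Transformer encoder.

```python
import math

def weighted_layer_sum(layer_outputs, layer_logits):
    """Combine frozen-encoder layer outputs with learned softmax weights.

    layer_outputs: L feature sequences, each of shape [T][D] (nested lists)
    layer_logits:  L scalar weights (trainable parameters in practice)
    """
    # softmax over the per-layer scalar weights
    exps = [math.exp(w) for w in layer_logits]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # weighted sum across layers for every frame t and feature dim d
    T, D = len(layer_outputs[0]), len(layer_outputs[0][0])
    return [[sum(a * layer[t][d] for a, layer in zip(alphas, layer_outputs))
             for d in range(D)]
            for t in range(T)]
```

The combined features would then be passed to the 2-layer Transformer encoder trained with CTC; only the scalar weights and the encoder are updated, which is why the trainable parameter count stays small.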
Model | Standard CER | Standard LID | Worst 15 CER | CER StD | Dialect CER | Dialect LID |
---|---|---|---|---|---|---|
MMS 1B CTC | 24.0 | 74.0 | 71.1 | 25.5 | 32.7 | 54.0 |
We also plan to release a stronger baseline system based on OWSM. Please check back later for details.
Models will be evaluated on two test sets:
Inference and evaluation will be automatically performed on the server, so participants will not have access to the raw audio or the raw model outputs. The overall ranking will be determined by a submission's average ranking across six evaluation metrics:
CER is calculated using unpunctuated, upper-case text. Spaces are counted as characters, except for languages that do not use space-delimited word boundaries (Chinese, Japanese, Thai). Here is the code that we will use for the normalization:
import re  # Python regex
import string  # Python string manipulation library
import unicodedata  # unicode punctuation detection
from jiwer import cer  # CER implementation

def remove_punctuation(sentence):
    new_sentence = ""
    for char in sentence:
        # all unicode punctuation is of type P
        if unicodedata.category(char).startswith('P'):
            continue
        else:
            new_sentence = f"{new_sentence}{char}"
    return new_sentence

def normalize_and_calculate_cer(hyp: str, ref: str, remove_spaces: bool):
    # remove spaces for Chinese/Japanese/Thai
    if remove_spaces:
        hyp = re.sub(r"\s", "", hyp)
        ref = re.sub(r"\s", "", ref)

    # remove punctuation
    hyp = remove_punctuation(hyp)
    ref = remove_punctuation(ref)

    # upper case everything
    hyp = hyp.upper()
    ref = ref.upper()

    # jiwer expects the reference first, then the hypothesis
    return cer(ref, hyp)
The full implementation can be found here. For example, the sentence "I'll be going to the CMU campus." will be converted to "ILL BE GOING TO THE CMU CAMPUS". If we have a hypothesis from an ASR system that is "ill be going to the see them you campus", it will be converted to "ILL BE GOING TO THE SEE THEM YOU CAMPUS". Before normalization, the CER would be 46.85%. After normalization, it would be 33.33%.
For a language that does not use whitespaces, such as Chinese, an example normalization will convert "我想去餐厅 我非常饿" to "我想去餐厅我非常饿".
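The two examples above can be reproduced with a stdlib-only sketch of the normalization rules (punctuation stripped via Unicode category P*, uppercasing, and whitespace removal for non-space-delimited languages); the `normalize` function here is our own condensed version of the scoring code:

```python
import re
import unicodedata

def normalize(text: str, remove_spaces: bool = False) -> str:
    # drop all whitespace for Chinese/Japanese/Thai
    if remove_spaces:
        text = re.sub(r"\s", "", text)
    # strip everything Unicode classifies as punctuation (category P*)
    text = "".join(c for c in text
                   if not unicodedata.category(c).startswith("P"))
    return text.upper()

print(normalize("I'll be going to the CMU campus."))
# ILL BE GOING TO THE CMU CAMPUS
print(normalize("我想去餐厅 我非常饿", remove_spaces=True))
# 我想去餐厅我非常饿
```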
We will not ask participants to identify the dialect spoken in an utterance (e.g., AAVE vs. British English), but they are free to use that information in their systems internally.
If a tie-breaker is necessary, we will use the average of the raw metric scores.
The ranking methodology can be summarized as follows:
Here is an example ranking based on some baseline models that we trained using the ML-SUPERB 1-hour baseline dataset. Please note that this example is provided for demonstration purposes only.
Model | Standard CER | Standard LID | Worst 15 CER | CER StD | Dialect CER | Dialect LID |
---|---|---|---|---|---|---|
XEUS | 22.4 | 77.1 | 78.9 | 30.9 | 23.2 | 79.1 |
MMS 1B | 24.0 | 74.0 | 71.1 | 25.5 | 32.7 | 54.0 |
w2v-BERT 2.0 | 28.2 | 76.6 | 81.8 | 26.7 | 40.9 | 45.9 |
XLS-R 128 1B | 34.2 | 70.9 | 98.3 | 28.9 | 32.7 | 58.1 |
XLS-R 128 300M | 31.7 | 72.4 | 86.8 | 26.9 | 30.5 | 60.7 |
XLSR 53 | 38.3 | 63.6 | 93.3 | 26.9 | 23.3 | 71.7 |
WavLM | 47.2 | 55.9 | 131.8 | 36.4 | 27.2 | 77.8 |
Model | Standard CER | Standard LID | Worst 15 CER | CER StD | Dialect CER | Dialect LID | Avg. Ranking | Final Ranking |
---|---|---|---|---|---|---|---|---|
XEUS | 1 | 1 | 2 | 6 | 1 | 1 | 2.0 | 1 |
MMS 1B | 2 | 3 | 1 | 1 | 5 | 6 | 3.0 | 2 |
w2v-BERT 2.0 | 3 | 2 | 3 | 2 | 7 | 7 | 4.0 | 4 |
XLS-R 128 1B | 5 | 5 | 6 | 5 | 5 | 5 | 5.2 | 6 |
XLS-R 128 300M | 4 | 4 | 4 | 3 | 4 | 4 | 3.8 | 3 |
XLSR 53 | 6 | 6 | 5 | 3 | 2 | 3 | 4.2 | 5 |
WavLM | 7 | 7 | 7 | 7 | 3 | 2 | 5.5 | 7 |
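The ranking procedure behind the table above can be reproduced with a short script. This sketch (all names are ours) assigns competition-style ranks per metric, with ties sharing the best rank, as in the CER StD and Dialect CER columns, and then averages across the six metrics:

```python
# Scores from the example table; metric order:
# Std CER, Std LID, Worst-15 CER, CER StD, Dialect CER, Dialect LID
scores = {
    "XEUS":           [22.4, 77.1,  78.9, 30.9, 23.2, 79.1],
    "MMS 1B":         [24.0, 74.0,  71.1, 25.5, 32.7, 54.0],
    "w2v-BERT 2.0":   [28.2, 76.6,  81.8, 26.7, 40.9, 45.9],
    "XLS-R 128 1B":   [34.2, 70.9,  98.3, 28.9, 32.7, 58.1],
    "XLS-R 128 300M": [31.7, 72.4,  86.8, 26.9, 30.5, 60.7],
    "XLSR 53":        [38.3, 63.6,  93.3, 26.9, 23.3, 71.7],
    "WavLM":          [47.2, 55.9, 131.8, 36.4, 27.2, 77.8],
}
# lower is better for the CER metrics, higher is better for LID accuracy
higher_is_better = [False, True, False, False, False, True]

def competition_rank(values, higher):
    # rank = 1 + number of strictly better scores; ties share the best rank
    return [1 + sum((v > x) if higher else (v < x) for v in values)
            for x in values]

models = list(scores)
ranks = {m: [] for m in models}
for j, higher in enumerate(higher_is_better):
    col = [scores[m][j] for m in models]
    for m, r in zip(models, competition_rank(col, higher)):
        ranks[m].append(r)

avg_rank = {m: sum(r) / len(r) for m, r in ranks.items()}
final_order = sorted(models, key=avg_rank.get)
```

Running this yields an average rank of 2.0 for XEUS and 3.0 for MMS 1B, matching the table, with XEUS taking the top final ranking.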
Participants will upload their API definition and model weights to DynaBench, where inference and scoring will be performed automatically. There is a live leaderboard on DynaBench where we display the rankings of submissions. A tutorial on submitting to DynaBench can be found here. Example submission scripts can also be found here. If you have any questions regarding the submission procedure, please ask questions in our Slack workspace.
The leaderboard will be frozen for challenge ranking purposes on 2/12 at 11:59 AM AoE (12 hours before the Interspeech initial submission deadline). Inference must finish before the deadline for a submission to appear on the final leaderboard. We expect inference to take 8-24 hours, depending on the submitted model's size and architecture. For a 300M/1B CTC model, inference took 15 and 21 hours, respectively. Each team may submit multiple models. To rank teams against each other, we will only consider each team's best-performing model. Please make sure to include your team details (authors, team name) in the description section of each submission. Participants may continue to submit after the 2/12 deadline to perform additional ablations and experiments, but those submissions will not affect the final challenge ranking.
Importantly, we allow a maximum of three submissions to the leaderboard per day (this may change in the future depending on the server load). Please note that all model checkpoints will be deleted from the evaluation server after inference. We may use the transcriptions generated by the submitted systems for further analyses, but we will not release them without the authors' permission.
William Chen (CMU)
Jiatong Shi (CMU)
Shih-Heng Wang (CMU)
Shinji Watanabe (CMU)
Antonios Anastasopoulos (GMU)
Chutong Meng (GMU)
Martijn Bartelds (Stanford)
Dan Jurafsky (Stanford)
Hung-yi Lee (NTU)
Hsiu-Hsuan Wang (NTU)
Karen Livescu (TTIC)