Welcome to the Interspeech 2025 ML-SUPERB Challenge!

Pushing the state-of-the-art in multilingual ASR and LID

We present the ML-SUPERB 2.0 Interspeech 2025 Challenge, which aims to develop state-of-the-art Automatic Speech Recognition (ASR) and Language Identification (LID) models that are more inclusive of the world's languages and language varieties. Building upon previous SUPERB challenges, which have traditionally focused on evaluating self-supervised speech representations, this year's challenge aims to improve the performance and robustness of multilingual ASR and LID systems. In particular, the original ML-SUPERB Benchmark and the ML-SUPERB 2023 Challenge encouraged participants to develop and benchmark state-of-the-art speech representations across various languages. In contrast, the ML-SUPERB 2.0 Challenge encourages the development of well-performing multilingual ASR and LID systems that can handle a diverse range of languages and dialects. The design of the challenge aligns with the objectives of the ML-SUPERB 2.0 benchmark, which encourages research on methods that leave no language "behind".

We also encourage participants to check out the Faetar Challenge, which focuses on a specific application of low-resource/dialectal ASR.

In summary, the challenge:

  • Has the goal of developing SOTA systems for all languages and language varieties.
  • Focuses on two tasks, namely multilingual LID and ASR.
  • Evaluates on 154 unique languages and 200+ language varieties.
  • Has a single track, evaluated by the average ranking across the following metrics:
    • Average LID accuracy across all 154 languages.
    • Average Character Error Rate (CER) across all 154 languages.
    • Average CER across the 15 worst-performing languages in the set of 154 languages.
    • Standard Deviation of CER computed for each of the 154 languages.
    • Average LID accuracy across the evaluated language varieties.
    • Average CER across the evaluated language varieties.
  • Allows using any resource for data (both supervised and unsupervised).
  • Allows almost any pre-trained model (see below).
  • Has an online leaderboard and evaluation server (more information is given below).

Slack workspace

Please join our Slack workspace for further discussions!

Task

We invite participants to develop systems that both identify the language of a spoken utterance (LID) and transcribe the content of the speech into text (ASR). The list of languages that need to be supported can be found here. Note that for some utterances we will also provide the identity of the language being spoken; this is only for running analyses on the effects of LID errors and will not affect scoring. As such, participants will provide the following API() function as their submission.

def API(waveform, true_lid=None):
    """
    Args:
        waveform (np.array): 16kHz 1D speech waveform
        true_lid (str): ISO3 code of true language id, or None

    Returns:
        pred_lid (str): ISO3 code of predicted language id
        pred_asr (str): predicted transcript in all caps, no punctuation
    """

    # the case where we do not know the language being spoken
    if true_lid is None:
        pred_lid = ...  # TODO: call your LID model here
        pred_asr = ...  # TODO: call your ASR model here
    # we know the language being spoken
    else:
        pred_lid = true_lid
        pred_asr = ...  # TODO: call your ASR model here

    return pred_lid, pred_asr

Participants will fill in the API function to call their model(s) for LID prediction and/or ASR. While the LID and ASR models can be completely independent, the LID prediction must be in lowercase and formatted using the ISO 639-3 language code enclosed in square brackets (e.g., [eng] for English).
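For illustration, a filled-in submission could be exercised locally along the following lines. This is a hypothetical driver, not the official evaluation harness, and the bracketed format of true_lid is an assumption based on the example above.

import numpy as np

waveform = np.zeros(16000, dtype=np.float32)  # placeholder: 1 second of 16 kHz audio

# Case 1: the language is unknown, so the system predicts both LID and ASR.
pred_lid, pred_asr = API(waveform)

# Case 2: the true language is provided (bracketed ISO 639-3 code assumed),
# so the system should return it unchanged and only transcribe.
pred_lid, pred_asr = API(waveform, true_lid="[eng]")

# Expected output format checks:
assert pred_lid == pred_lid.lower() and pred_lid.startswith("[") and pred_lid.endswith("]")  # e.g. "[eng]"
assert pred_asr == pred_asr.upper()  # all caps, no punctuation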

Data

Participants are allowed to use any existing or collected dataset, as long as they report where/how the data was collected.
Participants are not allowed to download the test data from the evaluation server.

We provide the ML-SUPERB 2.0 public set as a baseline dataset for both training and development (recommended but not required).
Data links: Google Drive, Huggingface

This baseline dataset is sourced from a variety of multilingual speech corpora (e.g., FLEURS, Common Voice, and Multilingual LibriSpeech), and it covers 141 of the required 153 languages:

  • Training subset: 1 hour of transcribed speech per language.
  • Development subset: 10 minutes of transcribed speech per language, suitable for evaluating performance on standard multilingual metrics (i.e., the first four metrics listed at the top of this page).

We also release a development set for various language varieties, which covers 56 dialects and accents (recommended but not required). Data links: Google Drive and Huggingface

Please find detailed descriptions for the data, especially the development data, in the data description page.

Important notes:

  • We remove all Norwegian data ([nno], [nob], [nor]) from evaluation.
  • We merge Oriya and Odia data together. Please use [ori] for both.

Acknowledgement: we thank Tanel Alumäe for pointing out important issues that led to the changes mentioned above.

In summary:

  • Training data:
    • We provide ML-SUPERB 2.0 training subset (recommended but not required).
    • Participants may also use any other existing dataset or collected dataset, as long as they clearly report their sources (some useful resources are listed here).
  • Development data:
    • Participants can use the ML-SUPERB 2.0 development set to measure performance (recommended but not required).
    • Participants can also use a development set for various language varieties (recommended but not required).
  • More details are presented in the data description page.

Models

Participants are allowed to use any pre-trained model (such as LLMs) or modelling technique they desire. The only restrictions (a rough local sanity check is sketched after the list below) are that models:

  • Can perform inference without an internet connection.
  • Can perform inference within the server's GPU VRAM (~24 GB).
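As a rough way to check these constraints locally (assuming a PyTorch model and, optionally, Hugging Face checkpoints), something like the following could be used; it is illustrative only and not part of the official submission pipeline.

import os
import torch

# Fail fast if inference tries to reach the internet (Hugging Face-specific switches).
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

def check_vram_budget(run_inference, budget_gb=24.0):
    """Run one inference call and compare peak GPU memory against the ~24 GB budget."""
    torch.cuda.reset_peak_memory_stats()
    run_inference()  # e.g. lambda: API(waveform)
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    assert peak_gb < budget_gb, f"peak {peak_gb:.1f} GB exceeds the {budget_gb} GB budget"
    return peak_gb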

Baseline model

We provide a baseline model and training recipe to aid model development. The baseline model is an MMS model with 1 billion parameters, fine-tuned on the ML-SUPERB 2.0 training set using ESPnet. The MMS model is frozen during fine-tuning, and a weighted sum of its layers is fed into a simple 2-layer Transformer encoder, which is trained using the CTC loss. The total number of trainable parameters is 6.36M. Details about the training configuration can be found here. The total training time is around 48 hours on a single H100 80GB GPU and 96 hours on a single A100 40GB GPU. If you use a smaller pre-trained model, such as MMS 316M or XLS-R 316M, the training time is approximately 24 hours on a single A100. With CTC greedy search, inference on all of the dev/test sets required around 2.5-3 A100 hours and consumed at most 8GB of VRAM. Below are the scores of the MMS baseline on the dev/test sets; a minimal sketch of the trainable head follows the table.

Model | Standard CER | Standard LID | Worst 15 CER | CER StD | Dialect CER | Dialect LID
----- | ------------ | ------------ | ------------ | ------- | ----------- | -----------
MMS 1B CTC | 24.0 | 74.0 | 71.1 | 25.5 | 32.7 | 54.0
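For concreteness, here is a minimal PyTorch sketch of the baseline head described above: learned layer weights over the frozen SSL encoder's hidden states, a 2-layer Transformer encoder, and a CTC output layer. The hyperparameters (model dimension, number of attention heads) and the interface to the SSL encoder are assumptions; the official ESPnet recipe may differ.

import torch
import torch.nn as nn

class WeightedSumCTCHead(nn.Module):
    """Trainable head on top of a frozen SSL encoder (e.g. MMS 1B)."""
    def __init__(self, num_layers, feat_dim, vocab_size, d_model=256):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))  # learned per-layer weights
        self.proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # simple 2-layer Transformer
        self.ctc_out = nn.Linear(d_model, vocab_size)  # CTC output layer (vocab includes the blank)

    def forward(self, ssl_hidden_states):
        # ssl_hidden_states: list of (batch, time, feat_dim) tensors, one per frozen SSL layer
        stacked = torch.stack(ssl_hidden_states, dim=0)           # (layers, batch, time, feat_dim)
        weights = torch.softmax(self.layer_weights, dim=0)
        fused = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # weighted sum over layers
        x = self.encoder(self.proj(fused))
        return self.ctc_out(x).log_softmax(dim=-1)                # log-probs for nn.CTCLoss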

We also plan to release a stronger baseline system based on OWSM. Please check back later for details.

Evaluation and Scoring

Models will be evaluated on two test sets:

  • A "standard" multilingual ASR test set.
  • A multilingual ASR test set with data from various language varieties.

Inference and evaluation will be performed automatically on the server, so participants will not have access to either the raw audio or the raw model outputs. The overall ranking will be determined by a submission's average ranking across six evaluation metrics (a sketch of how the standard metrics are aggregated follows the list):

  • Average LID accuracy across all 154 languages.
  • Average Character Error Rate (CER) across all 154 languages.
  • Average CER across the 15 worst-performing languages in the set of 154 languages.
  • Standard Deviation of CER computed for each of the 154 languages.
  • Average LID accuracy across the evaluated language varieties.
  • Average CER across the evaluated language varieties.
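For illustration, the first four ("standard") metrics above could be aggregated from per-language scores roughly as follows. This is a sketch only (function and variable names are ours), and the official scoring code on the evaluation server may differ, e.g. in how the standard deviation is computed.

import statistics

def aggregate_standard_metrics(per_language_cer, per_language_lid_acc, worst_k=15):
    """per_language_cer / per_language_lid_acc: dicts mapping language code to score."""
    cers = list(per_language_cer.values())
    avg_lid_acc = sum(per_language_lid_acc.values()) / len(per_language_lid_acc)  # average LID accuracy
    avg_cer = sum(cers) / len(cers)                                               # average CER
    worst_cer = sum(sorted(cers, reverse=True)[:worst_k]) / worst_k               # mean CER of the 15 worst languages
    cer_std = statistics.stdev(cers)                                              # spread of CER across languages
    return avg_lid_acc, avg_cer, worst_cer, cer_std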

CER is calculated using unpunctuated upper-case text. Spaces are counted, except for languages that do not use space-delimited word boundaries (Chinese, Japanese, Thai). Here is the code that we will use for the normalization:

import re  # Python regex
import string  # Python string manipulation library
import unicodedata  # Unicode punctuation detection
from jiwer import cer  # CER implementation

def remove_punctuation(sentence):
    new_sentence = ""
    for char in sentence:
        # all Unicode punctuation is of category type P
        if unicodedata.category(char).startswith('P'):
            continue
        new_sentence = f"{new_sentence}{char}"
    return new_sentence

def normalize_and_calculate_cer(hyp: str, ref: str, remove_spaces: bool):
    # remove spaces for Chinese/Japanese/Thai
    if remove_spaces:
        hyp = re.sub(r"\s", "", hyp)
        ref = re.sub(r"\s", "", ref)

    # remove punctuation
    hyp = remove_punctuation(hyp)
    ref = remove_punctuation(ref)

    # upper-case everything
    hyp = hyp.upper()
    ref = ref.upper()

    # jiwer expects the reference first, then the hypothesis
    return cer(ref, hyp)

The full implementation can be found here. For example, the sentence "I'll be going to the CMU campus." will be converted to "ILL BE GOING TO THE CMU CAMPUS". If we have a hypothesis from an ASR system that is "ill be going to the see them you campus", it will be converted to "ILL BE GOING TO THE SEE THEM YOU CAMPUS". Before normalization, the CER would be 46.85%. After normalization, it would be 33.33%.

For a language that does not use space-delimited word boundaries, such as Chinese, normalization will, for example, convert "我想去餐厅 我非常饿" to "我想去餐厅我非常饿".
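As a small usage sketch of the normalization function above (the expected value for the English pair is taken from the description; the second call simply illustrates the remove_spaces flag):

# English example: spaces are kept (remove_spaces=False)
ref = "I'll be going to the CMU campus."
hyp = "ill be going to the see them you campus"
print(normalize_and_calculate_cer(hyp, ref, remove_spaces=False))  # ~0.3333, i.e. 33.33%

# Chinese example: spaces are removed before scoring (remove_spaces=True)
print(normalize_and_calculate_cer("我想去餐厅 我非常饿", "我想去餐厅 我非常饿", remove_spaces=True))  # 0.0 for identical text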

We will not ask participants to identify the dialect spoken in an utterance (e.g., AAVE vs. British English), but they are free to use that information internally in their systems.

If a tie-breaker is necessary, we will use the average of the raw metric scores.

The ranking methodology can be summarized as follows (a minimal code sketch is given after the list):

  1. Calculate the rankings for each model on each individual metric.
  2. Calculate the average ranking for each model across metrics.
  3. Calculate the average of all raw metric scores as a tie-breaker (using 1 - LID accuracy where necessary, so that lower is always better).
  4. Rank models by their average ranking.
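Here is a minimal sketch of that procedure; it is illustrative only, and the tie handling for equal metric values (method="min") is an assumption based on the example ranking below.

from scipy.stats import rankdata  # ranking with explicit tie handling

def rank_models(scores):
    """scores[model] = [standard CER, 1 - standard LID acc, worst-15 CER,
    CER std dev, dialect CER, 1 - dialect LID acc]; lower is better everywhere."""
    models = list(scores)
    num_metrics = len(next(iter(scores.values())))
    # 1. rank every model on each individual metric
    per_metric = [rankdata([scores[m][i] for m in models], method="min")
                  for i in range(num_metrics)]
    # 2. average ranking across metrics
    avg_rank = {m: sum(r[j] for r in per_metric) / num_metrics
                for j, m in enumerate(models)}
    # 3. average of raw metric scores as the tie-breaker
    avg_raw = {m: sum(scores[m]) / num_metrics for m in models}
    # 4. final ranking: sort by average ranking, then by average raw score
    return sorted(models, key=lambda m: (avg_rank[m], avg_raw[m]))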

Example Ranking

Here is an example ranking based on some baseline models that we trained using the ML-SUPERB 1-hour baseline dataset. Please note that this example is provided for demonstration purposes only.

Raw Scores

Model | Standard CER | Standard LID | Worst 15 CER | CER StD | Dialect CER | Dialect LID
----- | ------------ | ------------ | ------------ | ------- | ----------- | -----------
XEUS | 22.4 | 77.1 | 78.9 | 30.9 | 23.2 | 79.1
MMS 1B | 24.0 | 74.0 | 71.1 | 25.5 | 32.7 | 54.0
w2v-BERT 2.0 | 28.2 | 76.6 | 81.8 | 26.7 | 40.9 | 45.9
XLS-R 128 1B | 34.2 | 70.9 | 98.3 | 28.9 | 32.7 | 58.1
XLS-R 128 300M | 31.7 | 72.4 | 86.8 | 26.9 | 30.5 | 60.7
XLSR 53 | 38.3 | 63.6 | 93.3 | 26.9 | 23.3 | 71.7
WavLM | 47.2 | 55.9 | 131.8 | 36.4 | 27.2 | 77.8

Ranking

Model | Standard CER | Standard LID | Worst 15 CER | CER StD | Dialect CER | Dialect LID | Avg. Ranking | Final Ranking
----- | ------------ | ------------ | ------------ | ------- | ----------- | ----------- | ------------ | -------------
XEUS | 1 | 1 | 2 | 6 | 1 | 1 | 2.0 | 1
MMS 1B | 2 | 3 | 1 | 1 | 5 | 6 | 3.0 | 2
w2v-BERT 2.0 | 3 | 2 | 3 | 2 | 7 | 7 | 4.0 | 4
XLS-R 128 1B | 5 | 5 | 6 | 5 | 5 | 5 | 5.1 | 6
XLS-R 128 300M | 4 | 4 | 4 | 3 | 4 | 4 | 3.8 | 3
XLSR 53 | 6 | 6 | 5 | 3 | 2 | 3 | 4.1 | 5
WavLM | 7 | 7 | 7 | 7 | 3 | 2 | 5.5 | 7

Submission and leaderboard

Participants will upload their API definition and model weights to DynaBench, where inference and scoring will be performed automatically. There is a live leaderboard on DynaBench where we display the rankings of submissions. A tutorial on submitting to DynaBench can be found here. Example submission scripts can also be found here. If you have any questions regarding the submission procedure, please ask questions in our Slack workspace.

The leaderboard will be frozen for challenge ranking purposes on 2/12 at 11:59 AM AoE (12 hours before the Interspeech initial submission deadline). Inference must finish before the deadline for a submission to appear in the final leaderboard. We expect inference to take 8-24 hours, depending on the submitted model's size and architecture. For a 300M/1B CTC model, inference took 15/21 hours, respectively. Each team may submit multiple models. To rank teams against each other, we will only consider each team's best-performing model. Please make sure to include your team details (authors, team name) in the description section of each submission. Participants may continue to submit after the 2/12 deadline to perform additional ablations and experiments, but those submissions will not affect the final challenge ranking.

Importantly, we allow a maximum of three submissions to the leaderboard per day (this may change in the future depending on the server load). Please note that all model checkpoints will be deleted from the evaluation server after inference. We may use the transcriptions generated by the submitted systems for further analyses, but we will not release them without the authors' permission.

Helpful Resources

Rules

  • Inference and scoring must be performed on the evaluation server.
  • Do not attempt to download the test data (or its derivatives) from the evaluation server.
  • Model inference must be performed without connection to the internet.
  • Model inference must be able to be performed within the GPU memory constraints of the server.
  • There are no constraints on inference runtime, but we reserve the right to terminate inference jobs that run excessively long.
  • Multiple submissions to the evaluation server are allowed.
  • Only one submission from each team will be considered for the final ranking.
  • There is a maximum of three submissions to the leaderboard per day. This is subject to change depending on server load.
  • Individuals can be a part of multiple teams.
  • Organizers and technical committee members are allowed to participate in the challenge.

Organizers

William Chen (CMU)

Jiatong Shi (CMU)

Shih-Heng Wang (CMU)

Shinji Watanabe (CMU)

Antonios Anastasopoulos (GMU)

Chutong Meng (GMU)

Martijn Bartelds (Stanford)

Dan Jurafsky (Stanford)

Hung-yi Lee (NTU)

Hsiu-Hsuan Wang (NTU)

Karen Livescu (TTIC)

Contact

superb.announcement@gmail.com