ASRU 2015 Challenge tasks


Michiel Bacchiani, Pino Di Fabbrizio, Jason D. Williams

The 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) has added challenge tasks as a new component of the workshop program. The goal is to support and promote shared research problems of interest to the community by providing a public, peer-reviewed venue for reporting results. An open call for challenge tasks was issued in October 2014 and closed at the end of December 2014. Three challenge tasks were accepted.

Papers related to the challenges -- from participants and organizers -- will be submitted, reviewed, and evaluated in the same way as all ASRU papers. Challenge task organizers will be asked to suggest qualified reviewers for their challenge task. In all other respects, challenge papers will be handled in the same way as normal ASRU papers, by the ASRU technical program committee, who will assign reviewers and make acceptance decisions.

At ASRU 2015, there will be a special session for each challenge task, where accepted papers will be presented as posters (all accepted papers at ASRU are presented as posters). Challenge organizers may still invite rejected challenge papers for presentation only in the special session, subject to available space.

The challenge tasks themselves are operated by their respective organizers, independently of ASRU. Descriptions of each are given in the next three sections. For more information about these challenge tasks, contact their respective organizers.

ASRU 2015 will be held in Scottsdale, Arizona, USA, on 13-17 December 2015. For more information on ASRU, see

3rd CHiME Speech Separation and Recognition Challenge

Description from organizers Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe

The 3rd CHiME challenge considers the problem of distant microphone speech recognition in multisource noise environments. The target scenario is a mobile tablet device that captures the user's speech with an array of six microphones positioned around its frame.

In contrast to the earlier 2011 and 2013 CHiME challenges, the new challenge employs speech that has been spoken live in noisy environments, rather than potentially unrealistic artificial remixing. To capture the data, sentences have been presented on a tablet and read by US-accented speakers in four different target environments: cafes, street junctions, public buses, and pedestrian areas. By recording live speech, the challenge captures many extra sources of variability, including Lombard effects and variability due to tablet and speaker motion. For compatibility with the previous CHiME challenge and other recent noise-robust speech challenges, the task has been based on the WSJ 5k corpus. Separate sets of data will be provided for training, development testing, and evaluation testing. Each set will employ four different talkers and different instances of the target environments.

To help participants build their systems and to provide a common basis for comparison, the challenge will provide three baseline tools: a simulation tool for the generation of arbitrary amounts of additional training data, a speech enhancement baseline, and a state-of-the-art DNN-based decoder. Participants will be free to modify these tools or to use their own.

For more information, see

Slides: Download PPTX file

Automatic Speech recognition In Reverberant Environments (ASpIRE)

Organized by IARPA; description from Mary Harper, IARPA

The IARPA-sponsored ASpIRE (Automatic Speech recognition In Reverberant Environments) Challenge seeks to foster the development of innovative speech recognition systems that can be trained on conversational telephone speech, yet work well on far-field microphone data recorded in noisy, reverberant rooms. Challenge “Solvers” are given access to sample data against which they can assess the development of their algorithms; these data are different from the test set, but provide a good representation of microphone recordings in real rooms. Solvers will then have the opportunity to evaluate their techniques on a common and challenging test set that includes significant room noise and reverberation. There are two evaluation conditions:

  1. The Single Microphone (single-mic) Condition tests the ability to mitigate noise and reverberation in speech recorded in several rooms with a variety of microphones when provided a randomly-selected single microphone recording of each recorded conversation. The single microphone condition evaluation will run February 4-11, 2015.
  2. The Multiple Microphone (multi-mic) Condition tests the ability to mitigate noise and reverberation in speech recorded in several rooms with a variety of microphones when given several different microphone recordings of each recorded conversation. The multiple microphone condition evaluation will run February 12-19, 2015.

See for full details about the evaluation process, awards, and timeline.

Slides: Download PPTX file

The MGB Challenge - Recognition of Multi-Genre Broadcast Data

Description from organizers Steve Renals, Phil Woodland, Mark Gales, Pierre Lanchantin, Thomas Hain, Oscar Saz, Peter Bell, and Andrew McParland

The MGB Challenge is a core evaluation of speech recognition, speaker diarization, and lightly supervised alignment using BBC TV recordings. The data is broad and multi-genre, spanning the whole range of BBC TV output. The challenge will use a training set of about 1,600 hours of broadcast audio, together with several hundred million words of subtitle text for language modelling, all provided by the BBC. The MGB Challenge will explore speech recognition and speaker diarization in a longitudinal setting, i.e. transcription and speaker diarization and linking of several episodes of the same programme. The longitudinal tasks will also offer the opportunity for systems to make use of supplied metadata, including programme title, genre tag, and date/time of transmission.

The MGB Challenge will have four main evaluation conditions: (1) speech-to-text transcription of broadcast television; (2) alignment of broadcast audio to a subtitle file (which may be regarded as a lightly-supervised transcript); (3) longitudinal speech-to-text transcription of a sequence of episodes from the same series of programmes; and (4) longitudinal speaker diarization and linking, requiring the identification of common speakers across multiple recordings.

More details at the challenge website,

Slides: Download PDF file