Overview
The data provided consists of a training set, a development test set, and a (final) evaluation test set. The evaluation test set will be made available on Nov 5, 2013. Before the distribution of the evaluation test set, participants can develop their systems based on the training and development test sets. The development test set and the final evaluation test set each consist of two different parts, namely simulated data (SimData) and real recordings (RealData), as follows:
- SimData: utterances from the WSJCAM0 corpus [1], which are convolved with room impulse responses (RIRs) measured in different rooms. Recorded background noise is added to the reverberant test data at a fixed signal-to-noise ratio (SNR).
- RealData: utterances from the MC-WSJ-AV corpus [2], which consists of utterances recorded in a noisy and reverberant room.
The SimData test set covers a broad range of reverberation conditions so that approaches can be evaluated under different levels of reverberation. The RealData test set aims at evaluating the robustness of the approaches against variations that cannot be reproduced by simulation for a given reverberation condition. The structure of the evaluation test set will be similar to that of the development test set. The training set consists of the clean WSJCAM0 training set and a multi-condition training set, which is generated from the clean WSJCAM0 training data by convolving the clean utterances with measured room impulse responses and adding recorded background noise. All reverberant utterances will be provided as 1-channel, 2-channel, and 8-channel recordings. The challenge consists of two tasks: one for speech enhancement (SE) and the other for automatic speech recognition (ASR). Each participant can choose to take part in either or both of the tasks. Further information on the data is provided below.
[1] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals, "WSJCAM0: A British English speech corpus for large vocabulary continuous speech recognition," in Proc. ICASSP, 1995, pp. 81-84.
[2] M. Lincoln, I. McCowan, J. Vepa, and H. K. Maganti, "The multi-channel Wall Street Journal audio visual corpus (MC-WSJ-AV): Specification and initial experiments," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2005.
Details on challenge data
The main purpose of SimData is to provide accurate performance measurements of the participants' methods under various reverberation conditions by comparing reference clean speech with processed/observed speech. The main purpose of RealData, on the other hand, is to evaluate the performance of the participants' methods in realistic environments and to demonstrate their practicality. Note that the text prompts of the utterances used in SimData and RealData are the same, but the utterances are spoken by different speakers. Participants can therefore use the same language and acoustic models for both SimData and RealData, and comparing the results on SimData and RealData may provide novel insights.
SimData contains a set of reverberant speech signals that are artificially generated by convolving clean speech signals with measured room impulse responses (RIRs) and subsequently adding measured noise signals. It simulates 6 different reverberation conditions: 3 rooms of different sizes (small, medium, and large) and 2 distances between the speaker and a microphone array (near = 50 cm and far = 200 cm). The RIRs were measured in the 3 rooms with an 8-channel circular array with a diameter of 20 cm, as shown in the figure. The array is equipped with omni-directional microphones. Stationary background noise, caused mainly by the air conditioning systems in the rooms, was measured under the same conditions with the same arrays as used for the RIR measurements.
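As a rough illustration of this simulation process, the sketch below convolves a single-channel clean utterance with an 8-channel measured RIR. It is a minimal sketch, not the official generation scripts; the file names and data layout are assumptions for illustration. One way to set the level of the added noise is sketched after the detail list further below.

```python
# Minimal sketch (not the official scripts): simulate multichannel reverberant
# speech by convolving a clean utterance with a measured 8-channel RIR.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

fs, clean = wavfile.read("clean_utterance.wav")    # clean WSJCAM0 utterance (hypothetical file)
_, rir = wavfile.read("rir_room1_far_8ch.wav")     # measured RIR, shape (samples, 8) (hypothetical file)
clean = clean.astype(np.float64)
rir = rir.astype(np.float64)

# Convolve the clean signal with each RIR channel to obtain 8-channel reverberant speech.
reverberant = np.stack(
    [fftconvolve(clean, rir[:, ch]) for ch in range(rir.shape[1])],
    axis=1,
)
# Recorded background noise would then be added at a fixed SNR (see the sketch
# after the detail list below for one way to compute the noise level).
```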
RealData contains a set of real recordings made in a reverberant meeting room that is different from the rooms used for SimData. It covers 2 reverberation conditions: 1 room and 2 distances between the speaker and a microphone array (near = approx. 100 cm and far = approx. 250 cm). Each speaker uttered WSJCAM0 prompts from fixed positions in the room and stayed still while uttering a sentence. The recordings were made with an array that has the same geometry as the one used for SimData, and they contain a certain amount of stationary ambient noise.
Fig. Microphone array used for measuring RIRs
Further details of the data are summarized below.
- For the SimData, noise is added to the reverberant speech at an SNR of 20 dB, where "S" stands for the energy of the direct signal plus early reflections up to 50 ms, and "N" stands for the energy of the additive noise component (see the sketch after this list).
- The reverberation times (T60) of the small, medium, and large rooms are about 0.25 s, 0.5 s, and 0.7 s, respectively. The meeting room used for the RealData recordings has a reverberation time of about 0.7 s.
- For the SimData and RealData, it can be assumed that the speakers stay still within an utterance (i.e., no drastic change in the RIR within an utterance).
- For the SimData and RealData, it can be assumed that the speakers stay in the same room for each test condition (*1). However, within each condition, the relative speaker-microphone position changes from utterance to utterance, i.e., the RIR for the 1st channel changes slightly for each utterance.
(*1) A test condition refers to 1 of the 8 conditions (room, speaker-microphone distance (near or far)) employed in the challenge: 2 conditions for RealData and 6 conditions for SimData.
- The recording rooms used for SimData, RealData, and the multi-condition training data are all different.
- For the SimData, the signal in the reference channel (the files listed in task files "*_A") is time-aligned with the corresponding clean speech, and the provided SE evaluation tools rely on this assumption. Participants who evaluate their SE algorithms should therefore generate output speech that is aligned with the reference channel.
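The first bullet above defines how "S" and "N" enter the 20 dB SNR. As a hedged sketch, assuming the RIR starts at the direct sound and the provided noise recording is at least as long as the reverberant signal, the gain applied to the noise could be computed as follows; all names are illustrative, and the official multi-condition generation scripts remain the authoritative reference.

```python
# Hedged sketch of the SNR definition above: "S" = energy of the direct signal
# plus early reflections (RIR truncated at 50 ms), "N" = energy of the added noise.
import numpy as np
from scipy.signal import fftconvolve

def noise_gain(clean, rir_ref, noise, fs, snr_db=20.0, early_ms=50.0):
    """Return the gain g such that 10*log10(S / (g**2 * N)) == snr_db."""
    early_rir = rir_ref[: int(early_ms * 1e-3 * fs)]   # direct sound + reflections up to 50 ms (assumes the RIR starts at the direct path)
    s = np.sum(fftconvolve(clean, early_rir) ** 2)     # "S": energy of direct signal + early reflections
    n = np.sum(noise ** 2)                             # "N": energy of the noise segment
    return np.sqrt(s / (n * 10.0 ** (snr_db / 10.0)))

# Example usage with the hypothetical variables from the previous sketch:
# noise_seg = noise_recording[: len(reverberant)]
# noisy = reverberant + noise_gain(clean, rir[:, 0], noise_seg[:, 0], fs) * noise_seg
```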
Here are some examples of the challenge data.
- SimData (small-size meeting room - Room 1, far condition)
- SimData (medium-size meeting room - Room 2, near condition)
- SimData (large-size meeting room - Room 3, far condition)
- RealData (large-size meeting room, far condition)
Obtaining challenge data
The SimData, RealData, and clean training data will be provided to participants through the LDC (Linguistic Data Consortium) free of charge, provided that the data are used only for this challenge. If you are interested in obtaining the data and participating in the challenge, please visit the Download page, follow the instructions therein, and contact us. We will then send you a data agreement form, which you have to sign and return to the LDC. Once the LDC receives the signed agreement form, they will provide you with download links for the challenge data.
Since it is not obvious from the data structure which data in SimData correspond to which of the 6 test conditions, we will distribute file lists, i.e., task files, that provide this information. We will also distribute task files for RealData that indicate each of the 2 conditions in this dataset. The task files are contained in both the ASR baseline system and the SE evaluation tool (they are common to the SE and ASR tasks). Please use the distributed task files contained in the evaluation tools for processing and evaluating the data corresponding to each test condition. For more details on the task files, please see the readme files in the evaluation tools.
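For illustration only, the loop below shows how a task file might drive processing of all utterances for one test condition. The assumption that a task file is a plain-text list of WAV paths, one per line, as well as all path names, is hypothetical; the readme files in the evaluation tools define the actual format.

```python
# Illustrative only: iterate over the utterances listed in one task file.
from pathlib import Path

DATA_ROOT = Path("/path/to/REVERB_challenge_data")     # hypothetical data location
TASK_FILE = Path("taskfiles/simdata_room1_far_A.lst")  # hypothetical task-file name and format

with open(TASK_FILE) as f:
    utterances = [line.strip() for line in f if line.strip()]

for utt in utterances:
    wav_path = DATA_ROOT / utt
    # Load wav_path, run enhancement/recognition, and write the output under the
    # same relative path so that the evaluation tools can locate it.
    print(wav_path)
```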
For the multi-condition training set, we will provide participants with room impulse responses (RIRs), noise data, and MATLAB scripts for generating noisy reverberant speech, so that they can generate the multi-condition training data from the WSJCAM0 data obtained from the LDC. The scripts and the other items required to generate the multi-condition training data are available for download in the Download section.
Please refer to the following table, which summarizes where each dataset is distributed from or how it is obtained.
Table: Distribution source of each dataset.

| Data | From where | Remark |
| --- | --- | --- |
| SimData | From LDC* | Based on WSJCAM0, noisy reverberant |
| RealData | From LDC* | Based on MC-WSJ-AV |
| Clean training data | From LDC* | Based on WSJCAM0, clean |
| Multi-condition training data | Generated using scripts distributed from the Download section | Based on WSJCAM0, noisy reverberant |
Note that we will not accept data requests after Dec. 1, 2013.