Over the next two months, two spoken language recognition marathon matches will launch on the Topcoder Platform. The first will be a ‘Mini Marathon Match,’ launching on June 12th, and the second will be a full ‘Marathon Match’ (specific dates to come soon). $20,000 in prizes is at stake, so may the best members win! In this blog, we encourage members to read the information below to help prepare for these matches.
Audio data from various sources such as telephone conversations, recordings of meetings, and newscasts is increasing daily in our digital world. One can extract and mine various features from audio signals and use those features for many purposes. In this series of challenges, we are looking to recognize languages from a known mix using 10-second samples of audio data per language.
Phonemes are the smallest units of speech in a language (for example, the long “a” sound). All words are composed of sets of phonemes. Phoneme-based approaches analyze and identify sounds in a piece of audio content to create a phoneme-based index, and then use a phoneme-based dictionary for analysis.
According to H. Li et al. (2013), “mel-frequency cepstral coefficients (MFCCs) are effective in most speech recognition tasks because they exploit auditory principles, whereby the mel-scale filter bank is a rough approximation to human auditory system’s response.”
We have worked with MFCCs in a small language recognition task. In it, we divided the speech into 25-millisecond frames with a 10-millisecond shift between successive frames, and multiplied each frame by a Hamming window in order to smooth discontinuities in the signal. From each of these frames, we extracted the MFCCs.
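The framing and windowing step described above can be sketched in a few lines of NumPy. This is a minimal illustration of the 25 ms frame / 10 ms shift setup, not the exact code we used:

```python
import numpy as np

def frame_signal(signal, rate, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into overlapping frames and apply a Hamming window.

    frame_ms / shift_ms follow the 25 ms frame, 10 ms shift setup above.
    """
    frame_len = int(rate * frame_ms / 1000)   # samples per frame
    shift = int(rate * shift_ms / 1000)       # samples between frame starts
    n_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hamming(frame_len)            # smooths the frame edges
    frames = np.stack([
        signal[i * shift : i * shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames

# One second of audio at 8 kHz: 200-sample frames every 80 samples.
sig = np.random.randn(8000)
frames = frame_signal(sig, 8000)
print(frames.shape)  # (98, 200)
```

Each row of `frames` is one windowed frame, ready for the spectral analysis described next.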
First, we compiled an initial class of features by comparing the short-time spectra after compressing them in a manner that re-bins and rescales the spectrogram to better mimic how humans perceive sound; the mel scale is one such rescaling. Integrating the power spectrum against a bank of filters generates these features. Taking the inverse Fourier transform after this mapping picks out harmonics of voiced signals and generates the MFCCs. These co-varying harmonics are a distinctive feature of the spectra. In taking the Fourier transform of the power spectra in this fashion, one hopes that the fundamental frequency of the voice box or other sound source might be identified and used as a feature. However, this technique cannot capture all harmonics: only those that are equally spaced on the mel scale will be detected, and most sound sources follow that pattern only approximately. A key relationship follows.
Mel(f) = 1127 ln(1 + f/700) (Jurafsky 2009)
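The mel mapping above, and its inverse (which is what places the filter-bank edges back on the Hz axis), are straightforward to compute directly:

```python
import math

def hz_to_mel(f):
    """Convert a frequency in Hz to mels, using the form above: 1127 ln(1 + f/700)."""
    return 1127.0 * math.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used to place mel-spaced filter-bank edges back in Hz."""
    return 700.0 * (math.exp(m / 1127.0) - 1.0)

# The constants are chosen so that 1000 Hz lands near 1000 mels.
print(round(hz_to_mel(1000), 1))
```

Filters spaced uniformly in mels via `mel_to_hz` are dense at low frequencies and sparse at high ones, matching the auditory motivation described above.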
Besides the cepstral coefficients, we also investigated using delta features. Delta features are the first and second time derivatives of the cepstral coefficients, capturing the change of the cepstral features over time. We hypothesized these would be useful in classifying language, since the pace of speech is an important factor in human language recognition. We can calculate these features as the central finite difference approximation of these derivatives.
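As one concrete sketch of this, the standard regression-style delta formula is a central finite-difference estimate over a window of ±N frames (this is an illustration of the technique, not our exact implementation):

```python
import numpy as np

def delta(feats, N=2):
    """First-order delta features: a central finite-difference estimate of the
    time derivative over +/- N neighboring frames.

    feats: (n_frames, n_ceps) array of cepstral coefficients.
    """
    padded = np.pad(feats, ((N, N), (0, 0)), mode="edge")  # replicate edges
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(feats, dtype=float)
    for t in range(feats.shape[0]):
        out[t] = sum(
            n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)
        ) / denom
    return out

# Toy cepstra rising linearly by 2 per frame -> interior deltas are exactly 2.
c = np.arange(20, dtype=float).reshape(10, 2)
d = delta(c)    # first derivative
dd = delta(d)   # delta-delta: apply the same operator twice
```

Second-order (delta-delta) features are obtained by applying the same operator to the deltas.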
Extracting MFCCs in Python Using Scikit and Bob
Here is an example of how to extract MFCCs in Python.
>>> import scipy.io.wavfile
>>> rate, signal = scipy.io.wavfile.read(str(wave_path))
>>> print rate
8000
>>> print signal
[  20   68   53 ..., -230   89  198]
Linear frequency cepstral coefficients (LFCCs) and MFCCs can be extracted from an audio signal using bob.ap.Ceps() in the Bob library. Several parameters can be specified by the user, typically in a configuration file. The following values are the defaults:
>>> win_length_ms = 20 # The window length of the cepstral analysis in milliseconds
>>> win_shift_ms = 10 # The window shift of the cepstral analysis in milliseconds
>>> n_filters = 24 # The number of filter bands
>>> n_ceps = 19 # The number of cepstral coefficients
>>> f_min = 0. # The minimal frequency of the filter bank
>>> f_max = 4000. # The maximal frequency of the filter bank
>>> delta_win = 2 # The integer delta value used for computing the first and second order derivatives
>>> pre_emphasis_coef = 0.97 # The coefficient used for the pre-emphasis
>>> dct_norm = True # Whether the DCT of the cepstral coefficients is normalized
>>> mel_scale = True # Tell whether cepstral features are extracted on a linear (LFCC) or Mel (MFCC) scale
Once the parameters are specified, bob.ap.Ceps() can be called as follows:
>>> c = bob.ap.Ceps(rate, win_length_ms, win_shift_ms, n_filters, n_ceps, f_min, f_max, delta_win, pre_emphasis_coef, mel_scale, dct_norm)
>>> import numpy
>>> signal = numpy.cast['float'](signal) # vector should be in **float**
>>> mfcc = c(signal)
>>> print len(mfcc)
It is also possible to compute first and second derivatives for those features:
>>> c.with_delta = True
>>> c.with_delta_delta = True
>>> lfcc_e_d_dd = c(signal)
>>> print len(lfcc_e_d_dd)
Spectral Energy Peak (SEP)
SEP is used in music identification systems. A SEP is a time-frequency point of higher amplitude than its neighboring points. SEPs are argued to be intrinsically robust to even high levels of background noise and can provide discrimination in sound mixtures. In the well-known Shazam system, the time-frequency coordinates of the energy peaks were described as sparse landmark points. By using pairs of landmark points rather than single points, the fingerprints exploited the spectral structure of sound sources.
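A simple way to find such landmark candidates is to treat every point that equals the maximum of its local neighborhood as a peak. The sketch below illustrates the idea on a toy spectrogram; it is a minimal example in the spirit of the landmark approach, not Shazam's actual implementation:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def spectral_peaks(spec, size=3):
    """Return (time, freq) indices of local maxima in a magnitude spectrogram.

    A point is a peak if it equals the maximum over its size x size
    neighborhood; low-energy flat regions are suppressed with a mean threshold.
    """
    local_max = maximum_filter(spec, size=size) == spec
    peaks = local_max & (spec > spec.mean())  # drop silent/flat regions
    return np.argwhere(peaks)

# Toy "spectrogram" with two injected energy peaks.
spec = np.zeros((8, 8))
spec[2, 3] = 5.0
spec[6, 1] = 7.0
print(spectral_peaks(spec))
```

A real system would then pair up nearby peaks into landmark fingerprints rather than use single points.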
Spectral Flatness Measure (SFM)
SFM, also known as Wiener entropy, relates to the tonality aspect of audio signals and is therefore often used to distinguish different recordings. Let us denote by s(n, f) an STFT coefficient of an audio signal at time frame index n and frequency bin index f, 1 ≤ f ≤ M. Let us also denote by b an auditory-motivated subband index, i.e., in either Mel, Bark, Log, or Cent scale, and by lb and hb the lower and upper edges of the b-th subband. SFM is computed at each time-frequency subband point (n, b) as

SFM(n, b) = [ ∏f=lb..hb |s(n, f)|² ]^(1/(hb − lb + 1)) / [ (1/(hb − lb + 1)) ∑f=lb..hb |s(n, f)|² ]

that is, the ratio of the geometric mean to the arithmetic mean of the signal power within the subband.
A high SFM indicates that signal power is distributed similarly across all frequencies in the subband, while a low SFM means that signal power is concentrated in a relatively small number of frequency bins within the subband.
We built a feed-forward neural network with three layers of neurons: the input layer (IL), the hidden layer (HL), and the output layer (OL). The features calculated above represented the audio data.
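A three-layer feed-forward classifier of this shape can be sketched with scikit-learn. Everything here is a hypothetical stand-in: the random features, the 57-dimensional vector size, the hidden-layer width, and the three-language label set are assumptions for illustration, not our actual training setup.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Hypothetical data: each row stands in for one utterance's feature vector
# (e.g. 19 cepstra + deltas + delta-deltas = 57 dims); labels are language ids.
rng = np.random.RandomState(0)
X = rng.randn(200, 57)
y = rng.randint(0, 3, size=200)   # three candidate languages

# One hidden layer gives the IL -> HL -> OL structure described above.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X, y)
preds = clf.predict(X[:5])
print(preds.shape)  # (5,)
```

In practice the inputs would be the MFCC, delta, SEP, and SFM features extracted earlier, and the output layer would have one neuron per candidate language.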