Speaker Adaptation with MLLR Transformation

Unsupervised speaker adaptation for Sphinx4

For building an improved acoustic model there are two methods. One of them 
needs to collect data from a speaker and train the acoustic model set. Thus 
using the speaker's characteristics the recognition will be more accurately. 
The disadvantage of this method is that it needs a large amount of data to be 
collected to have a sufficient model accuracy.

The other method, when the amount of data available is small from a new 
speaker, is to collect them and by using an adaptation technique to adapt the 
model set to better fit the speaker's characteristics.

The adaptation technique used is MLLR (maximum likelihood linear regression) 
transform that is applied depending on the available data by generating one or 
more transformations that reduce the mismatch between
an initial model set and the adaptation data. There is only one transformation
when the amount of available data is too small and is called global adaptation 
transform. The global transform is applied to every Gaussian component in the 
model set. Otherwise, when the amount of adaptation data is large, the number
of transformations is increasing and each transformation is applied to a 
certain cluster of Gaussian components.

To be able to decode with an adapted model there are two important classes that 
should be imported:

import edu.cmu.sphinx.decoder.adaptation.Stats;
import edu.cmu.sphinx.decoder.adaptation.Transform;

Stats Class estimates a MLLR transform for each cluster of data and the 
transform will be applied to the corresponding cluster. You can choose the 
number of clusters by giving the number as argument to 
createStats(nrOfClusters) in Stats method. The method will return an object 
that contains the loaded acoustic model and the number of clusters. This 
important to collect counts from each Result object because based on them we 
will perform the estimation of the MLLR transformation. 

Before starting collect counts it is important to have all Gaussians clustered. 
So, createStats(nrOfClusters) will generate an ClusteredDensityFileData object 
to prepare the Gaussians. ClusteredDensityFileData class performs the clustering
using the "k-means" clustering algorithm. The k-means clustering algorithm aims 
to partition the Gaussians into k clusters in which each Gaussian belongs
to the cluster with the nearest mean. It is interesting to know that the problem
of clustering is computationally difficult, so the heuristic used is the 
Euclidean criterion.

The next step is to collect counts from each Result object and store them 
separately for each cluster. Here, the matrices regLs and regRs used in 
computing the transformation are filled. Transform class performs the actual 
transformation for each cluster. Given the counts previously gathered and the 
number of clusters, the class will compute the two matrices A (the 
transformation matrix) B (the bias vector) that are tied across the Gaussians 
from the corresponding cluster. A Transform object will contain all the 
transformations computed for an utterance. To use the adapted acoustic model it 
is necessary to update the Sphinx3Loader which is responsible for
loading the files from the model. When updating occurs, the acoustic model is 
already loaded, so setTransform(transform) method will replace the old means 
with the new ones.

Now, that we have the theoretical part, let’s see the practical part. Here is 
how you create and use a MLLR transformation:

Stats stats = recognizer.createStats(1);
recognizer.startRecognition(stream);
while ((result = recognizer.getResult()) != null) {
	stats.collect(result);
}
recognizer.stopRecognition();

// Transform represents the speech profile
Transform transform = stats.createTransform();
recognizer.setTransform(transform);

After setting the transformation to the StreamSpeechRecognizer object,
the recognizer is ready to decode using the new means. The process
of recognition is the same as you decode with the general acoustic model.
When you create and set a transformation is like you create a
new acoustic model with speaker's characteristics, thus the accuracy
will be better. 

For further decodings you can store the transformation of a speaker in a file
by performing store(“FilePath”, 0) in Transform object.

If you have your own transformation known as mllr_matrix previously generated 
with Sphinx4 or with another program, you can load the file by performing 
load(“FilePath”) in Transform object and then to set it to an Recognizer object.

