mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dhruv Kumar (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
Date Mon, 11 Jul 2011 14:48:00 GMT


Dhruv Kumar commented on MAHOUT-627:

Uploaded a new patch after a week's worth of testing:

- Bug fixes for a few corner cases
- Refactoring of the BaumWelchUtils and BaumWelchMapper class.
- Added verbose loggers for debugging.

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -----------------------------------------------------------------------------
>                 Key: MAHOUT-627
>                 URL:
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.4, 0.5
>            Reporter: Dhruv Kumar
>            Assignee: Grant Ingersoll
>              Labels: gsoc, gsoc2011, mahout-gsoc-11
>             Fix For: 0.6
>         Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
> Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

> Student Name: Dhruv Kumar 
> Student E-mail: 
> Organization/Project: Apache Mahout 
> Assigned Mentor: 
> Proposal Abstract: 
> The Baum-Welch algorithm is commonly used for training a Hidden Markov Model because
of its superior numerical stability and its ability to guarantee the discovery of a locally
maximum,  Maximum Likelihood Estimator, in the presence of incomplete training data. Currently,
Apache Mahout has a sequential implementation of the Baum-Welch which cannot be scaled to
train over large data sets. This restriction reduces the quality of training and constrains
generalization of the learned model when used for prediction. This project proposes to extend
Mahout's Baum-Welch to a parallel, distributed version using the Map-Reduce programming framework
for enhanced model fitting over large data sets. 
> Detailed Description: 
> Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications
generating temporal or spatial sequential data. Relative simplicity of implementation, combined
with their ability to discover latent domain knowledge have made them very popular in diverse
fields such as DNA sequence alignment, gene discovery, handwriting analysis, voice recognition,
computer vision, language translation and parts-of-speech tagging. 
> A HMM is defined as a tuple (S, O, Theta) where S is a finite set of unobservable, hidden
states emitting symbols from a finite observable vocabulary set O according to a probabilistic
model Theta. The parameters of the model Theta are defined by the tuple (A, B, Pi) where A
is a stochastic transition matrix of the hidden states of size |S| X |S|. The elements a_(i,j)
of A specify the probability of transitioning from a state i to state j. Matrix B is a size
|S| X |O| stochastic symbol emission matrix whose elements b_(s, o) provide the probability
that a symbol o will be emitted from the hidden state s. The elements pi_(s) of the |S| length
vector Pi determine the probability that the system starts in the hidden state s. The transitions
of hidden states are unobservable and follow the Markov property of memorylessness. 
> Rabiner [1] defined three main problems for HMMs: 
> 1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation
sequence, determine the probability that the model generated the observed sequence. This is
useful for evaluating the quality of the model and is solved using the so called Forward algorithm.

> 2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine
the hidden state sequence which generated the observed sequence. This can be viewed as an
inference problem where the model and observed sequence are used to predict the value of the
unobservable random variables. The backward algorithm, also known as the Viterbi decoding
algorithm is used for predicting the hidden state sequence. 
> 3. Training: Given the set of hidden states S, the set of observation vocabulary O and
the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem
can be viewed as a statistical machine learning problem of model fitting to a large set of
training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm)
and the Viterbi training algorithm are commonly used for model fitting. 
> In general, the quality of HMM training can be improved by employing large training vectors
but currently, Mahout only supports sequential versions of HMM trainers which are incapable
of scaling.  Among the Viterbi and the Baum-Welch training methods, the Baum-Welch algorithm
is superior, accurate, and a better candidate for a parallel implementation for two reasons:
> (1) The BW is numerically stable and provides a guaranteed discovery of the locally maximum,
Maximum Likelihood Estimator (MLE) for model's parameters (Theta). In Viterbi training, the
MLE is approximated in order to reduce computation time. 
> (2) The BW belongs to the general class of Expectation Maximization (EM) algorithms which
naturally fit into the Map-Reduce framework [2], such as the existing Map Reduce implementation
of k-means in Mahout. 
> Hence, this project proposes to extend Mahout's current sequential implementation of
the Baum-Welch HMM trainer to a scalable, distributed case. Since the distributed version
of the BW will use the sequential implementations of the Forward and the Backward algorithms
to compute the alpha and the beta factors in each iteration, a lot of existing HMM training
code will be reused. Specifically, the parallel implementation of the BW algorithm on Map
Reduce has been elaborated at great length in [3] by viewing it as a specific case of the
Expectation-Maximization algorithm and will be followed for implementation in this project.

> The BW EM algorithm iteratively refines the model's parameters and consists of two distinct
steps in each iteration--Expectation and Maximization. In the distributed case, the Expectation
step is computed by the mappers and the reducers, while the Maximization is handled by the
reducers. Starting from an initial Theta^(0), in each iteration i, the model parameter tuple
Theta^i is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration
i+1. The iteration stops on a user specified convergence condition expressed as a fixpoint
or when the number of iterations exceeds a user defined value. 
> Expectation computes the posterior probability of each latent variable for each observed
variable, weighed by the relative frequency of the observed variable in the input split. The
mappers process independent training instances and emit expected state transition and emission
counts using the Forward and Backward algorithms. The reducers finish Expectation by aggregating
the expected counts. The input to a mapper consists of (k, v_o) pairs where k is a unique
key and v_o is a string of observed symbols. For each training instance, the mappers emit
the same set of keys corresponding to the following three optimization problems to be solved
during the Maximization, and their values in a hash-map:
> (1) Expected number of times a hidden state is reached (Pi).
> (2) Number of times each observable symbol is generated by each hidden state (B).
> (3) Number of transitions between each pair of states in the hidden state space (A).

> The M step computes the updated Theta(i+1) from the values generated during the E part.
This involves aggregating the values (as hash-maps) for each key corresponding to one of the
optimization problems. The aggregation summarizes the statistics necessary to compute a subset
of the parameters for the next EM iteration. The optimal parameters for the next iteration
are arrived by computing the relative frequency of each event with respect to its expected
count at the current iteration. The emitted optimal parameters by each reducer are written
to the HDFS and are fed to the mappers in the next iteration. 
> The project can be subdivided into distinct tasks of programming, testing and documenting
the driver, mapper, reducer and the combiner with the Expectation and Maximization parts split
between them. For each of these tasks, a new class will be programmed, unit tested and documented
within the org.apache.mahout.classifier.sequencelearning.hmm package. Since k-means is also
an EM algorithm, particular attention will be paid to its code at each step for possible reuse.
> A list of milestones, associated deliverable and high level implementation details is
given below. 
> Time-line: April 26 - Aug 15. 
> Milestones: 
> April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with my mentor, refine
the project's plan and requirements, understand the community's code styling requirements,
expand the knowledge on Hadoop and Mahout internals. Thoroughly familiarize with the classes
within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math
> May 23 - June 3 (2 weeks): Work on Driver. Implement, test and document the class HmmDriver
by extending the AbstractJob class and by reusing the code from the KMeansDriver class. 
> June 3 - July 1 (4 weeks): Work on Mapper. Implement, test and document the class HmmMapper.
The HmmMapper class will include setup() and map() methods. The setup() method will read in
the HmmModel and the parameter values obtained from the previous iteration. The map() method
will call the HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm() and
complete the Expectation step partially. 
> July 1 - July 15 (2 weeks): Work on Reducer. Implement, test and document the class HmmReducer.
The reducer will complete the Expectation step by summing over all the occurences emitted
by the mappers for the three optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch()
method if possible. Also, mid-term review.
> July 15 - July 29 (2 weeks): Work on Combiner. Implement, test and document the class
HmmCombiner. The combiner will reduce the network traffic and improve efficiency by aggregating
the values for each of the three keys corresponding to each of the optimization problems for
the Maximization stage in reducers. Look at the possibility of code reuse from the KMeansCombiner
> July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer, combiner and
driver together. Give an example demonstrating the new parallel BW algorithm by employing
the parts-of-speech tagger data set also used by the sequential BW [4]. Tidy up code and fix
loose ends, finish wiki documentation. 
> Additional Information: 
> I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering
from the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson,
as part of my Master's research work, I have applied the theory of Markov Decision Process
(MDP) to increase the duration of service of mobile computers. This semester I am involved
with two course projects involving machine learning over large data sets. In the Bioinformatics
class, I am mining the RCSB Protein Data Bank [5] to learn the dependence of side chain geometry
on a protein's secondary structure, and comparing it with the Dynamic Bayesian Network approach
used in [6]. In another project for the Online Social Networks class, I am using reinforcement
learning to build an online recommendation system by reformulating MDP optimal policy search
as an EM problem [7] and employing Map Reduce (extending Mahout) to arrive at it in a scalable,
distributed manner. 
> I owe much to the open source community as all my research experiments have only been
possible due to the freely available Linux distributions, performance analyzers, scripting
languages and associated documentation. After joining the Apache Mahout's developer mailing
list a few weeks ago,  I have found the community extremely vibrant, helpful and welcoming.
If selected, I feel that the GSOC 2011 project will be a great learning experience for me
from both a technical and professional standpoint and will also allow me to contribute within
my modest means to the overall spirit of open source programming and Machine Learning. 
> References: 
> [1] A tutorial on hidden markov models and selected applications in speech recognition
by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286. 
> [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. Kim, Yi A.
Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. In NIPS (2006), pp. 281-288.

> [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. Morgan &
Claypool 2010. 
> [4] 
> [5]
> [6] Beyond rotamers: a generative, probabilistic model of side chains in proteins by
Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, Hamelryck T. BMC Bioinformatics.
2010 Jun 5.
> [7] Probabilistic inference for solving discrete and continuous state Markov Decision
Processes by M. Toussaint and A. Storkey. ICML, 2006.

This message is automatically generated by JIRA.
For more information on JIRA, see:


View raw message