I think this looks really good.
On Apr 4, 2011, at 6:34 PM, Dhruv Kumar wrote:
> Here's a first draft of my proposal. It would be great if I could get
> some feedback before I submit it to GSoC and update MAHOUT-627.
>
> Thank you.
>
> Dhruv
>
> ---- Proposal Begin ----
>
> Proposal Title: Implementation of the Baum-Welch Algorithm for parallel
> Hidden Markov Model training on MapReduce.
>
> Student Name: Dhruv Kumar
>
> Student Email: dkumar@ecs.umass.edu
>
> Organization/Project: Apache Mahout
>
> Assigned Mentor:
>
> Proposal Abstract:
>
> The Baum-Welch algorithm is commonly used for training a Hidden Markov
> Model because it is numerically stable and provides guaranteed discovery
> of a Maximum Likelihood Estimator in the presence of incomplete data.
> Currently, Apache Mahout has a sequential implementation of
> Baum-Welch which cannot scale to train over large data sets. This
> project proposes to extend the sequential implementation of
> Baum-Welch to a parallel, distributed version using the MapReduce
> programming framework to allow scalable Hidden Markov Model training.
>
> Detailed Description:
>
> Hidden Markov Models (HMMs) are widely used as a probabilistic
> inference tool for applications generating temporal or spatial
> sequential data. Their relative simplicity of implementation and their
> ability to discover latent domain knowledge have made them very popular
> in fields such as DNA sequence alignment, handwriting analysis, voice
> recognition, computer vision and parts-of-speech tagging.
>
> An HMM is defined as a tuple (S, O, Theta), where S is a finite set of
> unobservable, hidden states emitting symbols from a finite observable
> vocabulary set O according to a probabilistic model Theta. The
> parameters of the model Theta are defined by the tuple (A, B, Pi),
> where A is a stochastic transition matrix of the hidden states, of size
> |S| x |S|. The elements a_(i,j) of A specify the probability of
> transitioning from state i to state j. Matrix B is a stochastic symbol
> emission matrix of size |S| x |O| whose elements b_(s,o) give the
> probability that symbol o will be emitted from hidden state s. The
> elements pi_(s) of the length-|S| vector Pi determine the probability
> that the system starts in hidden state s. The transitions of hidden
> states are unobservable and follow the Markov property of
> memorylessness.
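> To make the notation concrete, here is a tiny, purely illustrative
> parameter tuple (the numbers are mine, not from Mahout), with the
> stochasticity constraints checked explicitly:

```python
import numpy as np

# Hypothetical 2-state, 3-symbol model; all numbers are illustrative.
A = np.array([[0.7, 0.3],        # a_(i,j): P(transition from state i to j)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # b_(s,o): P(state s emits symbol o)
              [0.1, 0.3, 0.6]])
Pi = np.array([0.6, 0.4])        # pi_(s): P(system starts in state s)

# Stochasticity: each row of A and B, and Pi itself, sums to 1.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(Pi.sum(), 1.0)
```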
>
> Rabiner [1] defined three main problems for HMMs:
>
> 1. Evaluation: Given the complete model (S, O, Theta) and an
> observation sequence, determine the probability that the model
> generated the observed sequence. This is useful for judging the
> quality of the model and is solved using the so-called Forward
> algorithm.
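> For reference, the Forward recurrence can be sketched in a few lines
> (an illustrative sequential sketch; the names are mine, not those of
> Mahout's actual implementation):

```python
import numpy as np

def forward_likelihood(A, B, Pi, obs):
    """P(obs | model) by the Forward algorithm.
    alpha[s] holds P(o_1 .. o_t, state at time t = s)."""
    alpha = Pi * B[:, obs[0]]          # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction step
    return alpha.sum()                 # termination
```

Summing this quantity over all possible observation sequences of a fixed length yields 1, which makes a convenient sanity check for any implementation.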
>
> 2. Decoding: Given the complete model (S, O, Theta) and an observation
> sequence, determine the hidden state sequence which most likely
> generated the observed sequence. This can be viewed as an inference
> problem where the model and observed sequence are used to predict the
> values of the unobservable random variables. The Viterbi decoding
> algorithm is used for predicting the hidden state sequence.
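> A minimal sketch of Viterbi decoding (illustrative only; function and
> variable names are mine, and real implementations typically work in
> log space to avoid underflow):

```python
import numpy as np

def viterbi_path(A, B, Pi, obs):
    """Most likely hidden state sequence for obs (max-product DP)."""
    T, N = len(obs), A.shape[0]
    delta = Pi * B[:, obs[0]]          # best score ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    for t in range(1, T):
        scores = delta[:, None] * A             # scores[i, j]: come from i, land in j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta.argmax())]       # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```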
>
> 3. Training: Given the set of hidden states S, the observation
> vocabulary O and an observation sequence, determine the parameters (A,
> B, Pi) of the model Theta. This problem can be viewed as a statistical
> machine learning problem of fitting a model to a large set of training
> data. The Baum-Welch (BW) algorithm (also called the Forward-Backward
> algorithm) and the Viterbi training algorithm are commonly used for
> model fitting.
>
> Since most problems modeled by HMMs have a modest number of hidden and
> observable states, the sequential versions of the Forward and the
> Viterbi algorithms (currently implemented in Mahout) are sufficient for
> evaluation and decoding purposes. However, the training data is often
> so large that a single compute node is incapable of handling it.
> Attempts to prune the training data can lead to lower quality model
> fitting and may miss crucial domain knowledge because of inferior
> posterior probabilities. Currently, Mahout only supports sequential HMM
> trainers (Viterbi and Baum-Welch), which are incapable of scaling to
> large data sets.
>
> This project proposes to extend Mahout's current sequential
> implementation of the Baum-Welch HMM trainer to a scalable, distributed
> version. The distributed version of BW will use the sequential
> implementations of the Forward and the Backward algorithms in each
> iteration, so much of the existing HMM training code will be reused.
>
> Of the two sequential HMM training methods, the Baum-Welch (BW) or
> Forward-Backward algorithm is the better candidate for a parallel
> implementation, for two reasons. (1) BW is more numerically stable
> and provides guaranteed discovery of a Maximum Likelihood Estimator
> (MLE) (albeit a local maximum), unlike Viterbi training, where the MLE
> is approximated in order to reduce computation time. (2) BW belongs to
> the general class of Expectation Maximization (EM) algorithms, which
> naturally fit into the MapReduce framework [2]. Specifically, the
> parallel implementation of the BW algorithm on MapReduce has been
> elaborated at great length in [3].
>
> The BW EM algorithm iteratively refines the model's parameters and
> consists of two distinct steps in each iteration: Expectation (E) and
> Maximization (M). In the distributed case, the E step is computed by the
> mappers and the reducers, while the M step is handled by the reducers.
> Starting from an initial Theta(0), in each iteration i the model
> parameter tuple Theta(i) is input to the algorithm, and the end result
> Theta(i+1) is fed to the next iteration, i+1. The iteration stops on a
> user-specified convergence condition expressed as a fixpoint, or when
> the number of iterations exceeds a user-defined value.
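> Schematically, the driver loop looks like the following sketch (names
> are mine; e_step and m_step stand in for the MapReduce phases, and the
> fixpoint test compares successive parameter vectors):

```python
import numpy as np

def em_train(theta0, e_step, m_step, max_iter=50, tol=1e-6):
    """Generic EM driver loop: iterate E then M until the parameter
    vector reaches a fixpoint or the iteration cap is exceeded."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        counts = e_step(theta)       # in the proposal: mappers (+ reducers)
        theta_next = m_step(counts)  # in the proposal: reducers
        if np.max(np.abs(theta_next - theta)) < tol:  # fixpoint reached
            return theta_next
        theta = theta_next
    return theta
```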
>
> The E step computes the posterior probability of each latent variable
> for each observed variable, weighted by the relative frequency of the
> observed variable in the input split. The mappers process independent
> training instances and emit expected state transition and emission
> counts using the Forward-Backward algorithm. The reducers finish the E
> step by aggregating the expected counts. The input to a mapper consists
> of (k, v_o) pairs, where k is a unique key and v_o is a string of
> observed symbols of length n. For each training instance, the mappers
> emit the same set of keys, corresponding to the following three
> optimization problems to be solved during the M step: (1) the expected
> number of times each hidden state is reached, (2) the number of times
> each observable symbol is generated by each hidden state and (3) the
> number of transitions between each pair of states in the hidden state
> space.
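> The expected counts emitted per training instance can be sketched
> sequentially as follows (illustrative; the names are mine). The three
> returned arrays correspond to the three optimization problems above:

```python
import numpy as np

def expected_counts(A, B, Pi, obs):
    """Expected counts for one training instance, as a mapper would
    emit them: state occupancies, emission counts, transition counts."""
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))             # forward lattice
    beta = np.zeros((T, N))              # backward lattice
    alpha[0] = Pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()
    gamma = alpha * beta / likelihood    # gamma[t, s] = P(state_t = s | obs)
    state_counts = gamma.sum(axis=0)                 # problem (1)
    emit_counts = np.zeros_like(B)                   # problem (2)
    for t in range(T):
        emit_counts[:, obs[t]] += gamma[t]
    trans_counts = np.zeros_like(A)                  # problem (3)
    for t in range(T - 1):
        trans_counts += alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / likelihood
    return state_counts, emit_counts, trans_counts
```

A useful invariant: for a length-T instance the state and emission counts each sum to T and the transition counts sum to T - 1, since the posteriors at each time step sum to one.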
>
> The M step computes the updated Theta(i+1) from the values generated
> during the E step. This involves aggregating, for each key
> corresponding to one of the optimization problems, the values obtained
> in the E step. The aggregation summarizes the statistics necessary to
> compute a subset of the parameters for the next EM iteration. The
> optimal parameters for the next iteration are arrived at by computing
> the relative frequency of each event with respect to its expected count
> at the current iteration. The optimal parameters emitted by each
> reducer are written to HDFS and are fed to the mappers in the next
> iteration.
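> The relative-frequency re-estimation can be sketched as follows
> (illustrative; start_counts plays the role of the expected
> initial-state occupancies aggregated across instances):

```python
import numpy as np

def m_step(trans_counts, emit_counts, start_counts):
    """Re-estimate (A, B, Pi) as the relative frequency of each expected
    count -- a sketch of what each reducer computes for its key."""
    A_new = trans_counts / trans_counts.sum(axis=1, keepdims=True)
    B_new = emit_counts / emit_counts.sum(axis=1, keepdims=True)
    Pi_new = start_counts / start_counts.sum()
    return A_new, B_new, Pi_new
```

Each normalization is row-wise, so the re-estimated A and B are again stochastic matrices and Pi a probability vector, ready to be written out for the next iteration's mappers.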
>
> The project can be subdivided into distinct tasks of programming,
> testing and documenting the driver, mapper, reducer and the combiner. A
> list of milestones with their associated deliverables is given below.
>
> Timeline: April 26 - August 15.
>
> Milestones:
>
> April 26 - May 22 (4 weeks): Pre-coding tasks. Open communication with
> my mentor, refine the project's plan and requirements, understand the
> code styling requirements, and deepen my knowledge of Hadoop's and
> Mahout's internals. Thoroughly familiarize myself with the classes in
> the classifier.sequencelearning.hmm, clustering.kmeans, common,
> vectorizer and math packages.
>
> May 23 - June 3 (2 weeks): Driver. Implement and test the class
> HmmDriver by extending the AbstractJob class and by reusing the code
> from the KMeansDriver class.
>
> June 3 - July 1 (4 weeks): Mapper. Implement and test the class
> HmmMapper. The HmmMapper class will include setup() and map() methods.
> The setup() method will read in the HmmModel and the parameter values
> obtained from the previous iteration. The map() method will call the
> HmmAlgorithms.backwardAlgorithm() and the
> HmmAlgorithms.forwardAlgorithm() to compute the E step partially.
>
> July 1 - July 15 (2 weeks): Reducer. Implement and test the class
> HmmReducer. The reducer will complete the E step by summing over all the
> occurrences emitted by the mappers for the three optimization problems.
> Reuse the code from the HmmTrainer.trainBaumWelch() method if possible.
>
> July 15 - July 29 (2 weeks): Combiner. Implement and test the class
> HmmCombiner. The combiner will reduce the network traffic and improve
> efficiency by aggregating the values for each of the three keys
> corresponding to each of the optimization problems for the M stage. Look
> at the possibility of code reuse from KMeansCombiner.
>
> July 29 - August 15 (2 weeks): Give an example demonstrating the new
> parallel BW algorithm using the parts-of-speech tagger data set
> also used by the sequential BW. Tidy up code and fix loose ends, conduct
> final tests, finish any remaining documentation.
>
> Additional Information:
>
> I am in the final stages of finishing my Master's degree in Electrical
> and Computer Engineering from the University of Massachusetts Amherst.
> Working under the guidance of Prof. Wayne Burleson, as part of my
> Master's research work, I have applied the theory of Markov Decision
> Process (MDP) to increase the duration of service of mobile computers.
> This semester I am involved with two course projects involving machine
> learning over large data sets. In a Bioinformatics class I am mining the
> RCSB Protein Data Bank to learn the dependence of side chain geometry on
> the protein's secondary structure. In another project for the Online
> Social Networks class, I am building an online recommendation system by
> using the MDP theory and recasting the problem to be an instance of
> sequential optimal control.
>
> I owe much to the open source community as all my research experiments
> have only been possible due to the freely available Linux distributions,
> open source performance analyzers, scripting languages such as Python
> and the suite of GNU tools. I have found the Apache Mahout developer
> community vibrant, helpful and welcoming. If selected, I feel that the
> GSoC 2011 project will be a great learning experience for me, from both
> a technical and a professional standpoint, and will allow me to
> contribute within my modest means to the overall spirit of open source.
>
> References:
>
> [1] A Tutorial on Hidden Markov Models and Selected Applications in
> Speech Recognition by Lawrence R. Rabiner. In Proceedings of the IEEE,
> Vol. 77 (1989), pp. 257-286.
>
> [2] MapReduce for Machine Learning on Multicore by Cheng T. Chu, Sang
> K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle
> Olukotun. In NIPS (2006), pp. 281-288.
>
> [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin and
> Chris Dyer. Morgan & Claypool, 2010.
>
>
> ---- Proposal End ----
>
>
> On Thu, Mar 24, 2011 at 8:08 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
>> I would invert the order of 1 and 2 and estimate the times as 30, 30, 40.
>> You can view this from a few points of view:
>>
>> a) writing good tests is hard
>>
>> b) writing good tests makes writing code much easier
>>
>> c) writing any kind of decent documentation is VASTLY harder than most
>> people think. It is also very important for user adoption, mentor
>> satisfaction and personal marketing (for instance a resume or portfolio)
>>
>> On Thu, Mar 24, 2011 at 4:41 PM, Dhruv Kumar <dkumar@ecs.umass.edu> wrote:
>>
>>> Thanks Ted, I'll start working on a proposal having the following
>>> subtasks
>>> (I have given a rudimentary percent time estimate, please feel free to
>>> suggest alterations):
>>>
>>> 1. Implementing BW on MapReduce following the lines of k-means. Focus
>>> on
>>> reusing as much of the existing k-means code as possible. (60%)
>>>
>>> 2. Unit testing the Mapper, Combiner, Reducer and testing the
>> integration,
>>> in local and pseudo-distributed modes. I may be able to get access to a
>>> small cluster at UMass for unit testing in the real distributed mode.
>> (35%)
>>>
>>> 3. Writing clear documentation directing clients how to use the
>> implemented
>>> library code for their needs. (5%)
>>>
>>>
>>>
>>> On Thu, Mar 24, 2011 at 6:45 PM, Ted Dunning <ted.dunning@gmail.com>
>>> wrote:
>>>
>>>> On Thu, Mar 24, 2011 at 3:34 PM, Dhruv Kumar <dkumar@ecs.umass.edu>
>>> wrote:
>>>>
>>>>> 2. Another very interesting possibility is to express the BW as a
>>>> recursive
>>>>> join. There's a very interesting offshoot of Hadoop, called Haloop (
>>>>> http://code.google.com/p/haloop/) which supports loop control, and
>>>> caching
>>>>> of the intermediate results on the mapper inputs, reducer inputs and
>>>>> reducer outputs to improve performance. The paper [1] describes this
>> in
>>>>> more
>>>>> detail. They have implemented kmeans as a recursive join.
>>>>>
>>>>
>>>> Until there is flexibility around execution model such as the recent
>>>> mapreduce 2.0 announcement
>>>> from Yahoo and until that flexibility is pretty much standard, it is
>> hard
>>>> to
>>>> justify this.
>>>>
>>>> The exception is where such extended capabilities fit into standard
>>> hadoop
>>>> 0.20 environments.
>>>>
>>>
>>>>
>>>>> In either case, I want to clearly define the scope and task list. BW
>>> will
>>>>> be
>>>>> the core of the project but:
>>>>>
>>>>> 1. Does it make sense to implement the "counting method" for model
>>>>> discovery as well? It is clearly inferior, but would it be a good
>>>>> reference for comparison with BW? Any added benefit?
>>>>>
>>>>
>>>> No opinion here except that increased scope decreases probability of
>> even
>>>> partial success.
>>>>
>>>>
>>>>> 2. What has been the standard in the past GSoC Mahout projects
>>> regarding
>>>>> unit testing and documentation?
>>>>>
>>>>
>>>> Do it.
>>>>
>>>> Seriously.
>>>>
>>>> We use junit 4+ and very much prefer strong unit tests. Nothing in
>> what
>>>> you
>>>> are proposing should
>>>> require anything interesting in this regard. Testing the mapper,
>>> combiner
>>>> and reducer in isolation is
>>>> good. Testing the integrated program in local mode or pseudo
>> distributed
>>>> mode should suffice beyond
>>>> that. It is best if you can separate command line argument parsing
>>>> from the execution path so that you can test them separately.
>>>>
>>>>>
>>>>> In the meantime, I've been understanding more about Mahout, Map
>> Reduce
>>>> and
>>>>> Hadoop's internals. One of my course projects this semester is to
>>>> implement
>>>>> the Bellman Iteration algorithm on Map Reduce and so far it has been
>>>> coming
>>>>> along well.
>>>>>
>>>>> Any feedback is much appreciated.
>>>>>
>>>>> Dhruv
>>>>>
>>>>
>>>
>>

Grant Ingersoll
Lucene Revolution -- Lucene and Solr User Conference
May 25-26 in San Francisco
www.lucenerevolution.org
