[ https://issues.apache.org/jira/browse/MAHOUT652?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=13056796#comment13056796
]
Sergey Bartunov commented on MAHOUT652:

Backward pass implementation is complete. It requires a little bit more accurate saving the
resulting sequences and the whole code should be refactored a little to propose the patch.
All the changes are in the github repository.
I could not perform performance test yet because the cluster I wanted to work on is experiencing
some technical issues. I hope I'll perform the test a little bit later.
> [GSoC Proposal] Parallel Viterbi algorithm for HMM
> 
>
> Key: MAHOUT652
> URL: https://issues.apache.org/jira/browse/MAHOUT652
> Project: Mahout
> Issue Type: New Feature
> Affects Versions: 0.5
> Reporter: Sergey Bartunov
> Assignee: Isabel Drost
> Labels: classification, gsoc,, gsoc11, hmm, mahoutgsoc11, sequencelearning
> Fix For: 0.6
>
>
> Proposal Title: Parallel Viterbi algorithm for HMM
> Student Name: Sergey Bartunov
> Student Email: sbos.net@gmail.com
>
> Organization/Project: Apache Mahout
> Assigned Mentor:
> Proposal Abstract:
> The Viterbi Algorithm is an evaluating algorithm for Hidden Markov Model [1]. It estimates
the most likely sequence of hidden states called "Viterbi path" from given sequence of observed
states. Viterbi algorithm is also used in Viterbi training which is less numerical stable
than BaumWelch algorithm but faster and can be adjusted to have near the same accuracy [2].
Hidden Markov Model is widely used for speech and gesture recognition, partofspeech tagging,
bioinformatics etc. At the moment Apache Mahout contains only sequential HMM functionality,
and this project is intended to extend it by implementing MapReduce version of Viterbi algorithm
which would make Mahout able to evaluate HMM on big amounts of data in parallel mode.
> Detailed Description:
> The Viterbi algorithms is a quite common dynamic programming algorithm, it's well described
in many sources such as [3], [4], [5]. As being popular and needed, Viterbi algorithm already
have parallel versions which became a subject of several studies, for example, this paper
about implementation for GPU [6]. Some of these papers may contain useful ideas for mapreduce
implementation.
> There are several possible strategies for parallelization (which one is
> better is an open question for the project) I discovered and posted to the mailing list.
> See [this threadhttp://mailarchives.apache.org/mod_mbox/mahoutdev/201104.mbox/%3CBANLkTin7kZm=6taXckO80Vvm6PS7vb5evg@mail.gmail.com%3E]
> The most reasonable one is:
> Assume we have O  set of oberved state sequences (that's our input data), $O_i$ is ith
sequence with length
> len($O_i$), K is number of hidden states in our model. {$O_{i, t}$} is the observed state
of ith sequence at the moment of time t. Let the N be the maximum of len($O_i$).
> Let's write all the seqences one above other. We will get the matrix which rows are the
input seqeunces. Then take the first L columns of this matrix and name them the "first chunk",
then next L columns and so on. Thus we divide our input into the chunks of size L (less or
equal to L). L is the parameter of the algorithm. So the nth chunk contain subsequences with
$t$ in the iterval [L*n, L*(n+1)).
> The key idea is to process each chunk by standard sequential Viterbi algorithm (the existing
Mahout code could be reused here) in parallel mode one by one using results of previous chunk
processing.
> It will require about O((M/P) * K^2) time where M is sum of all sequence lengths if most
of input sequences have near the same length (the good case). P here is the number of "cores"
/ computational nodes. This time complexity means that such strategy scales well.
> Here is MapReduce scheme for the good case:
> * Map emits (subsequence, probabilites, paths) vector for each subsequence in the chunk,
where probabilities is the initial probabilites in case of the first chunk and optimalpath
probabilities in all other cases, paths means optimal paths to the first observations of each
subsequence.
> * Reduce function just performs sequential Viterbi algorithm on these data and returns
(probabilities, paths).
> * Driver class runs the Map then Reduce for the first chunk, then Map and Reduce for
the second, and so on. Then provide the results (probably using another mapreduce combination).
> Probably, it's possible to use lazy viterbi [7,8] instead of the standard Viterbi for
chunk processing.
> Let's focus now on all other cases when first T chunks contain the subsequences with
near the same length and all other NT chunks do not. Well, then they could be processed using
the next technique (which is strategy number 2 in [the listhttp://mailarchives.apache.org/mod_mbox/mahoutdev/201104.mbox/%3CBANLkTin7kZm=6taXckO80Vvm6PS7vb5evg@mail.gmail.com%3E]):
> At first, denote $V_{t,k}$ as the probabilty that optimal hidden state path goes through
hidden state $k$ at the moment $t$. We need such probabilities to select the most likely (or
optimal) path so we need to compute $V_{t,k}$ for every $t$ and $k$ (that is the part of standard
Viterbi). I omit here the $i$ index of the sequence, because each seqence is the separate
task.
> 1) process each chunk separately. $t$ lie in the [L*n, L*(n+1)) interval, where n is
the number of chunk and to compute Actually to do this, we need to compute $V_{t,k}$ we need
to know $V_{L*n1,k1}$ for each $k1$. But this value belong to the chunk the previous number
(n1) which is processed separately. We need to obtain the max and argmax of $V_{L*n1, k1}$
for each k1.
> 2) Since we don't know the max's and argmax's let's just assume that optimal path goes
through the k1=1 then through k2=2 and so on up to the K, and $V_{L*n1, k1}$ = 1 (or 0 if
we use log(prob) adding instead of prob multiplying) and argmax is k1=1, 2 ... K. Later we
may multiply the $V_{L*(n+1)1, k1}$ by the precise probability max{ $V_{L*n1, k1}$ } (or
add the log of it). Thus we get K^2 possibly optimal paths instead of K.
> 3) After we process all the chunks we start "connecting" phase.
> The first of these NT chunks could be processed by using usual Viterbi processing since
we know the results for the previous "good" chunk. All others need to be connected each to
other starting from connecting second this first then 3th with 4th and so on. During connection
process we throw away all nonoptimal paths (know we know which are optimal and which are
not) and leave just K possibly optimal paths.
> To handle such difficult cases the Driver class should know the sequence lengths to recognize
which MapReduce combination it should use. The resulting time complexity for difficult case
is
> O((T*L / N) / P) * K^2 + (((TN) * L / N)) / P * K^3). The first term of this sum is
time for computing the first T chunks using simple case approach, and the second term is time
for computing the rest using difficult case approach. When T > N overall time tend to
O((M/P) * K^2). T * L / N is about total length of subsequences in the first T chunks and
(((TN) * L / N)) is total length of the rest.
> The only problem not discussed here is how to deal with obtained most likely paths. Path
length is equal to length of a sequence, so it's nood a good idea to emit optimal path in
Map stage. They could be stored separately in the files or HBase since they are required only
in getting the results to the user.
> This implementation should work (or be compatible) with org.apache.mahout.classifier.sequencelearning.hmm.HmmModel
> As you see, Viterbi algorithm use very common dynamic approach. It computes the next
computation layer by combining values the previous, that's all we need to know how to implement
it in the MapReduce and it shares the mentioned difficulties with all other such dynamic
algorithms, so the implentation may be highly resuable.
> Timeline:
> 1. 1 week. Discover Mahout internals, sequential HMM code and code best practices. Think
on problem with path storing. I'll work on this stage actually before GSoC starts but I leave
here 1 week just to be sure.
> 2. 5 days. Write the chunk division routines.
> 3. 5 days. Write the SimpleMapper class for simple good case of data.
> 4. 1 week. Write the SimpleReducer class for simple good case.
> 5. 1 week. Write the Driver class for simple good case which will use SimpleMapper and
SimpleReducer.
> 6. 1 week. Collect debug dataset, write tests for the code and find possible problems,
fix them.
> 7. 10 days. Write the HardReducer class for difficult bad case and rewrite Driver class
to handle such cases. Perform tests, ensure that everything works.
> 7. 10 days. Try to get a large dataset to test performance, scalability etc, write and
perform such a test (here a community help might be needed).
> If there are any problems discuss and analyze them, then fix.
> 8. 1 week. Try to use some optimized Viterbi implentation (i.e. Lazy Viterbi) in chunk
processing phase, measure speed improvement especially for difficult cases.
> 9. 1 week. Write documentation for all classes, description of algorithm in
> wiki and complete example of usage.
> Additional Information:
> The project relates to [this proposalhttps://issues.apache.org/jira/browse/MAHOUT627]
which is about implementing parallel BaumWelch traning algorithm for HMM.
> My motivation for this project is that I need such a functionality as a regular Mahout
user since I face HMM often in my research work. Also I'm very interested in what kinds or
classes of algorithms could be efficiently implemented in MapReduce, I also think that this
is important question for the community and such an experience is significant.
> The important detail is that I have exams in the university until 16th June. After that
I will be ready to work on the project.
> References:
> [1] Lawrence R. Rabiner (February 1989). ["A tutorial on Hidden Markov Models and selected
applications in speech recognition"http://www.cs.cornell.edu/courses/cs481/2004fa/rabiner.pdf].
Proceedings of the IEEE 77 (2): 257286. doi:10.1109/5.18626.
> [2] [Adjusted Viterbi Training. A proof of concept.http://personal.rhul.ac.uk/utah/113/VA/AVTPC.pdf]
> [3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
> [4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
> [5] http://www.kanungo.com/software/hmmtut.pdf
> [6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
> [7] http://www.scribd.com/doc/48568001/LazyViterbiSlides
> [8] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1040367

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
