mahout-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From conflue...@apache.org
Subject [CONF] Apache Mahout > Parallel Viterbi
Date Sun, 04 Dec 2011 12:35:00 GMT
Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Parallel Viterbi (https://cwiki.apache.org/confluence/display/MAHOUT/Parallel+Viterbi)


Edited by Sergey Bartunov:
---------------------------------------------------------------------
Viterbi algorithm is known as inference algorithm (synonyms: segmentation, decoding etc) for
Hidden Markov Model \[1\] which finds the most likely sequence of hidden states by given sequence
of observed states.

Apache Mahout has both [sequential|Hidden Markov Models] and parallel (that's what you're
reading about) implementations of the algorithm.

Detailed presentation about parallel viterbi implementation could be found in [there|http://modis.ispras.ru/seminar/wp-content/uploads/2011/11/Mahout_Viterbi.pdf]
(in russian)

h3. Parallelization strategy

is quite straightforward and based on data parallelizm. There are some studies on Viterbi
(and Belief Propogation which is inference algorithm for loop-less Markov Random Fields and
is quite similar to Viterbi) parallelization, but at the moment of writing this article none
of them seem to be applyable for MapReduce paradigm.

For example, forward pass of Viterbi could be represented in terms of matrix computations
(as being dynamic programming algorithm) an thus essentially paralleled, but overhead for
MapReduce would be greater than profit for parallel matrix multiplication.

Input sequences of observed variables are supposed to be divided into the chunks of some length,
enough to store O(N*K) data in main memory. A set of all chunks number N is called a "serie
number N". The algorithm process the data from serie number N-1 to serie number N (or vice
versa), performing forward and backward Viterbi passes independently for each chunk (and consequently
for each sequence) in reducers. Only data that is nescessary for computation of next serie
is being outputed by direct output of reducers, all other data is collected in background.
For example, when performing forward Viterbi pass only probabilities of last hidden state
are nescessary for the next step, backpointers tables could be written in parallel to local
store since they would be needed only for backward pass.

If all the sequences are of the same length approximately and the number of sequences to decode
is much more that number of reducers, O(N*M/K) time is required to decode them in parallel
(N is number of each sequence, M is number of all sequences, K is number of reducers).

h3. Data format

Each sequence of observed states must be stored in sequence files, where key is the name of
the sequence and value is ObservedSequenceWritable where number of chunk, data length
and data itself are stored. At the moment it is hardcoded requirement, but it seems to be
easy to implement any input file format that will output this information.

The easiest way to get adjust plain text files with space-delimeted numbers of observed states
to this format is to use "bin/mahout hmmchunks".

After parallel Viterbi is ended, decoded sequences will be stored in sequence files, one for
each chunk (key is number of chunk, value is HiddenSequenceWritable). They could be unchunked
to plain text space-delimeted numbers of hidden states by "bin/mahout hmmchunks \-unchunk".

h3. Usage

Run "bin/mahout pviterbi" and see what it wants from you. That is: 

* serialized HmmModel (i.e. by LossyHmmModelSerializer class)
* input data (observed sequences) in the format described above
* paths for temporary storage (i.e. backpointers) and for decoded sequences

*References*

# [Wikipedia article|http://en.wikipedia.org/wiki/Viterbi_algorithm]

Change your notification preferences: https://cwiki.apache.org/confluence/users/viewnotifications.action
   

Mime
View raw message