[ https://issues.apache.org/jira/browse/SPARK17716?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=15960534#comment15960534
]
Nick Pentreath commented on SPARK17716:

I don't think we can expect sufficient committer bandwidth to accommodate this. I'd tend to
put this in the same camp of "deep learning on Spark" where it might be best implemented as
a Spark package. And perhaps RNNs are becoming more used in the ML space too?
cc [~josephkb] [~srowen]?
> Hidden Markov Model (HMM)
> 
>
> Key: SPARK17716
> URL: https://issues.apache.org/jira/browse/SPARK17716
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xiangrui Meng
> Assignee: Runxin Li
>
> Had an offline chat with [~Lil'Rex], who implemented HMM on Spark at https://github.com/apache/spark/compare/master...lilrex:sequence.
I asked him to list popular HMM applications, describe public API (params, input/output schemas),
compare its API with existing HMM implementations.
> h1. Hidden Markov Model (HMM) Design Doc
> h2. Overview
> h3. Introduction to HMM
> Hidden Markov Model is a type of statistical Machine Learning model that assumes a sequence
of observations is generated by a Markov process with hidden states. There are 3 (or 2, depending
on the implementation) main components of the model:
> * *Transition Probability*: describes the probability distribution of transitions from
each state to other states (including self) in the Markov process
> * *Emission Probability*: describes the probability distribution for an observation associated
with hidden states
> * *Initial/Start Probability* (optional): represents the prior probability of each state
at the beginning of the observation sequence
> _Note: some implementations merge the Initial Probability into Transition Probability
by adding an arbitrary Start state before the first observation point._
> h3. HMM Models and Algorithms
> Given a limited number of states, most HMM models have the same form of Transition Probability:
a matrix, where each element _(i, j)_ represents the probability of transition from state
_i_ to state _j_. The Initial Probability usually take the simple form of a probabilistic
vector.
> The Emission Probability, on the other hand, can be represented in many different ways,
depending on different nature of observations, i.e. continuous vs. discrete, or different
model assumptions, e.g. single Gaussian vs. Gaussian Mixtures.
> There are three main problems associated with HMM models, and their canonical algorithms:
> # *Evaluation*: What’s the probability of a given observation sequence, based on the
model? It’s usually done by either *Forward* or *Backward* algorithms
> # *Decoding*: What’s the most likely state sequence, given the observation sequence
and the model? It’s usually done by *Viterbi* decoding
> # *Learning*: How to train the parameters of the model based on the observation sequences?
*BaumWelch* (ForwardBackward) is usually used as part of the *EM* algorithm in unsupervised
training
> h3. Popular Applications of HMM
> * Speech Recognition
> * Partofspeech Tagging
> * Named Entity Recognition
> * Machine Translation
> * Gene Prediction
> h2. Alternate Libraries
> [Mahouthttps://mahout.apache.org/users/classification/hiddenmarkovmodels.html]
> * Assume emission probability to be an mbyn matrix
> * Use BaumWelch algorithm for training and Viterbi algorithm for prediction
> * API (command line)
> ** training
> {{$ echo "0 1 2 2 2 1 1 0 0 3 3 3 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 2
2 2 3 3 3 3 3 3 2 3 2 3 2 3 2 1 3 0 0 0 1 0 1 0 2 1 2 1 2 1 2 3 3 3 3 2 2 3 2 1 1 0" >
hmminput}}
> {{$ export MAHOUT_LOCAL=true}}
> {{$ $MAHOUT_HOME/bin/mahout baumwelch i hmminput o hmmmodel nh 3 no 4 e .0001
m 1000}}
> ** prediction
> {{$ $MAHOUT_HOME/bin/mahout hmmpredict m hmmmodel o hmmpredictions l 10}}
> {{$ cat hmmpredictions}}
> [Mallethttp://mallet.cs.umass.edu/api/cc/mallet/fst/HMM.html]
> * Treat HMM as a Finite State Transducer (FST)
> * Theoretically can go beyond firstorder Markov assumption by setting an arbitrary order
> * Limited to text data, i.e. discrete observation sequence with Multinomial emission
model assumption
> * Supervised training only
> * API:
> ** Training:
> {{HMM hmm = new HMM(pipe, null);}}
> {{hmm.addStatesForLabelsConnectedAsIn(trainingInstances);}}
> {{HMMTrainerByLikelihood trainer = new HMMTrainerByLikelihood(hmm);}}
> {{trainer.train(trainingInstances, 10);}}
> ** Testing:
> {{evaluator.evaluate(trainer);}}
> [HMMLearnhttps://github.com/hmmlearn/hmmlearn]
> * Previously part of scikitlearn
> * Algo:
> ** Standard HMM unsupervised training algorithm
> ** Three types of emission models: GMM, Gaussian and Multinomial
> * API:
> ** Training:
> {{model = hmm.GaussianHMM(n_components=3, covariance_type="full")}}
> {{model.fit(X)}}
> ** Testing:
> {{hidden_states = model.predict(X)}}
> h2. API
> Design Goals
> * Build the foundation for general Sequential Tagging models (HMM, CRF, etc.)
> * Support multiple Emission Probability models such as “Multinomial” and “Gaussian
Mixture”
> * Keep both supervised and unsupervised learning for HMM in mind
> h3. Proposed API
> _Note: This is written for the spark.ml API._
> Decoder API
> {code:title=Decoder.scala}
> trait DecoderParams extends Params {
> def featureCol: DataType // column of sequential features, e.g. MatrixUDT
> def predictionCol: DoubleType // column of prediction
> def labelCol: DataType // column of sequential labels, e.g. VectorUDT
> }
> abstract class Decoder extends Estimator with DecoderParams {
> def extractLabeledSequences(dataset: Dataset[_]): RDD[LabeledSequence]
> }
> abstract class DecodingModel extends Model with DecoderParams {
> def numFeatures: Int
> def decode(features: FeatureType): Vector
> }
> {code}
> Tagger API
> {code:title=Tagger.scala}
> trait TaggerParams extends DecoderParams with HasRawPredictionCol {
> def rawPredictionCol: MatrixUDT // column for all predicted label sequences
> }
> abstract class Tagger extends Decoder with TaggerParams
> abstract class TaggingModel extends DecodingModel with TaggerParams {
> def numClasses: Int
> def decodeRaw(features: FeaturesType): Array[(Double, Vector)]
> def raw2prediction(rawPrediction: Array[(Double, Vector)]): Vector
> }
> {code}
> MarginalTagger (Probabilistic Tagger) API
> {code:title=MarginalTagger.scala}
> trait MarginalTaggerParams extends TaggerParams with HasProbabilityCol with HasThreshold
{
> def probabilityCol: DoubleType // column for probability
> }
> abstract class MarginalTagger extends Tagger with MarginalTaggerParams
> abstract class MarginalTaggingModel extends TaggingModel with MarginalTaggerParams {
> def getMargin(featuers: FeaturesType): Double
> }
> {code}
> HiddenMarkovModel API
> {code:title=HiddenMarkovModel.scala}
> trait HiddenMarkovModelParams extends MarginalTaggerParams with HasMaxIter with HasTol
with HasStandardization with HasThreshold {
> def smoothing: DoubleParam
> def emissionType: Param[String] //can be either Multinomial or Gaussian
> }
> class HiddenMarkovModel extends MarginalTagger with HiddenMarkovModelParams {
> def initialModel: Option[HMMModel] //initial model before training
> def def trainSupervised(dataset: Dataset[_]): Option[HMMModel]
> def trainUnsupervised(dataset: Dataset[_]): HMMModel
> }
> abstract class HMMModel extends MarginalTaggingModel with HiddenMarkovModelParams {
> def initialProb: Vector // initial probability for states
> def transitionProb: DenseMatrix // transition probability between states
> //compute feature scores for all states in all frames in a sequence, different for
different emission models, e.g. Multinomial vs. GMM
> def precomputeEmission(features: Matrix): List[Array[Double]]
> // accumulate sufficient statistics for emission model
> def getEmissionStats(features: Matrix, gammas: List[Array[Double]]): DenseMatrix
> // forward algorithm
> def forward(emissions: Traversable[Array[Double]]): List[Array[Double]]
> // backword algorithm
> def backward(emissions: Traversable[Array[Double]]):List[Array[Double]]
> }
> class MultinomialHMMModel extends HMMModel {
> def emissionProb: Matrix // emission probability from states to observations
> }
> object MultinomialHMMModel extends MLReadable[MultinomialHMMModel]
> class HMMModelReader extends MLReader[MultinomialHMMModel]
> {code}

This message was sent by Atlassian JIRA
(v6.3.15#6346)

To unsubscribe, email: issuesunsubscribe@spark.apache.org
For additional commands, email: issueshelp@spark.apache.org
