[ https://issues.apache.org/jira/browse/MAHOUT396?page=com.atlassian.jira.plugin.system.issuetabpanels:alltabpanel
]
Isabel Drost updated MAHOUT396:

Status: Resolved (was: Patch Available)
Resolution: Fixed
Changes committed.
> Proposal for Implementing Hidden Markov Model
> 
>
> Key: MAHOUT396
> URL: https://issues.apache.org/jira/browse/MAHOUT396
> Project: Mahout
> Issue Type: New Feature
> Affects Versions: 0.4
> Reporter: Max Heimel
> Assignee: Max Heimel
> Priority: Minor
> Fix For: 0.4
>
> Attachments: MAHOUT396.diff, MAHOUT396.diff, MAHOUT396.diff, MAHOUT396.diff,
MAHOUT396.diff, MAHOUT396.diff, MAHOUT396.patch
>
>
> h4. Overview
> This is a project proposal for a summerterm university project to write a (sequential)
HMM implementation for Mahout. Five students will work on this project as part of a course
mentored by Isabel Drost.
> h4. Abstract:
> Hidden Markov Models are used in multiple areas of Machine Learning, such as speech recognition,
handwritten letter recognition or natural language processing. A Hidden Markov Model (HMM)
is a statistical model of a process consisting of two (in our case discrete) random variables
O and Y, which change their state sequentially. The variable Y with states {y_1, ... , y_n}
is called the "hidden variable", since its state is not directly observable. The state of
Y changes sequentially with a so called  in our case firstorder  Markov Property. This
means, that the state change probability of Y only depends on its current state and does not
change in time. Formally we write: P(Y(t+1)=y_iY(0)...Y(t)) = P(Y(t+1)=y_iY(t)) = P(Y(2)=y_iY(1)).
The variable O with states {o_1, ... , o_m} is called the "observable variable", since its
state can be directly observed. O does not have a Markov Property, but its state propability
depends statically on the current state of Y.
> Formally, an HMM is defined as a tuple M=(n,m,P,A,B), where n is the number of hidden
states, m is the number of observable states, P is an ndimensional vector containing initial
hidden state probabilities, A is the nxndimensional "transition matrix" containing the transition
probabilities such that A[i,j]=P(Y(t)=y_iY(t1)=y_j) and B is the mxndimensional "observation
matrix" containing the observation probabilties such that B[i,j]= P(O=o_iY=y_j).
> Rabiner [[1My Page#reference1]] defined three main problems for HMM models:
> # Evaluation: Given a sequence O of observations and a model M, what is the probability
P(OM) that sequence O was generated by model M. The Evaluation problem can be efficiently
solved using the Forward algorithm
> # Decoding: Given a sequence O of observations and a model M, what is the most likely
sequence Y*=argmax(Y) P(OM,Y) of hidden variables to generate this sequence. The Decoding
problem can be efficiently sovled using the Viterbi algorithm.
> # Learning: Given a sequence O of observations, what is the most likely model M*=argmax(M)P(OM)
to generate this sequence. The Learning problem can be efficiently solved using the BaumWelch
algorithm.
> The target of each milestone is defined as the implementation for the given goals, the
respective documentation and unit tests for the implementation.
> h4.Timeline
> Mid of May 2010  Mid of July 2010
> h4.Milestones
> I) Define an HMM class based on Apache Mahout Math package offering interfaces to set
model parameters, perform consistency checks, perform output prediction.
> 1 week from May 18th till May 25th.
> II) Write sequential implementations of forward (cf. problem 1 [[1My Page#reference1]])
and backward algorithm.
> 2 weeks from May 25th till June 8th.
> III) Write a sequential implementation of Viterbi algorithm (cf. problem 2 [[1My Page#reference1]]),
based on existing forward algorithm implementation.
> 2 weeks from June 8th till June 22nd
> IV) Have a running sequential implementation of BaumWelch algorithm for model parameter
learning (application II [ref]), based on existing forward/backward algorithm implementation.
> 2 weeks from June 8th till June 22nd
> V) Provide a usage example of HMM implementation, demonstrating all three problems.
> 2 weeks from June 22nd till July 6th
> VI) Finalize documentation and implemenation, clean up open ends.
> 1 week from July 6th till July 13th
> h4.References:
> {anchor:reference1}[[1http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf]] Lawrence
R. Rabiner (February 1989). "A tutorial on Hidden Markov Models and selected applications
in speech recognition". Proceedings of the IEEE 77 (2): 257286. doi:10.1109/5.18626.
> Potential test data sets:
> [http://www.cnts.ua.ac.be/conll2000/chunking/]
> [http://archive.ics.uci.edu/ml/datasets/Character+Trajectories]

This message is automatically generated by JIRA.

You can reply to this email to add a comment to the issue online.
