mahout-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Isabel Drost (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-4) Simple prototype for Expectation Maximization (EM)
Date Mon, 07 Apr 2008 09:10:24 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-4?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12586287#action_12586287
] 

Isabel Drost commented on MAHOUT-4:
-----------------------------------

Adding the comments sent to the list here as well for further reference.

> ----------------------------
> So here is a short write-up in my words, please feel free to
> fill any gaps/errors found

I will try to do so from my perspective, maybe others can add their views.


> Expectation Maximization for clustering
> -----------------------------------------------------
>  Let
>      z = unobserved data, clusters in our case.
>      y = observed data, points in our case.
>
>  p(y1|z1) + p(y2|z1) + p(y3|z1) + p(y4|z1) = 1
>  p(z1) + p(z2) = 1

Looks correct to me.


>  E-Step.
>  ------
>  M-Step
>  ------

I could not find an error in neither of the two steps so far.


> Questions
> =========
> 1. When and how do we re-compute the cluster centers ?

EM does not work with explicit cluster centers. In kmeans you iterate two 
steps: Assigning points to centers and recomputing the centers. In EM you 
again iterate two steps: Computing the probabilities for each point belonging 
to the clusters (so you do not assign them hard to one cluster, you only say 
with probability P it belongs to clusters i to k), in the second step you 
recompute the parameters of each cluster - the cluster center is influenced 
by each point but only weighted by its probability of belonging to this 
cluster.
 

> 2. As per my understanding points and clusters are simply labels with some
>    conditional probability assigned to them. A distance metric like one
>    used in K-means is nowhere involved, is that correct ?

Yes and no: Technically no, conceptually, your computation for the probability 
of assigning a point to a cluster should be based on the point's distance to 
the cluster.

I hope I did not cause more confusion than helping you. Maybe others can 
correct me or clarify what I left unclear...

Isabel

> Simple prototype for Expectation Maximization (EM)
> --------------------------------------------------
>
>                 Key: MAHOUT-4
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-4
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Ankur
>         Attachments: Mahout_EM.patch
>
>
> Create a simple prototype implementing Expectation Maximization - EM that demonstrates
the algorithm functionality given a set of (user, click-url) data.
> The prototype should be functionally complete and should serve as a basis for the Map-Reduce
version of the EM algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message