mahout-dev mailing list archives

From "new user (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-357) Implement a clustering algorithm on mapreduce
Date Fri, 02 Apr 2010 05:32:27 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12852716#action_12852716 ]

new user commented on MAHOUT-357:
---------------------------------

Yeah, it is the k-means algorithm. But the difference lies in an efficient implementation
of k-means on a very large data set on the MapReduce framework. By efficient, I mean that
you don't have to copy the whole data set to each of the nodes. In my implementation, the
data is distributed equally among the nodes, and each node processes only its own portion
of the data. The result, however, is the same as if the whole data set were processed on a
single machine. So the strength of the algorithm lies in this local processing without
copying the whole data set.
I hope I have answered your question. In case you meant something else, please elaborate.
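The equivalence claimed here (local processing per node matching single-machine processing) rests on the fact that a mean can be reassembled exactly from per-partition sums and counts. A minimal sketch in Python (a simulation with made-up data and a two-node split, not Mahout code):

```python
# Sketch: per-node partial (sum, count) pairs recombine into the exact
# global mean, so no node ever needs the full data set.
points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]

# Split the data equally among the nodes (two simulated nodes here).
partitions = [points[:2], points[2:]]

# Each node processes only its own portion: a local sum and a count.
partials = []
for part in partitions:
    local_sum = [sum(coord) for coord in zip(*part)]
    partials.append((local_sum, len(part)))

# Merging the partials reproduces the single-machine mean.
total = [sum(p[0][d] for p in partials) for d in range(2)]
n = sum(p[1] for p in partials)
distributed_mean = tuple(t / n for t in total)

global_mean = tuple(sum(coord) / len(points) for coord in zip(*points))
assert distributed_mean == global_mean
```

The same decomposition is what lets the combiner below emit compact per-cluster summaries instead of shipping raw points to the reducer.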

> Implement a clustering algorithm on mapreduce
> ---------------------------------------------
>
>                 Key: MAHOUT-357
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-357
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: new user
>
> As I mentioned in my previous posts, I am interested in implementing a clustering
> algorithm on MapReduce, so now I will describe how I plan to implement it. Thinking
> of the k-means algorithm for clustering, it appears that the whole data set has to be
> copied to each node of the Hadoop framework to process the data in each iteration of
> k-means clustering. But this can be done without useless replication of data across the
> cluster. First of all, we select a set of k elements as the initial cluster centroids.
> This can be purely random or decided on the basis of some criterion. We maintain a file
> which stores the id of each cluster, the number of elements in each cluster, and the
> exact position of the cluster centroid in terms of its co-ordinates. This file has to be
> shared by all of the nodes. During each iteration of the algorithm, the following steps
> are performed:
> 1. As each node has a part of the initial data, during the map phase it calculates the
> distance of each of its elements from the k cluster centroids. For each row, the smallest
> distance is chosen, and the id of that cluster and the position of the element are emitted.
> 2. During the combine phase, for each cluster, the average of the co-ordinates of all the
> elements assigned to it is calculated, along with the number of elements in that cluster.
> So the combiner function outputs the cluster id, the average co-ordinates, and the
> element count.
> 3. During the reduce phase, each cluster centroid is re-calculated as the average of the
> combiner outputs, weighted by their element counts.
> Thus, after these three steps, the new centroid and the number of elements in each
> cluster are updated.
> The above three steps are performed iteratively, applying the map-combine-reduce phases
> again, as long as the convergence condition is not satisfied.
> I have proposed this as per my understanding of the problem and my knowledge. If anybody
> has any doubts or wants to add or suggest anything, please respond as soon as possible.
> And, if you consider it a good idea, then please suggest how to proceed further in the
> GSoC procedure.
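The three quoted steps can be simulated end-to-end in plain Python (a sketch of the proposed map/combine/reduce flow, not actual Hadoop or Mahout code; the point data, node split, and centroid seeds are invented for illustration):

```python
import math

def map_phase(partition, centroids):
    """Step 1: for each local element, emit (nearest-centroid id, point)."""
    out = []
    for p in partition:
        dists = [math.dist(p, c) for c in centroids]
        out.append((dists.index(min(dists)), p))
    return out

def combine_phase(mapped):
    """Step 2: per cluster, output (average co-ordinates, element count)."""
    sums, counts = {}, {}
    for cid, p in mapped:
        acc = sums.setdefault(cid, [0.0] * len(p))
        for d, x in enumerate(p):
            acc[d] += x
        counts[cid] = counts.get(cid, 0) + 1
    return {cid: ([s / counts[cid] for s in sums[cid]], counts[cid])
            for cid in sums}

def reduce_phase(combined_per_node, k, dim):
    """Step 3: new centroid = combiner averages, re-weighted by counts."""
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for combined in combined_per_node:
        for cid, (avg, n) in combined.items():
            for d in range(dim):
                sums[cid][d] += avg[d] * n
            counts[cid] += n
    new_centroids = []
    for cid in range(k):
        if counts[cid]:
            new_centroids.append(tuple(s / counts[cid] for s in sums[cid]))
        else:
            new_centroids.append(None)  # empty cluster: no update
    return new_centroids

# One iteration over two simulated nodes; in the proposal this would
# repeat until the convergence condition is satisfied.
data_node_a = [(1.0, 1.0), (1.5, 2.0)]
data_node_b = [(8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.0), (9.0, 9.0)]  # k = 2 initial seeds

combined = [combine_phase(map_phase(part, centroids))
            for part in (data_node_a, data_node_b)]
centroids = reduce_phase(combined, k=2, dim=2)
```

Note that each node ships only k (average, count) pairs to the reducer rather than its raw points, which is exactly the bandwidth saving the proposal is after; the count-weighted re-average in the reducer makes the result identical to single-machine k-means.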

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

