mahout-dev mailing list archives

From "Maxim Arap (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (MAHOUT-1468) Creating a new page for StreamingKMeans documentation on mahout website
Date Wed, 23 Apr 2014 17:01:27 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13978451#comment-13978451 ]

Maxim Arap edited comment on MAHOUT-1468 at 4/23/14 5:00 PM:
-------------------------------------------------------------

Andrew: The default initial value for numClusters is 20, which seems arbitrary. As the algorithm
runs, numClusters will grow to roughly k log n, where k is the final number of clusters (the
number that the BallKMeans step will output) and n is the size of the dataset. In practice,
k log n can be much larger than 20, depending on the dataset and the final number of clusters k.
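
To make the k log n estimate concrete, here is a minimal sketch in Python (not Mahout code). The function name estimated_clusters, the choice of the natural logarithm, and the sample values of k and n are assumptions for illustration only; the estimate above does not fix the base of the logarithm or any constant factor.

```python
import math


def estimated_clusters(k: int, n: int) -> int:
    """Rough k * log(n) estimate of how large numClusters may grow.

    Assumptions (not taken from Mahout): natural logarithm, no constant
    factor, result rounded up to a whole number of clusters.
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    return math.ceil(k * math.log(n))


if __name__ == "__main__":
    default_num_clusters = 20  # default initial value mentioned above
    # Hypothetical (k, n) pairs, chosen only to show the trend.
    for k, n in [(10, 10_000), (10, 1_000_000), (100, 1_000_000)]:
        estimate = estimated_clusters(k, n)
        print(f"k={k}, n={n}: roughly {estimate} clusters "
              f"(default is {default_num_clusters})")
```

Even for modest k, the estimate exceeds 20 once n reaches a few thousand points, which is why the default looks arbitrary.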

Suneel: I tried running the algorithm in both sequential mode and MapReduce mode on
Reuters data last night, but both gave me runtime errors. The reason may be that my laptop has
hadoop-2.2.0, which may not be compatible with Mahout at this point.


was (Author: arapmv):
Andrew: The default initial value for numClusters is 20, which seems arbitrary. As the algorithm
runs, numClusters will grow to roughly k log n, where k is the final number of clusters (the
number that the BallKMeans step will output) and n is the size of the dataset. In practice,
k log n can be much larger than 20, depending on the dataset.

Suneel: I tried running the algorithm in both sequential mode and MapReduce mode on
Reuters data last night, but both gave me runtime errors. The reason may be that my laptop has
hadoop-2.2.0, which may not be compatible with Mahout at this point.

> Creating a new page for StreamingKMeans documentation on mahout website
> -----------------------------------------------------------------------
>
>                 Key: MAHOUT-1468
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1468
>             Project: Mahout
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 1.0
>            Reporter: Pavan Kumar N
>            Assignee: Andrew Musselman
>              Labels: Documentation
>             Fix For: 1.0
>
>         Attachments: StreamingKMeans.txt
>
>
> A separate page is required for the Streaming K-Means algorithm description and overview,
> explaining the various parameters that can be used in streamingkmeans and the strategy for
> parallelization, with a link to this paper: http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf



--
This message was sent by Atlassian JIRA
(v6.2#6252)
