mahout-dev mailing list archives

From "yannis ats (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1233) Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos
Date Sat, 08 Jun 2013 12:25:20 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678738#comment-13678738 ]

yannis ats commented on MAHOUT-1233:
------------------------------------

For some strange reason I get errors from mean shift with my dataset;
if I use any other dataset (e.g. synthetic data drawn from low-dimensional Gaussians),
mean shift seems to work fine.
I know there is no problem with the dataset itself, since it works fine with canopy,
k-means, and fuzzy k-means.
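
Here is roughly the kind of generator I mean: a minimal sketch, assuming Mahout's usual
SequenceFile<Text, VectorWritable> input layout and the Hadoop 1.x SequenceFile.Writer
API; the output path, dimensionality, and vector names are placeholders.

import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

/** Writes 1000 random Gaussian vectors in the SequenceFile<Text, VectorWritable>
 *  layout that the Mahout clustering drivers read. */
public class SyntheticGaussians {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("testdata/points/part-00000"); // hypothetical path
    int dim = 2;                                       // low dimensionality
    Random rng = new Random(42);
    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, conf, out, Text.class, VectorWritable.class);
    try {
      for (int i = 0; i < 1000; i++) {
        double[] p = new double[dim];
        for (int d = 0; d < dim; d++) {
          p[d] = rng.nextGaussian();                   // N(0, 1) per coordinate
        }
        writer.append(new Text("v" + i), new VectorWritable(new DenseVector(p)));
      }
    } finally {
      writer.close();
    }
  }
}

Running a small, well-separated set like this through mean shift first should tell us
whether the failure is in the algorithm or in my data.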
Something I have observed, and perhaps you have an answer for, is that with fuzzy
k-means I get a lot of "empty clusters".
For instance, if I cluster my dataset into 1600 clusters with fuzzy k-means, I end up
with only ~1300 non-empty clusters.
Is this normal?
If I run the same experiment with k-means I get all 1600 clusters (I don't see any
empty clusters).
I will try other datasets and see what is going on.
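
To make concrete what I mean by an "empty" fuzzy cluster, here is a minimal
illustration (plain Java, not Mahout's implementation): if memberships are hardened
by assigning each point to its strongest cluster, any cluster that is never the
strongest for any point comes out empty. The class and method names are hypothetical.

/** Illustration only: once fuzzy memberships are hardened to each point's
 *  strongest cluster, some clusters can end up with no points at all. */
public class EmptyFuzzyClusters {
  /** Counts clusters that are never any point's top membership.
   *  u[point][cluster] is a membership matrix whose rows sum to 1. */
  static int emptyClusters(double[][] u) {
    int k = u[0].length;
    boolean[] isTop = new boolean[k];
    for (double[] row : u) {
      int best = 0;
      for (int c = 1; c < k; c++) {
        if (row[c] > row[best]) {
          best = c;
        }
      }
      isTop[best] = true;
    }
    int empty = 0;
    for (boolean top : isTop) {
      if (!top) {
        empty++;
      }
    }
    return empty;
  }

  public static void main(String[] args) {
    // Three points, three clusters: cluster 2 is never the strongest, so it is "empty".
    double[][] u = {
        {0.7, 0.2, 0.1},
        {0.6, 0.3, 0.1},
        {0.2, 0.6, 0.2},
    };
    System.out.println(emptyClusters(u)); // prints 1
  }
}

With 1600 clusters and far fewer natural groupings in the data, many clusters would
never win a point this way, which would match the ~1300 non-empty clusters I see.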

> Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1233
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1233
>             Project: Mahout
>          Issue Type: Question
>          Components: Clustering
>    Affects Versions: 0.7, 0.8
>            Reporter: yannis ats
>            Assignee: yannis ats
>            Priority: Minor
>             Fix For: 0.8
>
>
> I am trying to process a dataset, and I do it in two ways.
> First I give it as a single chunk (the whole dataset), and second as many smaller chunks,
> in order to increase the throughput of my machine.
> The problem is that when I perform the single-chunk computation the results are fine,
> and by fine I mean that if I have 1000 vectors in the input I get 1000 vector IDs with
> their cluster IDs in the output (I have tried canopy, k-means, and fuzzy k-means).
> However, when I split the dataset in order to speed up the computations, strange phenomena occur.
> For instance, if the same dataset of 1000 vectors is split into, say, 10 files, then in the
> output I obtain more vector IDs (e.g. 1100 vector IDs with their corresponding cluster IDs).
> The question is: am I doing something wrong in the process?
> Is there a problem in clusterdump and seqdumper when the input is in many files?
> (A small diagnostic sketch for this follows below.)
> I have observed that while Mahout is performing the computations, the screen output says
> it processed the correct number of vectors.
> Am I missing something?
> I use as input the Weka vectors transformed to mvc.
> I have tried this in v0.7 and the v0.8 snapshot.
> Thank you in advance for your time.
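
A rough way to check whether the extra vector IDs are duplicated records rather than a
dumping problem: walk every part file in the output directory and compare the total
record count against the number of distinct (key, value) pairs. This is only a sketch;
the output path is hypothetical, and it assumes the key/value classes override
toString() meaningfully (IntWritable and Mahout's WeightedVectorWritable do).

import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

/** Counts total and distinct records across all part files of a clustering
 *  output directory, to see whether the extra vector IDs are duplicates. */
public class CountOutputRecords {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path dir = new Path("output/clusteredPoints"); // hypothetical path
    long total = 0;
    Set<String> distinct = new HashSet<String>();
    for (FileStatus status : fs.listStatus(dir)) {
      if (!status.getPath().getName().startsWith("part-")) {
        continue; // skip _SUCCESS, _logs, etc.
      }
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
      try {
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
          total++;
          distinct.add(key.toString() + "/" + value.toString());
        }
      } finally {
        reader.close();
      }
    }
    System.out.println("records: " + total + ", distinct: " + distinct.size());
  }
}

If the record total exceeds the input size but the distinct count matches it, the split
inputs are producing duplicate cluster assignments rather than extra vectors.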

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
