mahout-dev mailing list archives

From "yannis ats (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1233) Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos
Date Sat, 08 Jun 2013 12:25:20 GMT


yannis ats commented on MAHOUT-1233:

For some strange reason I obtain errors in mean shift with my dataset.
If I use any other dataset (like some synthesized data from Gaussians with small dimensionality),
mean shift seems to work fine.
I know there is no problem with the dataset, since it works fine for Canopy, k-means and
fuzzy k-means.
Something I have observed, and for which you probably have an answer, is that with fuzzy
k-means I obtain a lot of "empty clusters".
For instance, if I cluster my dataset into 1600 clusters with fuzzy k-means, I see that
only ~1300 clusters are non-empty.
Is this normal?
If I do the same experiment with k-means I obtain a clustering with all 1600 clusters (I don't see
any empty clusters).
I will try other datasets to see what is going on.
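As an aside on the empty-cluster question: a cluster only counts as "non-empty" if it is some point's strongest membership, and nothing forces every one of the k clusters to win at least one point. The standalone sketch below (plain Python with purely synthetic random weights, not Mahout code and not the actual fuzzy k-means membership formula) illustrates that hard-assignment effect:

```python
import random

random.seed(42)

def count_nonempty(n_points, k):
    """Hard-assign each point to its highest-weight cluster and count
    how many of the k clusters win at least one point."""
    winners = set()
    for _ in range(n_points):
        # Synthetic stand-in for a fuzzy membership vector: k random
        # weights per point (real fuzzy k-means weights depend on
        # distances to the centers, but the counting logic is the same).
        weights = [random.random() for _ in range(k)]
        winners.add(max(range(k), key=weights.__getitem__))
    return len(winners)

n_points, k = 800, 400
non_empty = count_nonempty(n_points, k)
print(f"{non_empty} of {k} clusters are non-empty after hard assignment")
```

Under this toy model a noticeable fraction of clusters stays empty whenever k is large relative to the number of points that strongly prefer any one center, so a gap like 1600 requested vs. ~1300 non-empty is not by itself evidence of a bug.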

> Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos
> -----------------------------------------------------------------------------------------------------------------
>                 Key: MAHOUT-1233
>                 URL:
>             Project: Mahout
>          Issue Type: Question
>          Components: Clustering
>    Affects Versions: 0.7, 0.8
>            Reporter: yannis ats
>            Assignee: yannis ats
>            Priority: Minor
>             Fix For: 0.8
> I am trying to process a dataset and I do it in two ways.
> First I give it as a single chunk (the whole dataset), and second as many smaller chunks
in order to increase the throughput of my machine.
> The problem is that when I perform the single-chunk computation the results are fine,
> and by fine I mean that if I have 1000 vectors in the input I get 1000
vector IDs with their cluster IDs in the output (I have tried Canopy, k-means and fuzzy k-means).
> However, when I split the dataset in order to speed up the computations, strange phenomena occur.
> For instance, if the same dataset of 1000 vectors is split into, for example,
10 files, then in the output I obtain more vector IDs (e.g. 1100 vector IDs with their corresponding cluster IDs).
> The question is, am I doing something wrong in the process?
> Is there a problem in clusterdump and seqdumper when the input is in many files?
> I have observed that while Mahout is performing the computations the screen says it
processed the correct number of vectors.
> Am I missing something?
> I use as input the Weka vectors transformed to mvc.
> I have tried this in v0.7 and the v0.8 snapshot.
> Thank you in advance for your time.
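One quick way to narrow down whether the extra IDs come from the clustering step or from the dump step is to check the combined chunked output for duplicate vector IDs. The sketch below assumes the dumped output has already been parsed into (vector_id, cluster_id) pairs per chunk; the parsing itself and the pair format are illustrative assumptions, not Mahout's actual seqdumper output format:

```python
from collections import Counter

def check_chunked_output(chunks):
    """chunks: one list of (vector_id, cluster_id) pairs per output chunk.
    Returns the total number of rows and any vector IDs that appear
    more than once across all chunks."""
    counts = Counter()
    total = 0
    for chunk in chunks:
        for vec_id, _cluster_id in chunk:
            counts[vec_id] += 1
            total += 1
    duplicates = sorted(vid for vid, n in counts.items() if n > 1)
    return total, duplicates

# Toy example: vector "v2" appears in two chunks, so the combined
# output has 4 rows for only 3 distinct input vectors.
chunks = [[("v1", "c0"), ("v2", "c1")], [("v2", "c1"), ("v3", "c0")]]
total, dups = check_chunked_output(chunks)
print(total, dups)  # 4 ['v2']
```

If the duplicate count matches the surplus (e.g. the extra ~100 IDs out of 1100), the vectors are being emitted more than once across chunks rather than invented by the dumper.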

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see:
