mahout-dev mailing list archives

From "Robin Anil (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (MAHOUT-1233) Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos
Date Sun, 02 Jun 2013 19:44:22 GMT

     [ https://issues.apache.org/jira/browse/MAHOUT-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-1233:
-------------------------------

    Fix Version/s: 0.8
    
> Problem in processing datasets as a single chunk vs many chunks in HADOOP mode in mostly all the clustering algos
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1233
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1233
>             Project: Mahout
>          Issue Type: Question
>          Components: Clustering
>    Affects Versions: 0.7, 0.8
>            Reporter: yannis ats
>            Assignee: yannis ats
>            Priority: Minor
>             Fix For: 0.8
>
>
> I am trying to process a dataset, and I do it in two ways.
> First, I supply it as a single chunk (the whole dataset), and second, as many smaller chunks in order to increase the throughput of my machine.
> The problem is that when I perform the single-chunk computation the results are fine, and by fine I mean that if I have 1000 vectors in the input, I get 1000 vector IDs with their cluster_ids in the output (I have tried this with canopy, kmeans and fuzzy kmeans).
> However, when I split the dataset in order to speed up the computations, strange phenomena occur.
> For instance, if the same dataset of 1000 vectors is split into, for example, 10 files, then the output contains more vector IDs (e.g. 1100 vector IDs with their corresponding cluster IDs).
> The question is, am I doing something wrong in the process?
> Is there a problem in clusterdump and seqdumper when the input is in many files?
> I have observed that while Mahout is performing the computations, the screen output says that it processed the correct number of vectors.
> Am I missing something?
> As input I use Weka vectors transformed to mvc.
> I have tried this in v0.7 and the v0.8 snapshot.
> Thank you in advance for your time.
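
One way to narrow down the discrepancy described above (1000 input vectors but about 1100 vector IDs in the output) is to check whether the extra rows are repeated vector IDs or genuinely new ones. The following is only a minimal sketch in Python, not part of Mahout: it assumes the clusterdump or seqdumper output has already been parsed into (vector_id, cluster_id) pairs by some other means, and the function name summarize_assignments and the toy data are hypothetical.

```python
from collections import Counter


def summarize_assignments(pairs):
    """Summarize (vector_id, cluster_id) pairs parsed from dumped output.

    Hypothetical helper: parsing the dump itself is assumed to happen elsewhere.
    """
    pairs = list(pairs)
    # Count how many times each vector ID appears across all output rows.
    counts = Counter(vector_id for vector_id, _ in pairs)
    # Keep only the IDs that occur more than once.
    duplicated = {vid: n for vid, n in counts.items() if n > 1}
    return {
        "total_rows": len(pairs),
        "unique_vector_ids": len(counts),
        "duplicated_ids": duplicated,
    }


# Toy data for illustration only; not taken from the reported dataset.
sample = [("v1", 0), ("v2", 1), ("v3", 0), ("v2", 1)]
report = summarize_assignments(sample)
print(report)
```

If the number of unique vector IDs equals the input size while the total row count is larger, the surplus consists of repeated IDs, which would suggest that some vectors are being written or read more than once rather than misclustered.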

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
