mahout-dev mailing list archives

From "Suneel Marthi (JIRA)" <>
Subject [jira] [Commented] (MAHOUT-1431) Comparison of Mahout 0.8 vs mahout 0.9 in EMR
Date Sat, 08 Mar 2014 12:28:43 GMT


Suneel Marthi commented on MAHOUT-1431:

Yannis, could you provide explicit details about the task name? KMeans has 2 mapper tasks and
a reducer task. Which mapper is taking longer? I would expect ClusterClassificationMapper
to be slower due to the additional overhead of calculating the vector distance from the centroid
and converting the vector to a named vector.
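
To make the overhead described above concrete, here is a minimal sketch in plain Java of the kind of per-point work a classification step might do: one distance computation per centroid, followed by wrapping the point in a labeled vector. This is not Mahout's actual code. The class ClassificationCostSketch, the LabeledVector type, and the methods distance, nearest, and classify are hypothetical stand-ins, and the copy made when labeling is an assumption for illustration only.

```java
import java.util.Arrays;

public class ClassificationCostSketch {

    // Hypothetical stand-in for a named vector: an identifier attached to the values.
    static final class LabeledVector {
        final String name;
        final double[] values;

        LabeledVector(String name, double[] values) {
            this.name = name;
            this.values = values;
        }
    }

    // Euclidean distance: one pass over every dimension of the point.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Index of the closest centroid: k distance computations per point.
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = distance(point, centroids[c]);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    // Per-point work: find the nearest centroid, then wrap a copy of the point
    // with a label (the copy is an assumption made for this illustration).
    static LabeledVector classify(String id, double[] point, double[][] centroids) {
        int cluster = nearest(point, centroids);
        return new LabeledVector(id + "->cluster" + cluster,
                                 Arrays.copyOf(point, point.length));
    }

    public static void main(String[] args) {
        double[][] centroids = { {0.0, 0.0}, {10.0, 10.0} };
        LabeledVector v = classify("p1", new double[] {9.0, 8.0}, centroids);
        System.out.println(v.name);
    }
}
```

With n points, k centroids, and d dimensions, this loop costs on the order of n * k * d arithmetic operations plus one wrapper allocation per point, which is why a mapper doing this work on every input vector could plausibly be the slower of the two.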

> Comparison of Mahout 0.8 vs mahout 0.9 in EMR
> ---------------------------------------------
>                 Key: MAHOUT-1431
>                 URL:
>             Project: Mahout
>          Issue Type: Question
>          Components: Clustering
>    Affects Versions: 0.8, 0.9
>            Reporter: yannis ats
>              Labels: performance
> Hi all,
> I tested Mahout 0.8 and 0.9 on Amazon EMR with a large dataset as input, running
> k-means experiments with both versions.
> What I found is that Mahout 0.8 is faster than Mahout 0.9. In particular, I observed
> that Mahout 0.8 performs fewer iterations, and every k-means iteration in 0.8 is
> twice as fast as in 0.9.
> The Hadoop version was 1.0.x, and the input was roughly 2 million data points
> with a dimensionality of 1800.
> The input parameters in both experiments were exactly the same, modulo the initialization,
> which was random in both cases. I can understand that this may affect convergence (the
> number of iterations), but I am baffled by the fact that every iteration takes almost twice
> as long in 0.9 as in 0.8.
> Is this normal? Is this expected?
> Thank you in advance for your time.

This message was sent by Atlassian JIRA
