mahout-dev mailing list archives

From "Grant Ingersoll (JIRA)" <>
Subject [jira] Commented: (MAHOUT-121) Speed up distance calculations for sparse vectors
Date Fri, 07 Aug 2009 14:41:14 GMT


Grant Ingersoll commented on MAHOUT-121:

bq. Please, feel free to contradict me here - That was the whole point: getStd() is NEVER

Ah, I see now.  We were indeed calculating it in computeCentroid lots of times, but it is
only ever used for DisplayKMeans.  You are correct.

As for loop unrolling, etc., it is usually best to let the compiler take care of that.  As
for strings, you should never, ever use String concatenation in a loop.  It is a horrible performance
drain.  StringBuilder is definitely the way to go.  Especially be on the lookout for String
concats in logging statements that aren't guarded by if (log.isDebugEnabled()).
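The two points above can be sketched as follows. This is a minimal illustration, not Mahout code; the class and method names here are made up for the example, and the log-guard pattern is shown in a comment because wiring up a real logger is beside the point.

```java
public class ConcatDemo {
  // Building a String with += in a loop creates a brand-new String (and
  // copies all accumulated characters) on every iteration: O(n^2) overall.
  static String slowJoin(String[] parts) {
    String s = "";
    for (String p : parts) {
      s += p; // allocates a fresh String each pass
    }
    return s;
  }

  // StringBuilder appends into a single growable buffer: amortized O(n).
  static String fastJoin(String[] parts) {
    StringBuilder sb = new StringBuilder();
    for (String p : parts) {
      sb.append(p);
    }
    return sb.toString();
  }

  // The same cost hides in logging: the concatenation runs even when the
  // message is discarded, so guard it, e.g. (with an slf4j-style logger):
  //   if (log.isDebugEnabled()) {
  //     log.debug("centroid=" + centroid);
  //   }

  public static void main(String[] args) {
    String[] parts = {"a", "b", "c"};
    System.out.println(slowJoin(parts).equals(fastJoin(parts))); // true
  }
}
```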

I will fix the lengthSquared comparison and commit.


> Speed up distance calculations for sparse vectors
> -------------------------------------------------
>                 Key: MAHOUT-121
>                 URL:
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>         Attachments: Canopy_Wiki_1000-2009-06-24.snapshot, doc-vector-4k, MAHOUT-121-cluster-distance.patch,
MAHOUT-121-distance-optimization.patch, MAHOUT-121-new-distance-optimization.patch, mahout-121.patch,
MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch,
mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors. The complete
dataset has a few tens of thousands of feature items, but each vector has only a couple of hundred
feature items. For this, there is an optimization in distance calculation, a link to which
I found in the archives of the Mahout mailing list.
> I tried out this optimization.  The test setup had 2000 document vectors with a few hundred
items each.  I ran canopy generation with Euclidean distance and t1, t2 values of 250 and 200.
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know from experience that using Integer and Double objects instead of primitives is computationally
expensive. I changed the sparse vector implementation to use the primitive collections from Trove.
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum up, these two optimizations reduced cluster generation time by about 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans.  
> Licensing of Trove seems to be an issue which needs to be addressed.
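The distance optimization discussed in this issue rests on the identity |x − y|² = |x|² + |y|² − 2(x · y): if the squared lengths are cached, each distance call reduces to one sparse dot product over the nonzero entries. The sketch below is not Mahout's SparseVector API; it represents a sparse vector as a plain Map<Integer, Double> purely to illustrate the idea.

```java
import java.util.HashMap;
import java.util.Map;

public class SparseDistance {
  // Squared Euclidean distance via
  //   |x - y|^2 = |x|^2 + |y|^2 - 2 * (x . y)
  // With |x|^2 and |y|^2 cached by the caller, only the dot product is
  // computed here, and it touches just the nonzeros of the sparser vector.
  static double distanceSquared(Map<Integer, Double> x, double xLenSq,
                                Map<Integer, Double> y, double yLenSq) {
    // Iterate over the vector with fewer nonzero entries.
    Map<Integer, Double> small = x.size() <= y.size() ? x : y;
    Map<Integer, Double> big = (small == x) ? y : x;
    double dot = 0.0;
    for (Map.Entry<Integer, Double> e : small.entrySet()) {
      Double v = big.get(e.getKey());
      if (v != null) {
        dot += e.getValue() * v;
      }
    }
    // Clamp tiny negative results caused by floating-point rounding.
    return Math.max(0.0, xLenSq + yLenSq - 2.0 * dot);
  }

  public static void main(String[] args) {
    Map<Integer, Double> x = new HashMap<>();
    x.put(0, 3.0);
    Map<Integer, Double> y = new HashMap<>();
    y.put(1, 4.0);
    // |x|^2 = 9, |y|^2 = 16, disjoint supports so x . y = 0
    System.out.println(distanceSquared(x, 9.0, y, 16.0)); // 25.0
  }
}
```

For vectors with a couple of hundred nonzeros out of tens of thousands of features, this does a few hundred map lookups per distance call instead of a full pass over every dimension, which is consistent with the order-of-magnitude speedups reported above.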

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
