mahout-user mailing list archives

From Grant Ingersoll
Subject Re: Distance calculation performance issue
Date Thu, 30 Jul 2009 01:49:24 GMT

On Jul 29, 2009, at 9:07 AM, nfantone wrote:

> Grant, I took a look at your patch. It seems as though you did
> something similar to what I did. However, I believe that there's still
> room for improvement as there are things being calculated
> unnecessarily for no apparent reason. Could you please read my
> previous post? At least the "excursus" bit. I may be totally wrong,
> though: some particular parts were a bit obscure to me. Perhaps you
> (or Shashikant) can throw some light in there? We might be able to
> release a bigger/better patch.

Agreed, can you put your changes up as a patch on MAHOUT-121?  That  
way we can do file diffs, etc.

>>> I think your data set ran, for 10 iterations, in just over 2 minutes
>>> and that was with the profiler hooked up, too.
> Um... I also did that and, while it was considerably faster than
> before, it took about 2 hours to complete (it used to take days, mind
> you), using a 4-node Hadoop cluster. The actual vector clustering
> alone, that is, the final step, took just over an hour:
> Started at: Tue Jul 28 17:44:20 ART 2009
> Finished at: Tue Jul 28 18:46:24 ART 2009
> Finished in: 1hrs, 2mins, 4sec
> How exactly did you launch the job? What convergence delta did you
> choose? How many clusters did you set up initially?

--input ../nfantone/ --clusters ../nfantone/output/clusters --k 10
--output ../content/nfantone/output/ --convergence 0.01 --overwrite

So, it wasn't exactly what you were running.  I will try to run yours
at some point.

