mahout-user mailing list archives

From nfantone <nfant...@gmail.com>
Subject Re: Distance calculation performance issue
Date Wed, 29 Jul 2009 13:07:04 GMT
Grant, I took a look at your patch. It seems you did something similar
to what I did. However, I believe there's still room for improvement,
as some things are being calculated unnecessarily. Could you please
read my previous post? At least the "excursus" bit. I may be totally
wrong, though: some particular parts were a bit obscure to me. Perhaps
you (or Shashikant) can shed some light on them? We might be able to
release a bigger/better patch.

>>  I think your data set ran, for 10 iterations, in just over 2 minutes
>> and that was with the profiler hooked up, too.

Um... I also did that and, while it was considerably faster than
before, it took about 2 hours to complete (it used to take days, mind
you), using a 4-node Hadoop cluster. The actual vector clustering
alone (that is, the final step) took just over an hour:

Started at: Tue Jul 28 17:44:20 ART 2009
Finished at: Tue Jul 28 18:46:24 ART 2009
Finished in: 1hrs, 2mins, 4sec

How exactly did you launch the job? What convergence delta did you
choose? How many clusters did you set up initially?
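
To illustrate why those two settings matter when comparing run times, here is a minimal single-machine k-means sketch. It is not Mahout's code: the function names (`euclidean`, `kmeans`) and the toy data are hypothetical, and it assumes a cluster counts as converged once its centroid moves no more than the convergence delta between iterations. A looser delta stops earlier, so two runs over the same data can differ widely in iteration count.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, centroids, convergence_delta, max_iterations):
    """Run k-means and return (final_centroids, iterations_run).

    Hypothetical helper: stops as soon as every centroid has moved
    no more than convergence_delta, or after max_iterations.
    """
    for iteration in range(1, max_iterations + 1):
        # Assign each point to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: euclidean(p, centroids[i]))
            groups[nearest].append(p)

        # Recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else old
            for g, old in zip(groups, centroids)
        ]

        # Converged when no centroid moved more than the delta.
        converged = all(euclidean(old, new) <= convergence_delta
                        for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if converged:
            return centroids, iteration
    return centroids, max_iterations


points = [(0, 0), (0, 2), (10, 0), (10, 2)]
initial = [(0, 0), (10, 0)]
tight = kmeans(points, initial, convergence_delta=0.5, max_iterations=10)
loose = kmeans(points, initial, convergence_delta=1.5, max_iterations=10)
print(tight[1], loose[1])
```

The number of initial clusters matters in the same way: every iteration computes one distance per point per centroid, so the cost of each pass grows with the cluster count.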
