mahout-user mailing list archives

From nfantone
Subject Re: Clustering from DB
Date Wed, 15 Jul 2009 21:16:55 GMT
Um... I'm bringing news that is somewhat inconsistent with your
suggestion: CanopyDriver runs its job just fine on the very same
dataset. It certainly takes a while, but it finished in an acceptable time.
Unless the convergence conditions for the two algorithms are radically
different, I'd say there's something odd going on. Of course, I'll
take into consideration what you mentioned about adding nodes to my
cluster, although that doesn't depend entirely on me.

On Wed, Jul 15, 2009 at 5:55 PM, Jeff Eastman wrote:
> nfantone wrote:
>> Well, I grew tired of watching the whole thing run and stopped it. I
>> then started another test, this time using a smaller dataset
>> of 3 GB, and it is still taking way too long.
>> See inline comments.
>>> You are only specifying a single reducer. Try increasing that as below.
>> I did. I set it to my K value (200).
> Way too big given your single-node operation. See below.
>>> No, number of nodes is the number of nodes (computers) in your cluster.
>>> You did not say how many nodes you are running on.
>> I'm running and compiling the application on one simple desktop
>> computer at work, and that isn't likely to change after the
>> development process is finished.
> This is the root of your problem: You only have a single node in your
> cluster. Running Hadoop in this configuration is possible, but it will be
> much slower than if you had more machines. Perhaps you can get some interest
> from some of your other colleagues in donating some storage and cycles on
> their machines to your effort. When I was at CollabNet, I got a dozen
> developers' machines running in a cluster so I could test out the early
> clustering stuff. These machines typically had gigs of free storage and were
> not heavily utilized in CPU capacity, so nobody ever noticed I was running
> jobs on them at all.
> Alternatively, for a couple of dollars on AWS you can run the job on a
> cluster of your own. For your job I would expect the cost to be literally in
> the couple of dollars range.
> You will find KMeans will scale almost linearly with the number of boxes you
> throw at it.
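
To make the "200 reducers is way too big for one node" point concrete, here is a minimal sketch of one common sizing heuristic. It is not from this thread: the 0.95 factor follows the rule of thumb in the Hadoop MapReduce tutorial (about 0.95 times nodes times reduce slots per node), the value of 2 reduce slots per node is an assumption about the tasktracker configuration, and the class and method names are hypothetical, not part of Mahout or Hadoop.

```java
// Hypothetical helper for illustration only; not part of Mahout or Hadoop.
final class ReducerHint {
    // Rule-of-thumb factor from the Hadoop MapReduce tutorial (assumption:
    // we want all reduces to start in a single wave).
    static final double LOAD_FACTOR = 0.95;

    // Suggests a reducer count from the cluster size and the number of
    // reduce slots each node offers. Always returns at least 1.
    static int suggestReducers(int nodes, int reduceSlotsPerNode) {
        if (nodes < 1 || reduceSlotsPerNode < 1) {
            throw new IllegalArgumentException(
                "nodes and reduceSlotsPerNode must be >= 1");
        }
        int suggested = (int) Math.floor(LOAD_FACTOR * nodes * reduceSlotsPerNode);
        return Math.max(1, suggested);
    }

    public static void main(String[] args) {
        int slots = 2; // assumed reduce slots per node; check your own config
        System.out.println("1 node:   " + suggestReducers(1, slots));
        System.out.println("12 nodes: " + suggestReducers(12, slots));
    }
}
```

Under these assumptions the suggestion for a single desktop machine is 1 reducer, and for a dozen machines about 22, both far below the K value of 200 tried above.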
