mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: Clustering from DB
Date Thu, 23 Jul 2009 16:50:10 GMT
nfantone wrote:
>> That does seem like a long time.
>>
>> Is your data sparse or dense?
>>     
>
> I would say sparse. My vectors are high dimensional and most of their
> values are zero.
>
>   
>> Perhaps a larger convergence value might help (-d, I believe).
>>     
>
> I'll try that.
>
>   
>> Is there any chance your data is publicly shareable?  Come to think of it,
>> with the vector representations, as long as you don't publish the key (which
>> terms map to which index), I would think most all data is publicly
>> shareable.
>>     
>
> I'm sorry, I don't quite understand what you're asking. Publicly
> shareable? As in user-permissions to access/read/write the data?
>
>   
>> Are you on trunk of Mahout?  I think we still need more profiling to get a
>> better idea of where improvements can be made.
>>     
>
> I am. Updated this morning.
>
> I still insist on the configuration issue, and have never considered
> Mahout's algorithms implementation to be the actual cause of poor
> performance. For now, I've been running kMeans exclusively. Perhaps, I
> should try with different clustering methods and see if it takes a
> similar amount of time to complete.
>
>
>   
That does seem like an awfully long time for 62 MB on a 6-node cluster. 
How many iterations are running? Were they capped at 32 or did it run 
longer? How did you generate your initial clusters? Where are the 
iteration jobs spending most of their time (map vs. reduce)? Could you 
share a copy of your data file so we can take a look at it? If it is 
just un-annotated vectors there should be no IP issues.
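
To make the iteration question concrete, here is a minimal single-machine sketch of how an iteration cap and a convergence value interact in a k-means loop. It is not Mahout's code: the function name, the parameter names `delta` and `max_iterations`, and the return shape are assumptions made for illustration. The default of 32 only mirrors the cap mentioned above.

```python
import math

def kmeans(points, centers, delta, max_iterations=32):
    """Illustrative k-means loop (hypothetical helper, not Mahout's API).

    Stops when no center moves more than `delta`, or when
    `max_iterations` is reached. Returns (centers, iterations_run, converged).
    """
    centers = [tuple(c) for c in centers]
    for iteration in range(1, max_iterations + 1):
        # Assignment step: each point goes to its nearest current center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: each center becomes the mean of its group.
        # A center with no points assigned keeps its old position.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        # Largest distance moved by any center in this iteration.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= delta:
            return centers, iteration, True
    return centers, max_iterations, False
```

A larger `delta` lets the loop stop earlier, while a run that reports `converged == False` hit the cap without settling. If a job always runs to the cap, the convergence value is probably too tight for the data.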

I've run KMeans over gigabytes of data on 10-node clusters and the jobs 
terminate in a few minutes. That is what I would expect from your job.

You could try Canopy on your data. This is a single-pass algorithm that 
should take approximately as long as one iteration of KMeans.
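
For reference, the single-pass idea can be sketched as below. This is a simplified in-memory version and not Mahout's implementation: the function names and the thresholds `t1` (loose) and `t2` (tight) are illustrative, and Euclidean distance is assumed.

```python
import math

def euclidean(a, b):
    # Plain Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy(points, t1, t2, distance=euclidean):
    """Single pass over `points`; requires t1 > t2.

    A point within t1 of a canopy center joins that canopy.
    A point within t2 of any center is considered covered and
    does not start a canopy of its own.
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    canopies = []
    for p in points:
        covered = False
        for center, members in canopies:
            d = distance(p, center)
            if d < t1:
                members.append(p)
            if d < t2:
                covered = True
        if not covered:
            # No existing center is close enough: p starts a new canopy.
            canopies.append((p, [p]))
    return canopies
```

Each point is compared only against the canopy centers found so far, which is why the cost is comparable to one k-means iteration. The resulting centers can also serve as initial clusters for k-means.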

Jeff
