nfantone wrote:
>> That does seem like a long time.
>>
>> Is your data sparse or dense?
>>
>
> I would say sparse. My vectors are high dimensional and most of their
> values are zero.
>
>
>> Perhaps a larger convergence value might help (-d, I believe).
>>
>
> I'll try that.
>
>
>> Is there any chance your data is publicly shareable? Come to think of it,
>> with the vector representations, as long as you don't publish the key (which
>> terms map to which index), I would think most all data is publicly
>> shareable.
>>
>
> I'm sorry, I don't quite understand what you're asking. Publicly
> shareable? As in user-permissions to access/read/write the data?
>
>
>> Are you on trunk of Mahout? I think we still need more profiling to get a
>> better idea of where improvements can be made.
>>
>
> I am. Updated this morning.
>
> I still insist on the configuration issue, and have never considered
> Mahout's algorithms implementation to be the actual cause of poor
> performance. For now, I've been running kMeans exclusively. Perhaps, I
> should try with different clustering methods and see if it takes a
> similar amount of time to complete.
>
>
>
That does seem like an awfully long time for 62 MB on a 6-node cluster.
How many iterations are running? Were they capped at 32 or did it run
longer? How did you generate your initial clusters? Where are the
iteration jobs spending most of their time (map vs. reduce)? Could you
share a copy of your data file so we can take a look at it? If it is
just un-annotated vectors there should be no IP issues.
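
To make the iteration questions concrete, here is a minimal single-machine sketch of how a convergence threshold and an iteration cap interact in a KMeans-style loop. It is not Mahout's implementation: the function name `kmeans`, its parameters, and the toy data are hypothetical, and the default cap of 32 is taken only from the number mentioned above. The point is that a larger `delta` can only stop the loop at the same iteration or earlier, never later.

```python
import math

def kmeans(points, centers, delta, max_iter=32):
    """Lloyd-style k-means (illustrative only, not Mahout's code).

    Stops when no center moves more than `delta`, or after `max_iter`
    iterations. Returns (final_centers, iterations_run).
    """
    centers = [tuple(c) for c in centers]
    for iteration in range(1, max_iter + 1):
        # Assignment step (roughly the map side): nearest center per point.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step (roughly the reduce side): mean of assigned points.
        new_centers = []
        for old, members in zip(centers, groups):
            if members:
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))))
            else:
                new_centers.append(old)  # empty cluster keeps its center
        # Convergence test: largest distance any center moved this round.
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= delta:
            return centers, iteration
    return centers, max_iter

# Toy data (an assumption for illustration): two 5x5 grids far apart.
points = [(float(i % 5), float(i // 5)) for i in range(25)]
points += [(20.0 + i % 5, 20.0 + i // 5) for i in range(25)]
initial = [points[0], points[1]]

_, loose_iters = kmeans(points, initial, delta=100.0)
_, tight_iters = kmeans(points, initial, delta=1e-9)
print("iterations with loose delta:", loose_iters)
print("iterations with tight delta:", tight_iters)
```

If a real run always hits the cap, that suggests the threshold is too tight for the data or the initial clusters are poor, which is why the iteration count is worth checking first.
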
I've run KMeans over gigabytes of data on 10-node clusters and the jobs
terminate in a few minutes. That is what I would expect from your job.
You could try Canopy on your data. This is a single-pass algorithm that
should take approximately as long as one iteration of KMeans.
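
For reference, a rough sketch of what a single-pass canopy step looks like. This is an illustration under stated assumptions, not Mahout's Canopy code: the function name `canopy_clusters`, the thresholds `t1` (loose) and `t2` (tight), and the sample points are all hypothetical. Each point is compared once against the canopies found so far, which is why the cost is in the same range as one KMeans iteration.

```python
import math

def canopy_clusters(points, t1, t2):
    """One pass over `points` with loose threshold t1 and tight threshold t2.

    Returns a list of (center, members) pairs. Illustrative sketch only.
    """
    if not t1 > t2:
        raise ValueError("t1 must be larger than t2")
    canopies = []
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = math.dist(p, center)
            if d < t1:
                members.append(p)      # loosely inside this canopy
            if d < t2:
                strongly_bound = True  # too close to start a new canopy
        if not strongly_bound:
            canopies.append((p, [p]))  # the point seeds a new canopy
    return canopies

# Hypothetical sample data: two tight pairs of points, far apart.
sample = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0)]
for center, members in canopy_clusters(sample, t1=3.0, t2=1.0):
    print(center, len(members))
```

The resulting canopy centers can also serve as initial clusters for a later KMeans run, so the number of clusters does not have to be guessed up front.
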
Jeff