mahout-user mailing list archives

From nfantone <>
Subject Re: Clustering from DB
Date Thu, 23 Jul 2009 14:20:33 GMT
> That does seem like a long time.
> Is your data sparse or dense?

I would say sparse. My vectors are high-dimensional and most of their
values are zero.

> Perhaps a larger convergence value might help (-d, I believe).

I'll try that.

> Is there any chance your data is publicly shareable?  Come to think of it,
> with the vector representations, as long as you don't publish the key (which
> terms map to which index), I would think most all data is publicly
> shareable.

I'm sorry, I don't quite understand what you're asking. Publicly
shareable? As in user-permissions to access/read/write the data?

> Are you on trunk of Mahout?  I think we still need more profiling to get a
> better idea of where improvements can be made.

I am. Updated this morning.

I still think this is a configuration issue, and I have never considered
Mahout's algorithm implementations to be the actual cause of the poor
performance. So far, I've been running kMeans exclusively. Perhaps I
should try other clustering methods and see whether they take a
similar amount of time to complete.
