> That does seem like a long time.
>
> Is your data sparse or dense?
I would say sparse. My vectors are high-dimensional and most of their
values are zero.
> Perhaps a larger convergence value might help (-d, I believe).
I'll try that.
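For my own reference, here is a minimal sketch of why a looser
convergence threshold should cut the number of iterations. It is plain
Python, not Mahout's code: the function names, the `delta` parameter,
and the toy data are all my own assumptions, and the stopping rule
(largest centroid movement <= delta) is just one common choice.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_iterations(points, centroids, delta, max_iter=100):
    """Run Lloyd-style k-means and return how many iterations it took.

    Stops once no centroid moves farther than `delta` (hypothetical
    stand-in for a convergence option), or after `max_iter` passes.
    """
    centroids = [list(c) for c in centroids]
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members
        # and track the largest movement seen in this pass.
        moved = 0.0
        for i, members in enumerate(clusters):
            if not members:
                continue  # leave an empty cluster's centroid where it is
            new = [sum(col) / len(members) for col in zip(*members)]
            moved = max(moved, sq_dist(new, centroids[i]) ** 0.5)
            centroids[i] = new

        if moved <= delta:
            return iteration
    return max_iter


# Toy data: two noisy blobs in 2-D, with deliberately poor starting centroids.
random.seed(0)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
points += [(random.gauss(6, 1), random.gauss(6, 1)) for _ in range(200)]
start = [(2.0, 2.0), (3.0, 3.0)]

loose = kmeans_iterations(points, start, delta=0.5)
tight = kmeans_iterations(points, start, delta=1e-9)
print("iterations with loose threshold:", loose)
print("iterations with tight threshold:", tight)
```

With the same starting centroids, the loose run can never take more
passes than the tight one, since both follow the same trajectory and
the loose one simply stops earlier. Each pass is a full scan of the
data, so fewer passes should translate fairly directly into less
wall-clock time.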
> Is there any chance your data is publicly shareable? Come to think of it,
> with the vector representations, as long as you don't publish the key (which
> terms map to which index), I would think most all data is publicly
> shareable.
I'm sorry, I don't quite understand what you're asking. Publicly
shareable? As in user-permissions to access/read/write the data?
> Are you on trunk of Mahout? I think we still need more profiling to get a
> better idea of where improvements can be made.
I am. Updated this morning.
I still maintain that this is a configuration issue; I have never
considered Mahout's algorithm implementations to be the actual cause
of the poor performance. So far I've been running k-means exclusively.
Perhaps I should try other clustering methods and see whether they
take a similar amount of time to complete.