mahout-user mailing list archives

From nfantone <nfant...@gmail.com>
Subject Re: Clustering from DB
Date Tue, 28 Jul 2009 15:20:06 GMT
Continued in:
http://www.nabble.com/Distance-calculation-performance-issue-td24700418.html

On Mon, Jul 27, 2009 at 3:38 PM, Grant Ingersoll<gsingers@apache.org> wrote:
> I think the bigger issue here is we are doing extra work to calculate
> distance.  I'd suggest hanging on a few days to see if we can get that
> straightened out.
>
> On Jul 27, 2009, at 2:33 PM, nfantone wrote:
>
>>> Well, it does matter to some degree since picking random vectors tends to
>>> give you dense vectors whereas text gives you very sparse vectors.
>>> Different patterns of sparsity can cause radically different time
>>> complexity for the clustering.
>>
>> I have yet to find a random combination of vectors that actually
>> substantially benefits the performance of kMeans. I have also tried
>> real datasets (like the one I was initially using, built from large
>> amounts of data defining consumers' buying habits) to no avail. How
>> should a collection of vectors be created so as not to, say,
>> significantly compromise the algorithm's functionality?
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
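
To illustrate the point quoted above about dense versus sparse vectors, here
is a minimal sketch in Python of why sparsity changes the cost of a distance
calculation. It is not Mahout's implementation; the function names and the
dict-based sparse representation are assumptions made for this example only.
The dense version touches every dimension, while the sparse version only
visits indices that are non-zero in at least one of the two vectors.

```python
import math


def dense_distance(a, b):
    """Euclidean distance between two equal-length lists.

    Work is proportional to the number of dimensions.
    """
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sparse_distance(a, b):
    """Euclidean distance between two sparse vectors.

    Each vector is a dict mapping index -> non-zero value (a hypothetical
    representation chosen for this sketch). Work is proportional to the
    number of non-zero entries, not to the number of dimensions.
    """
    total = 0.0
    # Indices present in a (b may or may not have them).
    for i, x in a.items():
        diff = x - b.get(i, 0.0)
        total += diff * diff
    # Indices present only in b.
    for i, y in b.items():
        if i not in a:
            total += y * y
    return math.sqrt(total)


def to_sparse(dense):
    """Convert a dense list to the index -> value dict used above."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}
```

With 10,000-dimensional vectors that each have only two non-zero entries,
dense_distance loops 10,000 times while sparse_distance loops four times, and
both return the same value. Randomly generated vectors are usually dense, so
they do not get this saving.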
