Continued in: http://www.nabble.com/Distance-calculation-performance-issue-td24700418.html

On Mon, Jul 27, 2009 at 3:38 PM, Grant Ingersoll wrote:
> I think the bigger issue here is that we are doing extra work to calculate
> distance.  I'd suggest hanging on a few days to see if we can get that
> straightened out.
>
> On Jul 27, 2009, at 2:33 PM, nfantone wrote:
>
>>> Well, it does matter to some degree, since picking random vectors tends to
>>> give you dense vectors, whereas text gives you very sparse vectors.
>>>
>>> Different patterns of sparsity can cause radically different time
>>> complexity for the clustering.
>>
>> I have yet to find a random combination of vectors that actually
>> benefits kMeans performance substantially. I have also tried real
>> datasets (like the one I was initially using, built from large amounts
>> of data describing consumers' buying habits) to no avail. How should a
>> collection of vectors be created so as not to compromise the
>> algorithm's functionality significantly?
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
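[Editor's note: the sparsity point in the thread above can be illustrated with a minimal sketch. This is not Mahout's actual distance code (Mahout is Java); it is a hypothetical Python illustration of why distance computation over sparse "text-like" vectors can cost far less than over dense random vectors of the same nominal dimensionality.]

```python
def sq_dist_dense(a, b):
    """Squared Euclidean distance over full dense lists: O(d) per pair,
    where d is the full dimensionality."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sq_dist_sparse(a, b):
    """Squared Euclidean distance over dicts mapping index -> value:
    O(nnz(a) + nnz(b)) per pair, independent of the full dimension d."""
    total = 0.0
    for i, x in a.items():
        total += (x - b.get(i, 0.0)) ** 2
    for i, y in b.items():
        if i not in a:          # dimensions where only b is nonzero
            total += y * y
    return total

if __name__ == "__main__":
    # A text corpus might have a vocabulary of ~1,000,000 terms, but each
    # document vector has only a handful of nonzero entries.
    a = {3: 1.0, 17: 2.0}       # sparse "text-like" vectors
    b = {17: 1.0, 99: 3.0}
    # Only 4 entries are touched here, regardless of the nominal
    # dimensionality; a dense representation would touch all d of them
    # on every distance call k-means makes per point, per iteration.
    print(sq_dist_sparse(a, b))  # prints 11.0
```

Since k-means computes a distance from every point to every centroid on every iteration, the per-pair cost difference (d versus number of nonzeros) compounds directly into total running time, which is consistent with the "radically different time complexity" remark above. Note that centroids of sparse points tend to densify as they average many documents, which erodes part of this advantage.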