From: Grant Ingersoll
To: mahout-user@lucene.apache.org
Subject: Re: Clustering from DB
Date: Mon, 27 Jul 2009 14:38:17 -0400

I think the bigger issue here is we are doing extra work to calculate
distance. I'd suggest hanging on a few days to see if we can get that
straightened out.

On Jul 27, 2009, at 2:33 PM, nfantone wrote:

>> Well, it does matter to some degree since picking random vectors
>> tends to give you dense vectors whereas text gives you very sparse
>> vectors.
>
>> Different patterns of sparsity can cause radically different time
>> complexity for the clustering.
>
> I have yet to find a random combination of vectors that actually
> benefits substantially the performance of kMeans. I have also tried
> real datasets (like the one I was initially using from large amounts
> of data defining consumer's buying habits) to no avail. How should a
> collection of vectors be created to, say, not compromise the algorithm
> functionality significantly?

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search