Date: Tue, 28 Jul 2009 12:20:06 -0300
Subject: Re: Clustering from DB
From: nfantone
To: mahout-user@lucene.apache.org

Continued in:
http://www.nabble.com/Distance-calculation-performance-issue-td24700418.html

On Mon, Jul 27, 2009 at 3:38 PM, Grant Ingersoll wrote:
> I think the bigger issue here is we are doing extra work to calculate
> distance. I'd suggest hanging on a few days to see if we can get that
> straightened out.
>
> On Jul 27, 2009, at 2:33 PM, nfantone wrote:
>
>>> Well, it does matter to some degree since picking random vectors tends to
>>> give you dense vectors whereas text gives you very sparse vectors.
>>> Different patterns of sparsity can cause radically different time
>>> complexity for the clustering.
>>
>> I have yet to find a random combination of vectors that actually
>> benefits substantially the performance of kMeans.
>> I have also tried real datasets (like the one I was initially using
>> from large amounts of data defining consumer's buying habits) to no
>> avail. How should a collection of vectors be created to, say, not
>> compromise the algorithm functionality significantly?
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
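
A rough illustration of the sparsity point quoted above, as a minimal Python sketch. This is not Mahout code: the function names and the dict-based sparse representation are assumptions made up for this example. It only shows why a distance computed over stored non-zero entries does work proportional to the number of non-zeros, while a dense computation always does work proportional to the full dimension.

```python
def dense_sq_dist(a, b):
    """Squared Euclidean distance over two equal-length lists.

    Touches every coordinate, so the cost grows with the dimension.
    """
    return sum((x - y) ** 2 for x, y in zip(a, b))


def to_sparse(dense):
    """Hypothetical helper: keep only non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def sparse_sq_dist(a, b):
    """Squared Euclidean distance over two {index: value} dicts.

    Touches only indices stored in a or b, so the cost grows with the
    number of non-zero entries rather than with the dimension.
    """
    total = 0.0
    for i, x in a.items():
        d = x - b.get(i, 0.0)  # a missing index means the value is zero
        total += d * d
    for i, y in b.items():
        if i not in a:  # indices present only in b
            total += y * y
    return total


if __name__ == "__main__":
    u = [0, 0, 3, 0, 0, 0, 4, 0]
    v = [0, 1, 3, 0, 0, 0, 0, 0]
    print(dense_sq_dist(u, v))
    print(sparse_sq_dist(to_sparse(u), to_sparse(v)))
```

With randomly generated vectors nearly every coordinate is non-zero, so the sparse form saves nothing; with text-like data most coordinates are zero and the two loops visit only a handful of entries.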