From: Grant Ingersoll
To: mahout-user@lucene.apache.org
Subject: Re: Clustering from DB
Date: Thu, 23 Jul 2009 12:50:33 -0400

On Jul 23, 2009, at 10:20 AM, nfantone wrote:

>> That does seem like a long time.
>>
>> Is your data sparse or dense?
>
> I would say sparse. My vectors are high dimensional and most of their
> values are zero.
>
>> Perhaps a larger convergence value might help (-d, I believe).
>
> I'll try that.
>
>> Is there any chance your data is publicly shareable? Come to think of it,
>> with the vector representations, as long as you don't publish the key
>> (which terms map to which index), I would think most all data is publicly
>> shareable.
>
> I'm sorry, I don't quite understand what you're asking. Publicly
> shareable? As in user-permissions to access/read/write the data?

As in post a copy of the SequenceFile somewhere for download, assuming you can. Then others could presumably try it out.

>
>> Are you on trunk of Mahout? I think we still need more profiling to get a
>> better idea of where improvements can be made.
>
> I am. Updated this morning.
>
> I still insist on the configuration issue, and have never considered
> Mahout's algorithms implementation to be the actual cause of poor
> performance. For now, I've been running kMeans exclusively. Perhaps, I
> should try with different clustering methods and see if it takes a
> similar amount of time to complete.

Well KMeans actually runs two algorithms normally: canopy and then KMeans. You could try the Random seed approach, which would skip the canopy run first.
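
To make the two suggestions above concrete (random seeding instead of a canopy pass, and a larger convergence value), here is a rough sketch in plain Python. It is not Mahout code and does not use Mahout's API: the function `kmeans` and the parameters `convergence_delta`, `max_iterations` and `seed` are names made up for illustration, and the toy 2-D points stand in for real (sparse, high-dimensional) vectors. It only shows the general idea that iteration stops once no centroid moves more than the delta, so a larger delta can end the run earlier.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, convergence_delta, max_iterations=100, seed=0):
    """Toy k-means with random seeding (hypothetical helper, not Mahout's API).

    Returns (centroids, iterations_run).
    """
    rng = random.Random(seed)
    # Random seeding: start from k input points chosen at random,
    # instead of running a separate canopy pass to find initial centers.
    centroids = [list(p) for p in rng.sample(points, k)]

    for iteration in range(1, max_iterations + 1):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                mean = [sum(m[d] for m in members) / len(members) for d in range(dim)]
                new_centroids.append(mean)
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[i])

        # Convergence check: stop once no centroid moved more than the delta.
        largest_move = max(
            squared_distance(old, new) ** 0.5
            for old, new in zip(centroids, new_centroids)
        )
        centroids = new_centroids
        if largest_move <= convergence_delta:
            break

    return centroids, iteration


# Made-up data: three loose blobs of 2-D points.
data_rng = random.Random(42)
points = [
    (data_rng.gauss(cx, 1.0), data_rng.gauss(cy, 1.0))
    for cx, cy in [(0, 0), (5, 5), (0, 5)]
    for _ in range(50)
]

strict_centroids, strict_iters = kmeans(points, k=3, convergence_delta=0.001)
loose_centroids, loose_iters = kmeans(points, k=3, convergence_delta=0.5)
print("iterations with delta=0.001:", strict_iters)
print("iterations with delta=0.5:  ", loose_iters)
```

With the same starting centroids, the run with the larger delta can never take more iterations than the stricter one, since it stops at the first iteration where the movement falls under the looser threshold. Whether that saves a meaningful amount of time on a real data set is a separate question and would need profiling.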