Subject: Re: Some basic introductory questions
From: Grant Ingersoll
Date: Thu, 17 Sep 2009 05:57:05 -0700
To: mahout-user@lucene.apache.org

On Sep 17, 2009, at 12:36
AM, Aleksander Stensby wrote:

> Hi all,
> I've been following the development of Mahout for quite a while now and
> figured it was time for me to get my hands dirty:)
>
> I've gone through the examples and Grant's excellent IBM article (great work
> on that Grant!).

Thanks!

> So, now I'm at the point where I want to figure out where I go next.
> Specifically, I'm a bit fuzzed about common practices when it comes to
> utilizing Mahout in my own applications...
>
> Case scenario:
> I have my own project, add the dependencies to Mahout (through maven), and
> make my own little kMeans test class.
> I guess my question is a bit stupid, but how would you go about using Mahout
> out of the box?
>
> Ideally (or maybe not?), I figured that I could just take care of providing
> the Vectors -> push it into mahout and run the kMeans clustering...
> But when I started looking at the kMeans clustering example, I notice that
> there is actually a lot of implementation in the example itself... Is it
> really necessary for me to implement all of those methods in every project
> where I want to do kMeans? Can't they be reused? The methods I talk about
> are for instance:
> static List populateCanopies(DistanceMeasure measure, List
> points, double t1, double t2)

Yeah, this one is a bit weird here.

> private static void referenceKmeans(List points,
> List> clusters, DistanceMeasure measure, int maxIter)

I think that is for testing purposes, but don't have the code up at the mo'.

> private static boolean iterateReference(List points, List
> clusters, DistanceMeasure measure)
>
> In my narrow minded head I would think that input would be the List
> and that the output would be List of some general kMeans
> method that did all the internals for me... Or am I missing something? Or do
> I have to use the KMeansDriver.runJob and read input from serialized vectors
> files?

I think the piece that is missing is these algs. are designed to scale and use Hadoop.
Imagine passing around 5+ million dense vectors with large cardinality.
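
To put a rough number on that, here is a back-of-envelope sketch. It assumes each dense vector stores one 8-byte double per dimension and ignores object and array overhead. The cardinality of 1,000 is a made-up figure for illustration only (the message gives none), and the class and method names are hypothetical, not part of Mahout.

```java
// Back-of-envelope estimate of the raw storage for a set of dense vectors.
// Assumption: one 8-byte double per dimension, no object/array overhead.
final class DenseVectorMemory {

    // Size of a Java double in bytes (8).
    static final long BYTES_PER_DOUBLE = Double.BYTES;

    /** Raw bytes needed to hold the values of numVectors dense vectors. */
    static long estimateBytes(long numVectors, long cardinality) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        long totalValues = Math.multiplyExact(numVectors, cardinality);
        return Math.multiplyExact(totalValues, BYTES_PER_DOUBLE);
    }

    public static void main(String[] args) {
        long numVectors = 5_000_000L; // "5+ million" from the message
        long cardinality = 1_000L;    // hypothetical value, pick your own
        long bytes = estimateBytes(numVectors, cardinality);
        System.out.println(numVectors + " vectors x " + cardinality
                + " doubles = " + bytes + " bytes");
    }
}
```

With those assumed numbers the values alone come to 5,000,000 x 1,000 x 8 = 40,000,000,000 bytes, roughly 40 GB, which is already more than a single in-memory List would comfortably hold on most machines.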