From: Timothy Potter
To: user@mahout.apache.org
Date: Wed, 2 Feb 2011 16:49:46 -0700
Subject: Re: KMeans Clustering Issues

Hi Chris,

If I'm reading your message correctly, it sounds like you are trying to pass
sequence files as input to the clustering job. The clustering jobs require
vectors as input, not just sequence files. So make sure you are pointing to
the output of seq2sparse, which would be something like:

path/tfidf-vectors

or

path/tf-vectors

Cheers,
Tim

On Wed, Feb 2, 2011 at 1:21 PM, Jeff Eastman wrote:

> Sounds like you might not be using the mahout-core-0.4-job.jar file? Also,
> we don't run on Hadoop 0.20.1, only 0.20.2. Finally, trunk always has the
> latest and greatest patches in it and the clustering stuff is quite stable
> there.
>
> Jeff
>
> -----Original Message-----
> From: McConnell, Christopher (GE Global Research) [mailto:mcconnel@ge.com]
> Sent: Wednesday, February 02, 2011 11:35 AM
> To: user@mahout.apache.org
> Subject: KMeans Clustering Issues
>
> All,
>
> I've begun to look into Mahout on top of Hadoop, specifically for
> large-scale cluster analysis.
>
> I am running into an issue, however, when attempting to run
> KMeansDriver.run(Configuration, Path, Path, Path, DistanceMeasure, double,
> int, boolean, boolean) with the last argument (runSequential) set to false
> when the data is stored on HDFS.
>
> I've seen multiple listings about this claiming a fix within the
> KMeansDriver by adding the job.setJarByClass() method call; however, I am
> still getting the typical ClassNotFoundException:
> org.apache.mahout.math.Vector.
>
> A quick overview: we've created a Map job to take our current dataset and
> convert it into the sequence files required for the driver to be executed.
> We have then tried a few different ways of calling KMeansDriver.run():
> either within the same driver as the previous MR job, or separately in a
> new JVM. Both of these tests were run through the Hadoop environment. Next,
> I've tried running a standalone Java application, setting up the
> configuration to read from HDFS but not running within the Hadoop
> environment; this gives us the same ClassNotFoundException.
>
> Our versions are Mahout 0.4, Hadoop 0.20.1+169.89 and Hadoop 0.20.2 (we
> have multiple clusters for testing).
>
> I have done other tests with the KMeansDriver that did work; for example,
> running the method in memory works fine. We can also run the clustering
> over MapReduce if the job is launched through a java -jar command and the
> data is stored locally. Finally, I can execute the mahout binary with the
> kmeans argument (./mahout kmeans -c path -i path -x #), which also works
> fine; however, we do not want to rely on creating multiple stages or
> running multiple (and separate) applications.
>
> Any thoughts are appreciated.
>
> Thanks,
> Chris
>
> Christopher McConnell
> Computer Scientist
> Advanced Computing Lab
> Edison Engineering Development Program
> GE Global Research
>
> T +1 518 387 5176
> mcconnel@ge.com
>
> One Research Circle
> Niskayuna, NY 12309
>
> GE Imagination at Work
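
A side note on the ClassNotFoundException for org.apache.mahout.math.Vector quoted above: one way to narrow it down is to check whether the class is loadable in the JVM that actually fails. The following is only a minimal sketch using plain JDK calls. ClasspathCheck and isLoadable are hypothetical names, not part of Mahout or Hadoop, and the check only says something about the JVM it runs in. Running it on the client tells you nothing about the task JVMs on the cluster, so for a MapReduce job it would have to be called from inside the task itself.

```java
// Hypothetical diagnostic helper; not part of Mahout or Hadoop.
final class ClasspathCheck {

    /** Returns true if the named class can be found by this JVM's class loader. */
    public static boolean isLoadable(String className) {
        // Prefer the context class loader, since frameworks often install their own.
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClasspathCheck.class.getClassLoader();
        }
        try {
            // 'false' means: locate the class but do not run its static initializers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class named in the exception from the quoted message.
        String name = args.length > 0 ? args[0] : "org.apache.mahout.math.Vector";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

If the class is loadable on the client but the job still fails, that points at the jar shipped to the tasks rather than at the local classpath, which is consistent with Jeff's question about the mahout-core-0.4-job.jar file.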