mahout-user mailing list archives

From Lokendra Singh <lsingh....@gmail.com>
Subject Re: KMeans Clustering Issues
Date Thu, 03 Feb 2011 18:21:39 GMT
Hi,

If you are mainly facing ClassNotFound problems in the Hadoop environment,
I would suggest putting all the Mahout jars in HADOOP_CLASSPATH in
'$HADOOP_HOME/conf/hadoop-env.sh'. Also, when running the MR job, make sure
that $HADOOP_HOME/conf is on your classpath.
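
To confirm whether the jars are actually visible, a small check along these lines may help. It is only a sketch: ClasspathCheck and isLoadable are names made up for illustration (they are not part of Mahout or Hadoop), and it only inspects the classpath of the JVM it runs in, so a "found" result says nothing definite about the task JVMs on the cluster nodes.

```java
// Hypothetical helper, not part of Mahout or Hadoop: reports whether
// the named classes can be located on the current JVM's classpath.
final class ClasspathCheck {

    // Returns true if the class can be found by this class's loader.
    // The class is not initialized, so no static initializers run.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the exception from this thread.
        String[] names = args.length > 0
                ? args
                : new String[] {"org.apache.mahout.math.Vector"};
        for (String name : names) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath as the failing driver; with no arguments it checks org.apache.mahout.math.Vector, the class named in the exception.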

Regards
Lokendra


On Thu, Feb 3, 2011 at 11:43 PM, Chris McConnell <c.t.mcconnell.ge@gmail.com> wrote:

> Hi Tim, Jeff,
> First, sorry for starting a new thread; apparently our proxy will not
> let the list replies come through.
> In any event, to answer both of you:
>
> Jeff - you are correct, we did not use the core-job jar; however, we
> do add all the JAR dependencies (util, math, commons,
> collections...) through Maven dependencies. We also tried to run this
> on Hadoop 0.20.2, but received the same result. Note that I can run
> Mahout without a problem as a standalone binary (using MapReduce, not
> in-memory clustering) on the Hadoop 0.20.1+...
>
> Tim - Thanks for the reply, I should've been more specific. We are
> converting our data to NamedVectors and writing those out into
> SequenceFiles for the clustering algorithm(s). Once this is done, I
> select the first x Vectors (re-read from the newly created
> SequenceFiles to ensure the read and write are correct) and create a
> new SequenceFile with the Cluster objects. The output appears correct,
> as we can run the Mahout binary from a command line and obtain
> results; just the chaining process is failing.
>
> Here is a quick rundown of the application (some code removed to shorten).
> Thanks again guys, any thoughts are appreciated.
>
> Chris
>
>   -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Job job = new Job(conf, "Example Hadoop");
> job.setJarByClass(DataConversionForClustering.class);
> job.setMapperClass(LZORMDClusterConversionMapper.class);
> // No reducer needed.
> job.setNumReduceTasks(0);
> // ensure it's in Mahout's required format
> job.setOutputFormatClass(SequenceFileOutputFormat.class);
> job.setOutputKeyClass(LongWritable.class);
> job.setOutputValueClass(VectorWritable.class);
> if (inputPathForConversion.contains(",")) {
> // code omitted (the rest of this if block is not in the original post)
> FileOutputFormat.setOutputPath(job, new Path(outputPathForVectors));
> boolean success = job.waitForCompletion(true);
> // in order for the clustering to work for all run techniques, it needs to start with "part"
> Path outputPath = new Path(outputPathForClusters + "part-clusters");
> SequenceFile.Writer clusterWriter = new SequenceFile.Writer(
>     FileSystem.get(conf), conf, outputPath, Text.class, Cluster.class);
> // code omitted
> Cluster clusterPoint = new Cluster(value.get(), currentCluster,
>     new EuclideanDistanceMeasure());
> clusterPoint.observe(clusterPoint.getCenter());
> clusterWriter.append(new Text(clusterPoint.getIdentifier()), clusterPoint);
> currentCluster++;
> // code omitted
> reader.close();
> clusterWriter.close();
> KMeansDriver.run(conf, new Path(otherArgs[0].trim()), new Path(otherArgs[1].trim()),
>     new Path("vector/final_output/"), new EuclideanDistanceMeasure(),
>     0.00001, 500, true, false);
>
>  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Hi Chris,
> If I'm reading your message correctly, it sounds like you are trying to
> pass sequence files as input to the clustering job. The clustering jobs
> require vectors as input, not just sequence files. So make sure you are
> pointing to the output of seq2sparse, which would be something like:
> path/tfidf-vectors or path/tf-vectors
> Cheers,
> Tim
> On Wed, Feb 2, 2011 at 1:21 PM, Jeff Eastman <jeastman@narus.com> wrote:
> > Sounds like you might not be using the mahout-core-0.4-job.jar file?
> > Also, we don't run on Hadoop 0.20.1, only 20.2. Finally, trunk always
> > has the latest and greatest patches in it and the clustering stuff is
> > quite stable there.
> >
> > Jeff
> >
> > -----Original Message-----
> > From: McConnell, Christopher (GE Global Research) [mailto:mcconnel@ge.com]
> > Sent: Wednesday, February 02, 2011 11:35 AM
> > To: user@mahout.apache.org
> > Subject: KMeans Clustering Issues
> >
> > All,
> >
> > I've begun to look into Mahout on top of Hadoop, specifically for
> > large-scale cluster analysis.
> >
> > I am running into an issue, however, when attempting to run
> > KMeansDriver.run(Configuration, Path, Path, Path, DistanceMeasure, double,
> > int, Boolean, Boolean) with the last argument (runSequential) set to false
> > when the data is stored on HDFS.
> >
> > I've seen multiple listings about this claiming a fix within the
> > KMeansDriver by adding the job.setJarByClass() method call; however, I am
> > still getting the typical ClassNotFoundException:
> > org.apache.mahout.math.Vector.
> >
> > A quick overview: we've created a Map job to take our current dataset and
> > convert it into the sequence files required for the driver to be executed.
> > We have then tried a few different ways of calling KMeansDriver.run(),
> > either within the same driver as the previous MR job or separately in a
> > new JVM. Both of these tests were run through the Hadoop environment. Next,
> > I tried running a standalone Java application, setting up the configuration
> > to read from HDFS but not run within the Hadoop environment; this gives us
> > the same ClassNotFoundException.
> >
> > Our versions are Mahout 0.4, Hadoop 0.20.1+169.89 and Hadoop 0.20.2 (we
> > have multiple clusters for testing).
> >
> > I have done other tests with the KMeansDriver that did work; for example,
> > running the method in memory works fine. We can also run the clustering
> > over MapReduce if the job is launched through a java -jar command and the
> > data is stored locally. Finally, I can execute the mahout binary with the
> > kmeans argument (./mahout kmeans -c path -i path -x #), which also works
> > fine; however, we do not want to rely on creating multiple stages or
> > running multiple (and separate) applications.
> >
> > Any thoughts are appreciated.
> > Thanks,
> > Chris
> >
> >
> > Christopher McConnell
> > Computer Scientist
> > Advanced Computing Lab
> > Edison Engineering Development Program
> > GE Global Research
> >
> > T  +1 518 387 5176
> > mcconnel@ge.com
> >
> > One Research Circle
> > Niskayuna, NY 12309
> >
> > GE Imagination at Work
> >
> >
> >
>
