mahout-user mailing list archives

From Chris McConnell <c.t.mcconnell...@gmail.com>
Subject Re: KMeans Clustering Issues
Date Sun, 06 Feb 2011 18:00:32 GMT
Hi Lokendra,

Great point, and it turned out to be the case. We had synced the path (or
thought we had), but it seems it didn't take.

Thanks again, it looks like things are working as expected at this point.
Cheers,
Chris
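
For anyone hitting the same ClassNotFoundException: one quick way to see
whether a classpath change actually took is to try loading the class by
name from the same environment. This is only a minimal sketch in plain
Java; the class name ClasspathCheck and the helper isLoadable are
hypothetical, and the default class name is simply the one from the
exception reported in this thread.

```java
public class ClasspathCheck {
    // Hypothetical helper: true if the named class can be located on the
    // classpath of the JVM this code runs in.
    static boolean isLoadable(String className) {
        try {
            // initialize=false: locate the class without running its
            // static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the class named in the exception from this thread.
        String name = args.length > 0 ? args[0] : "org.apache.mahout.math.Vector";
        System.out.println(name + (isLoadable(name) ? " found" : " NOT found on classpath"));
    }
}
```

Note that this only checks the JVM it runs in. Map and reduce tasks run in
separate JVMs on the cluster nodes, so a pass here does not by itself prove
the tasks can see the jars.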

On Thu, Feb 3, 2011 at 1:21 PM, Lokendra Singh <lsingh.969@gmail.com> wrote:
> Hi,
>
> If you are mainly facing problems with ClassNotFound in the Hadoop
> environment, I would suggest putting all the Mahout jars in HADOOP_CLASSPATH
> in '$HADOOP_HOME/conf/hadoop-env.sh'. Also, while running the MR job, make
> sure that $HADOOP_HOME/conf is on your classpath.
>
> Regards
> Lokendra
>
>
> On Thu, Feb 3, 2011 at 11:43 PM, Chris McConnell <c.t.mcconnell.ge@gmail.com> wrote:
>
>> Hi Tim, Jeff,
>> First, sorry for starting a new thread, apparently our proxy will not
>> let the listing replies come through.
>> In any event, to answer both of you:
>>
>> Jeff - you are correct, we did not use the core-job jar; however, we
>> add all the JAR dependencies (util, math, commons, collections...)
>> through Maven dependencies. We also tried to run this on Hadoop 0.20.2,
>> but received the same result. Note that I can run Mahout without a
>> problem as a standalone binary (using MapReduce, not in-memory
>> clustering) on the Hadoop 0.20.1+...
>>
>> Tim - Thanks for the reply, I should've been more specific. We are
>> converting our data to NamedVectors and writing those out into
>> SequenceFiles for the clustering algorithm(s). Once this is done, I
>> select the first x Vectors (re-read from the newly created
>> SequenceFiles to ensure the read and write are correct) and create a
>> new SequenceFile with the Cluster objects. The output appears correct,
>> as we can run the Mahout binary from a command line and obtain
>> results; just the chaining process is failing.
>>
>> Here is a quick rundown of the application (some code removed to shorten).
>> Thanks again guys, any thoughts are appreciated.
>>
>> Chris
>>
>>   -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Job job = new Job(conf, "Example Hadoop");
>> job.setJarByClass(DataConversionForClustering.class);
>> job.setMapperClass(LZORMDClusterConversionMapper.class);
>> // No reducer needed.
>> job.setNumReduceTasks(0);
>> // Ensure the output is in Mahout's required format.
>> job.setOutputFormatClass(SequenceFileOutputFormat.class);
>> job.setOutputKeyClass(LongWritable.class);
>> job.setOutputValueClass(VectorWritable.class);
>> if (inputPathForConversion.contains(",")) {
>>   // code omitted
>> }
>> FileOutputFormat.setOutputPath(job, new Path(outputPathForVectors));
>> boolean success = job.waitForCompletion(true);
>> // For the clustering to work with all run techniques, the file name
>> // needs to start with "part".
>> Path outputPath = new Path(outputPathForClusters + "part-clusters");
>> SequenceFile.Writer clusterWriter = new SequenceFile.Writer(
>>     FileSystem.get(conf), conf, outputPath, Text.class, Cluster.class);
>> // code omitted
>> Cluster clusterPoint = new Cluster(value.get(), currentCluster,
>>     new EuclideanDistanceMeasure());
>> clusterPoint.observe(clusterPoint.getCenter());
>> clusterWriter.append(new Text(clusterPoint.getIdentifier()), clusterPoint);
>> currentCluster++;
>> // code omitted
>> reader.close();
>> clusterWriter.close();
>> KMeansDriver.run(conf, new Path(otherArgs[0].trim()),
>>     new Path(otherArgs[1].trim()), new Path("vector/final_output/"),
>>     new EuclideanDistanceMeasure(), 0.00001, 500, true, false);
>>
>>  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Hi Chris,
>> If I'm reading your message correctly, it sounds like you are trying to
>> pass sequence files as input to the clustering job. The clustering jobs
>> require vectors as input, not just sequence files. So make sure you are
>> pointing to the output of seq2sparse, which would be something like
>> path/tfidf-vectors or path/tf-vectors.
>> Cheers,
>> Tim
>> On Wed, Feb 2, 2011 at 1:21 PM, Jeff Eastman <jeastman@narus.com> wrote:
>> > Sounds like you might not be using the mahout-core-0.4-job.jar file?
>> > Also, we don't run on Hadoop 0.20.1, only 0.20.2. Finally, trunk always
>> > has the latest and greatest patches in it and the clustering stuff is
>> > quite stable there.
>> >
>> > Jeff
>> >
>> > -----Original Message-----
>> > From: McConnell, Christopher (GE Global Research) [mailto:mcconnel@ge.com]
>> > Sent: Wednesday, February 02, 2011 11:35 AM
>> > To: user@mahout.apache.org
>> > Subject: KMeans Clustering Issues
>> >
>> > All,
>> >
>> > I've begun to look into Mahout on top of Hadoop, specifically for
>> > large-scale cluster analysis.
>> >
>> > I am running into an issue, however, when attempting to run
>> > KMeansDriver.run(Configuration, Path, Path, Path, DistanceMeasure,
>> > double, int, Boolean, Boolean) with the last argument (runSequential)
>> > set to false when the data is stored on HDFS.
>> >
>> > I've seen multiple listings about this claiming a fix within the
>> > KMeansDriver by adding the job.setJarByClass() method call; however, I
>> > am still getting the typical ClassNotFoundException:
>> > org.apache.mahout.math.Vector.
>> >
>> > A quick overview: we've created a Map job to take our current dataset
>> > and convert it into the Sequence files required for the driver to be
>> > executed. We have then tried a few different ways of calling
>> > KMeansDriver.run(): either within the same driver as the previous MR
>> > job or separately in a new JVM. Both of these tests were run through
>> > the Hadoop environment. Next, I tried running a standalone Java
>> > application, setting up the configuration to read from HDFS, but not
>> > run within the Hadoop environment; this gives us the same
>> > ClassNotFoundException.
>> >
>> > Our versions are Mahout 0.4, Hadoop 0.20.1+169.89 and Hadoop 0.20.2 (we
>> > have multiple clusters for testing).
>> >
>> > I have done other tests with the KMeansDriver that did work; for
>> > example, running the method in memory works fine. We can also run the
>> > clustering over MapReduce if the job is launched through a java -jar
>> > command and the data is stored locally. Finally, I can execute the
>> > mahout binary with the kmeans argument (./mahout kmeans -c path -i path
>> > -x #), which also works fine; however, we do not want to rely on
>> > creating multiple stages or running multiple (and separate)
>> > applications.
>> >
>> > Any thoughts are appreciated.
>> > Thanks,
>> > Chris
>> >
>> >
>> > Christopher McConnell
>> > Computer Scientist
>> > Advanced Computing Lab
>> > Edison Engineering Development Program
>> > GE Global Research
>> >
>> > T  +1 518 387 5176
>> > mcconnel@ge.com
>> >
>> > One Research Circle
>> > Niskayuna, NY 12309
>> >
>> > GE Imagination at Work
>> >
>> >
>> >
>>
>
