mahout-user mailing list archives

From Alex Hasha <a...@bundle.com>
Subject Having a devil of a time running k-means examples with Mahout 0.6 / Hadoop 0.20.2
Date Wed, 09 May 2012 21:13:25 GMT
Hello all,

For quite a while, we have been unable to run the Reuters k-means
clustering example on our system without errors.  We are running Hadoop
0.20.2 on a medium-sized cluster, and have installed Mahout 0.6.

The example shell scripts that were packaged with the release crashed and
burned, so I have been following the step-by-step instructions for running
k-means on a cluster that are scattered through Chapters 8, 9, and 11 of
Mahout in Action.

In particular, I've manually downloaded
http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz,
unpacked it to examples/reuters, and run

$ mvn -e -q exec:java \
    -Dexec.mainClass="org.apache.lucene.benchmark.utils.ExtractReuters" \
    -Dexec.args="reuters/ reuters-extracted/"

to extract the raw text files to reuters-extracted.  I then uploaded
reuters-extracted/ to HDFS (/user/hadoop/mahout) and ran

$ bin/mahout seqdirectory -c UTF-8 -i mahout/reuters-extracted/ \
    -o mahout/reuters-seqfiles

which seemed to run without error, and

$ bin/mahout seq2sparse -i mahout/reuters-seqfiles/ \
    -o mahout/reuters-vectors -ow

which also seemed to run without error.

There is nontrivial data in the reuters-vectors output directory:

$ hadoop fs -du mahout/reuters-vectors
Found 7 items
869751      hdfs://master:54310/user/hadoop/mahout/reuters-vectors/df-count
824086      hdfs://master:54310/user/hadoop/mahout/reuters-vectors/dictionary.file-0
844593      hdfs://master:54310/user/hadoop/mahout/reuters-vectors/frequency.file-0
17148933    hdfs://master:54310/user/hadoop/mahout/reuters-vectors/tf-vectors
16931936    hdfs://master:54310/user/hadoop/mahout/reuters-vectors/tfidf-vectors
15098540    hdfs://master:54310/user/hadoop/mahout/reuters-vectors/tokenized-documents
1018157     hdfs://master:54310/user/hadoop/mahout/reuters-vectors/wordcount

I then ran k-means with the following command line:

$ bin/mahout kmeans -i mahout/reuters-vectors/tfidf-vectors/ \
    -c mahout/reuters-initial-clusters -o mahout/reuters-kmeans-clusters \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
    -cd 1.0 -k 20 -x 20 -cl

This is the command line recommended in Mahout in Action.  The output
follows.  The error appears to relate to the binary format headers of one
of the input files, and my debugging skills are exhausted at this point.
If anyone has solved a similar problem, I would be very grateful for a
hint or two.

MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/home/hadoop/hadoop-0.20.2
HADOOP_CONF_DIR=/home/hadoop/hadoop-0.20.2/conf
MAHOUT-JOB:
/home/hadoop/mahout-distribution-0.6/examples/target/mahout-examples-0.6-job.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/home/hadoop/hadoop-0.20.2/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/home/hadoop/hadoop-0.20.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
12/05/09 16:42:54 INFO common.AbstractJob: Command line arguments:
{--clustering=null, --clusters=mahout/reuters-initial-clusters,
--convergenceDelta=1.0,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --input=mahout/reuters-vectors/tfidf-vectors/,
--maxIter=20, --method=mapreduce, --numClusters=20,
--output=mahout/reuters-kmeans-clusters, --startPhase=0, --tempDir=temp}
12/05/09 16:42:54 INFO util.NativeCodeLoader: Loaded the native-hadoop
library
12/05/09 16:42:54 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
12/05/09 16:42:54 INFO compress.CodecPool: Got brand-new compressor
12/05/09 16:42:56 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to
mahout/reuters-initial-clusters/part-randomSeed
12/05/09 16:42:56 INFO kmeans.KMeansDriver: Input:
mahout/reuters-vectors/tfidf-vectors Clusters In:
mahout/reuters-initial-clusters/part-randomSeed Out:
mahout/reuters-kmeans-clusters Distance:
org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
12/05/09 16:42:56 INFO kmeans.KMeansDriver: convergence: 1.0 max
Iterations: 20 num Reduce Tasks: org.apache.mahout.math.VectorWritable
Input Vectors: {}
12/05/09 16:42:56 INFO kmeans.KMeansDriver: K-Means Iteration 1
12/05/09 16:42:58 INFO input.FileInputFormat: Total input paths to process
: 1
12/05/09 16:42:58 INFO mapred.JobClient: Running job: job_201205031638_0165
12/05/09 16:42:59 INFO mapred.JobClient:  map 0% reduce 0%
12/05/09 16:43:14 INFO mapred.JobClient: Task Id :
attempt_201205031638_0165_m_000000_0, Status : FAILED
java.lang.IllegalArgumentException: Unknown flags set: %d [1000000]
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:86)
at org.apache.mahout.math.VectorWritable.readVector(VectorWritable.java:190)
at org.apache.mahout.clustering.AbstractCluster.readFields(AbstractCluster.java:98)
at org.apache.mahout.clustering.DistanceMeasureCluster.readFields(DistanceMeasureCluster.java:53)
at org.apache.mahout.clustering.kmeans.Cluster.readFields(Cluster.java:70)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:76)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.computeNext(SequenceFileValueIterator.java:35)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
at org.apache.mahout.clustering.kmeans.KMeansUtil.configureWithClusterInfo(KMeansUtil.java:42)
at org.apache.mahout.clustering.kmeans.KMeansMapper.setup(KMeansMapper.java:57)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

attempt_201205031638_0165_m_000000_0: SLF4J: Class path contains multiple
SLF4J bindings.
attempt_201205031638_0165_m_000000_0: SLF4J: Found binding in
[jar:file:/home/hadoop/hadoop-0.20.2/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201205031638_0165_m_000000_0: SLF4J: Found binding in
[file:/mnt/secondary/hadoop/temp/taskTracker/jobcache/job_201205031638_0165/jars/org/slf4j/impl/StaticLoggerBinder.class]
attempt_201205031638_0165_m_000000_0: SLF4J: See
http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
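
For anyone reading the trace above, here is a rough sketch of how the
exception message might be decoded.  Two things in it are assumptions
rather than facts taken from this message: that the bracketed value
1000000 is the flag byte printed in base 2, and that VectorWritable in
this Mahout version reserves only the low four bits for flags.  The
variable names (flags_bin, num_flags, and so on) are made up for
illustration.  The literal "%d" in the message is expected either way,
since Guava's Preconditions substitutes only %s and appends any unmatched
arguments in square brackets.

```shell
#!/bin/sh
# Value printed in brackets by the exception.
# ASSUMPTION: it is the flag byte rendered as a base-2 string.
flags_bin="1000000"

# ASSUMPTION: only the low 4 bits are defined flags in this Mahout version.
num_flags=4

# Convert the base-2 string to decimal, one digit at a time (POSIX sh).
flags_dec=0
rest=$flags_bin
while [ -n "$rest" ]; do
  digit=${rest%"${rest#?}"}            # first character of $rest
  rest=${rest#?}                       # drop that character
  flags_dec=$(( $flags_dec * 2 + $digit ))
done

# Any bit above the assumed flag bits makes this value nonzero.
unknown_bits=$(( $flags_dec >> $num_flags ))

printf 'flag byte: %d (0x%02x)\n' "$flags_dec" "$flags_dec"
if [ "$unknown_bits" -ne 0 ]; then
  echo "bits set above the assumed flag range"
else
  echo "only assumed flag bits are set"
fi
```

Under those assumptions the byte is 64 (0x40), with a bit set above the
four flag bits, which would mean the reader hit data it does not recognize
as a vector header.  That would be consistent with, but does not prove, a
format or version mismatch between whatever wrote part-randomSeed and
whatever read it in the map task.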

Best,

Alex
