mahout-user mailing list archives

From "Choon-Siang \"Jeffrey04\" Lai" <mycyber...@yahoo.com>
Subject Re: #clojure #fkmeans - Clustering of Test Data Failed
Date Mon, 12 Sep 2011 07:49:54 GMT
Hi Danny,

I have read a small portion of the source code. For variation 1, the driver generates the
initial clusters with RandomSeedGenerator if none are found in the path, so I don't have to
create them myself. For variation 2, I did generate the initial clusters, using this code:

        (RandomSeedGenerator/buildRandom
          hadoop_configuration
          input_path
          clusters_in_path
          (int 2)
          (new EuclideanDistanceMeasure))


I should have also mentioned that I am running my code using mahout 0.6-snapshot :)

Thanks for the reply anyway :)

best wishes,
Jeffrey04



>________________________________
>From: Danny Bickson <danny.bickson@gmail.com>
>To: user@mahout.apache.org; Jeffrey <mycyberpet@yahoo.com>
>Sent: Monday, September 12, 2011 3:31 PM
>Subject: Re: #clojure #fkmeans - Clustering of Test Data Failed
>
>
>Hi Jeffrey!
>I have encountered this problem as well. The workaround is to run one iteration of k-means to create an initial cluster assignment, and
>then run fuzzy k-means using the output from that first iteration of k-means.
>
>Hope this helps, 
>
>Danny Bickson
>
>
>On Mon, Sep 12, 2011 at 10:15 AM, Jeffrey <mycyberpet@yahoo.com> wrote:
>
>>Hi,
>>
>>I have test data consisting of a number of points, written to a sequence file using the Clojure script below. (I am equally bad at both Java and Clojure, but since I really don't like Java, I write my scripts in Clojure whenever possible.)
>>
>>    #!./bin/clj
>>    (ns sensei.sequence.core)
>>
>>    (require 'clojure.string)
>>    (require 'clojure.java.io)
>>
>>    (import org.apache.hadoop.conf.Configuration)
>>    (import org.apache.hadoop.fs.FileSystem)
>>    (import org.apache.hadoop.fs.Path)
>>    (import org.apache.hadoop.io.SequenceFile)
>>    (import org.apache.hadoop.io.Text)
>>
>>    (import org.apache.mahout.math.VectorWritable)
>>    (import org.apache.mahout.math.SequentialAccessSparseVector)
>>
>>    (with-open [reader (clojure.java.io/reader *in*)]
>>      (let [hadoop_configuration ((fn []
>>                                    (let [conf (new Configuration)]
>>                                      (. conf set "fs.default.name" "hdfs://localhost:9000/")
>>                                      conf)))
>>            hadoop_fs (FileSystem/get hadoop_configuration)]
>>        (reduce
>>          (fn [writer [index value]]
>>            (. writer append index value)
>>            writer)
>>          (SequenceFile/createWriter
>>            hadoop_fs
>>            hadoop_configuration
>>            (new Path "test/sensei")
>>            Text
>>            VectorWritable)
>>          (map
>>            (fn [[tag row_vector]]
>>              (let [input_index (new Text tag)
>>                    input_vector (new VectorWritable)]
>>                (. input_vector set row_vector)
>>                [input_index input_vector]))
>>            (map
>>              (fn [[tag photo_list]]
>>                (let [photo_map (apply hash-map photo_list)
>>                      input_vector (new SequentialAccessSparseVector (count (vals photo_map)))]
>>                  (loop [frequency_list (vals photo_map)]
>>                    (if (zero? (count frequency_list))
>>                      [tag input_vector]
>>                      (when-not (zero? (count frequency_list))
>>                        (. input_vector set
>>                           (mod (count frequency_list) (count (vals photo_map)))
>>                           (Integer/parseInt (first frequency_list)))
>>                        (recur (rest frequency_list)))))))
>>              (reduce
>>                (fn [result next_line]
>>                  (let [[tag photo frequency] (clojure.string/split next_line #" ")]
>>                    (update-in result [tag]
>>                      #(if (nil? %)
>>                         [photo frequency]
>>                         (conj % photo frequency)))))
>>                {}
>>                (line-seq reader)))))))
>>
>>Basically the script receives input (from stdin) in this format
>>
>>    tag_uri image_uri count
>>
>>e.g.
>>
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/13980928@N03/6001200971 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/21207178@N07/5441742937 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/25845846@N06/3033371575 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/30366924@N08/5772100510 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/31343451@N00/5957189406 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/36662563@N00/4815218552 1
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/38583880@N00/5686968462 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/43335486@N00/5794673203 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/46857830@N03/5651576112 0
>>    http://flickr.com/photos/tags/ísland http://flickr.com/photos/99996011@N00/5396566822 0
>>
>>The script then turns them into a sequence file in which each entry represents one point (10 dimensions in this example), with the key set to the tag_uri <http://flickr.com/photos/tags/ísland> and the value set to the point described by the frequency vector (0 0 0 0 0 1 0 0 0 0).
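
For what it's worth, the mapping described above can be sketched in plain Clojure without Hadoop or Mahout. This is only a minimal sketch under stated assumptions: `parse-line`, `frequency-vectors`, and the sample lines are hypothetical names and data, not part of the original script, and it assumes every tag should share one dimension order (photos sorted by URI), which the original script does not do.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (not from the original script): parse one
;; "tag_uri image_uri count" line into its three fields.
(defn parse-line [line]
  (let [[tag photo frequency] (str/split line #" ")]
    {:tag tag :photo photo :frequency (Integer/parseInt frequency)}))

;; Group the parsed lines by tag and build one frequency vector per tag.
;; Assumption: photos are sorted so that every tag uses the same
;; dimension order; a photo missing for a tag counts as 0.
(defn frequency-vectors [lines]
  (let [records (map parse-line lines)
        photos  (sort (distinct (map :photo records)))]
    (into {}
          (for [[tag rows] (group-by :tag records)]
            (let [by-photo (into {} (map (juxt :photo :frequency) rows))]
              [tag (mapv #(get by-photo % 0) photos)])))))

;; Made-up sample input in the same three-column format.
(def sample-lines
  ["tagA photo1 0"
   "tagA photo2 1"
   "tagB photo1 2"])

(frequency-vectors sample-lines)
;; => {"tagA" [0 1], "tagB" [2 0]}
```

Each resulting vector could then be copied into a Mahout vector and wrapped in a VectorWritable as in the script above. Fixing the dimension order up front is a design choice of this sketch, so that the same index always refers to the same photo.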
>>
>>I then use a script (in 2 different variations) to submit the data as a clustering job. However, I am getting an error that I don't know how to fix. It seems that something is wrong with the initial clusters.
>>
>>Script variation 1
>>
>>    #!./bin/clj
>>
>>    (ns sensei.clustering.fkmeans)
>>
>>    (import org.apache.hadoop.conf.Configuration)
>>    (import org.apache.hadoop.fs.Path)
>>
>>    (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
>>    (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
>>    (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)
>>
>>    (let [hadoop_configuration ((fn []
>>                                    (let [conf (new Configuration)]
>>                                      (. conf set "fs.default.name" "hdfs://localhost:9000/")
>>                                      conf)))
>>          driver (new FuzzyKMeansDriver)]
>>      (. driver setConf hadoop_configuration)
>>      (. driver
>>         run
>>         (into-array String ["--input" "test/sensei"
>>                             "--output" "test/clusters"
>>                             "--clusters" "test/clusters/clusters-0"
>>                             "--clustering"
>>                             "--overwrite"
>>                             "--emitMostLikely" "false"
>>                             "--numClusters" "3"
>>                             "--maxIter" "10"
>>                             "--m" "5"])))
>>
>>Script variation 2:
>>
>>    #!./bin/clj
>>
>>    (ns sensei.clustering.fkmeans)
>>
>>    (import org.apache.hadoop.conf.Configuration)
>>    (import org.apache.hadoop.fs.Path)
>>
>>    (import org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver)
>>    (import org.apache.mahout.common.distance.EuclideanDistanceMeasure)
>>    (import org.apache.mahout.clustering.kmeans.RandomSeedGenerator)
>>
>>    (let [hadoop_configuration ((fn []
>>                                    (let [conf (new Configuration)]
>>                                      (. conf set "fs.default.name" "hdfs://127.0.0.1:9000/")
>>                                      conf)))
>>          input_path (new Path "test/sensei")
>>          output_path (new Path "test/clusters")
>>          clusters_in_path (new Path "test/clusters/cluster-0")]
>>      (FuzzyKMeansDriver/run
>>        hadoop_configuration
>>        input_path
>>        (RandomSeedGenerator/buildRandom
>>          hadoop_configuration
>>          input_path
>>          clusters_in_path
>>          (int 2)
>>          (new EuclideanDistanceMeasure))
>>        output_path
>>        (new EuclideanDistanceMeasure)
>>        (double 0.5)
>>        (int 10)
>>        (float 5.0)
>>        true
>>        false
>>        (double 0.0)
>>        false)) ; runSequential
>>
>>I am getting the same error with both variations
>>
>>    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>    SLF4J: Defaulting to no-operation (NOP) logger implementation
>>    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>    11/08/25 15:20:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>    11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new compressor
>>    11/08/25 15:20:16 INFO compress.CodecPool: Got brand-new decompressor
>>    11/08/25 15:20:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
>>    11/08/25 15:20:17 INFO input.FileInputFormat: Total input paths to process : 1
>>    11/08/25 15:20:17 INFO mapred.JobClient: Running job: job_local_0001
>>    11/08/25 15:20:17 INFO mapred.MapTask: io.sort.mb = 100
>>    11/08/25 15:20:17 INFO mapred.MapTask: data buffer = 79691776/99614720
>>    11/08/25 15:20:17 INFO mapred.MapTask: record buffer = 262144/327680
>>    11/08/25 15:20:17 WARN mapred.LocalJobRunner: job_local_0001
>>    java.lang.IllegalStateException: No clusters found. Check your -c path.
>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansMapper.setup(FuzzyKMeansMapper.java:62)
>>            at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
>>            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>>            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>            at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:210)
>>    11/08/25 15:20:18 INFO mapred.JobClient:  map 0% reduce 0%
>>    11/08/25 15:20:18 INFO mapred.JobClient: Job complete: job_local_0001
>>    11/08/25 15:20:18 INFO mapred.JobClient: Counters: 0
>>    Exception in thread "main" java.lang.RuntimeException: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
>>            at clojure.lang.Util.runtimeException(Util.java:153)
>>            at clojure.lang.Compiler.eval(Compiler.java:6417)
>>            at clojure.lang.Compiler.load(Compiler.java:6843)
>>            at clojure.lang.Compiler.loadFile(Compiler.java:6804)
>>            at clojure.main$load_script.invoke(main.clj:282)
>>            at clojure.main$script_opt.invoke(main.clj:342)
>>            at clojure.main$main.doInvoke(main.clj:426)
>>            at clojure.lang.RestFn.invoke(RestFn.java:436)
>>            at clojure.lang.Var.invoke(Var.java:409)
>>            at clojure.lang.AFn.applyToHelper(AFn.java:167)
>>            at clojure.lang.Var.applyTo(Var.java:518)
>>            at clojure.main.main(main.java:37)
>>    Caused by: java.lang.InterruptedException: Fuzzy K-Means Iteration failed processing test/clusters/cluster-0/part-randomSeed
>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.runIteration(FuzzyKMeansDriver.java:252)
>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersMR(FuzzyKMeansDriver.java:421)
>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:345)
>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
>>            at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
>>            at clojure.lang.Compiler.eval(Compiler.java:6406)
>>            ... 10 more
>>
>>Notice there is a runSequential flag in the 2nd variation; if I set it to true, I get this instead:
>>
>>    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>>    SLF4J: Defaulting to no-operation (NOP) logger implementation
>>    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>>    11/09/07 14:32:32 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>    11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new compressor
>>    11/09/07 14:32:32 INFO compress.CodecPool: Got brand-new decompressor
>>    Exception in thread "main" java.lang.IllegalStateException: Clusters is empty!
>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClustersSeq(FuzzyKMeansDriver.java:361)
>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:343)
>>            at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:295)
>>            at sensei.clustering.fkmeans$eval17.invoke(fkmeans.clj:35)
>>            at clojure.lang.Compiler.eval(Compiler.java:6465)
>>            at clojure.lang.Compiler.load(Compiler.java:6902)
>>            at clojure.lang.Compiler.loadFile(Compiler.java:6863)
>>            at clojure.main$load_script.invoke(main.clj:282)
>>            at clojure.main$script_opt.invoke(main.clj:342)
>>            at clojure.main$main.doInvoke(main.clj:426)
>>            at clojure.lang.RestFn.invoke(RestFn.java:436)
>>            at clojure.lang.Var.invoke(Var.java:409)
>>            at clojure.lang.AFn.applyToHelper(AFn.java:167)
>>            at clojure.lang.Var.applyTo(Var.java:518)
>>            at clojure.main.main(main.java:37)
>>
>>Now, if I cluster the data using the CLI tool, it completes without error:
>>
>>    $ bin/mahout fkmeans --input test/sensei --output test/clusters --clusters test/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>
>>However, even though I pass the --clustering option, I am not seeing any points in the cluster dump generated with this command:
>>
>>    $ ./bin/mahout clusterdump --seqFileDir test/clusters/clusters-1 --pointsDir test/clusters/clusteredPoints --output sensei.txt
>>
>>And yeah, the command completed without any error too.
>>
>>I have been stuck on this problem for months, and I still can't get the clustering done properly :(
>>
>>Best wishes,
>>Jeffrey04
>
>
>