mahout-user mailing list archives

From Jeffrey <mycyber...@yahoo.com>
Subject Re: fkmeans or Cluster Dumper not working?
Date Sun, 24 Jul 2011 07:51:27 GMT
Erm, is there any update? Is the problem reproducible?

Best wishes,
Jeffrey04



>________________________________
>From: Jeffrey <mycyberpet@yahoo.com>
>To: Jeff Eastman <jeastman@Narus.com>; "user@mahout.apache.org" <user@mahout.apache.org>
>Sent: Friday, July 22, 2011 12:40 AM
>Subject: Re: fkmeans or Cluster Dumper not working?
>
>
>Hi Jeff,
>
>
>lol, this is probably my last reply before i fall asleep (GMT+8 here).
>
>
>First thing first, data file is here: http://coolsilon.com/image-tag.mvc
>
>
>Q: What is the cardinality of your vector data?
>about 1000+ rows (resources) * 14 000+ columns (tags)
>Q: Is it sparse or dense?
>sparse (assuming sparse = each vector contains mostly 0)
>Q: How many vectors are you trying to cluster?
>all of them? (1000+ rows)
>Q: What is the exact error you see when fkmeans fails with k=10? With k=50?
>i think i posted the exception when k=50, but will post them again here
>
>
>k=10: fkmeans actually works, but cluster dumper throws an exception. However, if I take out --pointsDir, then it works (output looks OK, but without the points).
>
>
>    $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>    ...
>    $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt
>    Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>    HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>    MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>    11/07/22 00:14:50 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --pointsDir=sensei/clusters/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>            at java.lang.Object.clone(Native Method)
>            at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44)
>            at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39)
>            at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:94)
>            at org.apache.mahout.clustering.WeightedVectorWritable.readFields(WeightedVectorWritable.java:55)
>            at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>            at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
>            at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
>            at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
>            at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
>            at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
>            at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
>            at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
>            at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:255)
>            at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>            at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>            at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>            at java.lang.reflect.Method.invoke(Method.java:616)
>            at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>            at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>            at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>            at java.lang.reflect.Method.invoke(Method.java:616)
>            at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>    $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt
>    Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>    HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>    MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>    11/07/22 00:19:04 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>    11/07/22 00:19:13 INFO driver.MahoutDriver: Program took 9504 ms
>
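The stack trace above shows ClusterDumper.readPoints deserializing each clustered point as a DenseVector. As a back-of-the-envelope sketch of what that could cost, the snippet below assumes roughly 14,000 columns, the 1113 input rows reported by the job counters in the first message of this thread, 8 bytes per double, and (an assumption, not confirmed anywhere in the thread) that with --emitMostLikely false and a threshold of 0 every point is written once per cluster.

```python
# Rough heap estimate for holding all clustered points as dense vectors.
# Figures come from the thread; the once-per-cluster emission is an assumption.
BYTES_PER_DOUBLE = 8

def dense_points_bytes(num_points, num_clusters, cardinality):
    """Bytes needed to hold every (point, cluster) pair as a dense double array."""
    return num_points * num_clusters * cardinality * BYTES_PER_DOUBLE

total = dense_points_bytes(1113, 10, 14000)
print(f"{total} bytes, about {total / 2**20:.0f} MiB")
```

Under these assumptions the raw arrays alone come to roughly 1.2 GB, before any object overhead, which is close to the 1.5GB assigned to the guest machine.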
>
>k=50: fkmeans throws an exception after map 100% reduce 0%, and then retries (map 0% reduce 0%) after the exception
>
>
>    $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>    Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>    HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>    MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>    11/07/22 00:21:07 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=sensei/clusters/clusters-0, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --emitMostLikely=false, --endPhase=2147483647, --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, --numClusters=50, --output=sensei/clusters, --overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
>    11/07/22 00:21:09 INFO common.HadoopUtil: Deleting sensei/clusters
>    11/07/22 00:21:09 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>    11/07/22 00:21:09 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>    11/07/22 00:21:09 INFO compress.CodecPool: Got brand-new compressor
>    11/07/22 00:21:10 INFO compress.CodecPool: Got brand-new decompressor
>    11/07/22 00:21:21 INFO kmeans.RandomSeedGenerator: Wrote 50 vectors to sensei/clusters/clusters-0/part-randomSeed
>    11/07/22 00:21:24 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
>    11/07/22 00:21:25 INFO input.FileInputFormat: Total input paths to process : 1
>    11/07/22 00:21:26 INFO mapred.JobClient: Running job: job_201107211512_0029
>    11/07/22 00:21:27 INFO mapred.JobClient:  map 0% reduce 0%
>    11/07/22 00:22:08 INFO mapred.JobClient:  map 1% reduce 0%
>    11/07/22 00:22:20 INFO mapred.JobClient:  map 2% reduce 0%
>    11/07/22 00:22:33 INFO mapred.JobClient:  map 3% reduce 0%
>    11/07/22 00:22:42 INFO mapred.JobClient:  map 4% reduce 0%
>    11/07/22 00:22:50 INFO mapred.JobClient:  map 5% reduce 0%
>    11/07/22 00:23:00 INFO mapred.JobClient:  map 6% reduce 0%
>    11/07/22 00:23:09 INFO mapred.JobClient:  map 7% reduce 0%
>    11/07/22 00:23:18 INFO mapred.JobClient:  map 8% reduce 0%
>    11/07/22 00:23:27 INFO mapred.JobClient:  map 9% reduce 0%
>    11/07/22 00:23:33 INFO mapred.JobClient:  map 10% reduce 0%
>    11/07/22 00:23:42 INFO mapred.JobClient:  map 11% reduce 0%
>    11/07/22 00:23:45 INFO mapred.JobClient:  map 12% reduce 0%
>    11/07/22 00:23:54 INFO mapred.JobClient:  map 13% reduce 0%
>    11/07/22 00:24:03 INFO mapred.JobClient:  map 14% reduce 0%
>    11/07/22 00:24:09 INFO mapred.JobClient:  map 15% reduce 0%
>    11/07/22 00:24:15 INFO mapred.JobClient:  map 16% reduce 0%
>    11/07/22 00:24:24 INFO mapred.JobClient:  map 17% reduce 0%
>    11/07/22 00:24:30 INFO mapred.JobClient:  map 18% reduce 0%
>    11/07/22 00:24:42 INFO mapred.JobClient:  map 19% reduce 0%
>    11/07/22 00:24:51 INFO mapred.JobClient:  map 20% reduce 0%
>    11/07/22 00:24:57 INFO mapred.JobClient:  map 21% reduce 0%
>    11/07/22 00:25:06 INFO mapred.JobClient:  map 22% reduce 0%
>    11/07/22 00:25:09 INFO mapred.JobClient:  map 23% reduce 0%
>    11/07/22 00:25:19 INFO mapred.JobClient:  map 24% reduce 0%
>    11/07/22 00:25:25 INFO mapred.JobClient:  map 25% reduce 0%
>    11/07/22 00:25:31 INFO mapred.JobClient:  map 26% reduce 0%
>    11/07/22 00:25:37 INFO mapred.JobClient:  map 27% reduce 0%
>    11/07/22 00:25:43 INFO mapred.JobClient:  map 28% reduce 0%
>    11/07/22 00:25:51 INFO mapred.JobClient:  map 29% reduce 0%
>    11/07/22 00:25:58 INFO mapred.JobClient:  map 30% reduce 0%
>    11/07/22 00:26:04 INFO mapred.JobClient:  map 31% reduce 0%
>    11/07/22 00:26:10 INFO mapred.JobClient:  map 32% reduce 0%
>    11/07/22 00:26:19 INFO mapred.JobClient:  map 33% reduce 0%
>    11/07/22 00:26:25 INFO mapred.JobClient:  map 34% reduce 0%
>    11/07/22 00:26:34 INFO mapred.JobClient:  map 35% reduce 0%
>    11/07/22 00:26:40 INFO mapred.JobClient:  map 36% reduce 0%
>    11/07/22 00:26:49 INFO mapred.JobClient:  map 37% reduce 0%
>    11/07/22 00:26:55 INFO mapred.JobClient:  map 38% reduce 0%
>    11/07/22 00:27:04 INFO mapred.JobClient:  map 39% reduce 0%
>    11/07/22 00:27:14 INFO mapred.JobClient:  map 40% reduce 0%
>    11/07/22 00:27:23 INFO mapred.JobClient:  map 41% reduce 0%
>    11/07/22 00:27:28 INFO mapred.JobClient:  map 42% reduce 0%
>    11/07/22 00:27:34 INFO mapred.JobClient:  map 43% reduce 0%
>    11/07/22 00:27:40 INFO mapred.JobClient:  map 44% reduce 0%
>    11/07/22 00:27:49 INFO mapred.JobClient:  map 45% reduce 0%
>    11/07/22 00:27:56 INFO mapred.JobClient:  map 46% reduce 0%
>    11/07/22 00:28:05 INFO mapred.JobClient:  map 47% reduce 0%
>    11/07/22 00:28:11 INFO mapred.JobClient:  map 48% reduce 0%
>    11/07/22 00:28:20 INFO mapred.JobClient:  map 49% reduce 0%
>    11/07/22 00:28:26 INFO mapred.JobClient:  map 50% reduce 0%
>    11/07/22 00:28:35 INFO mapred.JobClient:  map 51% reduce 0%
>    11/07/22 00:28:41 INFO mapred.JobClient:  map 52% reduce 0%
>    11/07/22 00:28:47 INFO mapred.JobClient:  map 53% reduce 0%
>    11/07/22 00:28:53 INFO mapred.JobClient:  map 54% reduce 0%
>    11/07/22 00:29:02 INFO mapred.JobClient:  map 55% reduce 0%
>    11/07/22 00:29:08 INFO mapred.JobClient:  map 56% reduce 0%
>    11/07/22 00:29:17 INFO mapred.JobClient:  map 57% reduce 0%
>    11/07/22 00:29:26 INFO mapred.JobClient:  map 58% reduce 0%
>    11/07/22 00:29:32 INFO mapred.JobClient:  map 59% reduce 0%
>    11/07/22 00:29:41 INFO mapred.JobClient:  map 60% reduce 0%
>    11/07/22 00:29:50 INFO mapred.JobClient:  map 61% reduce 0%
>    11/07/22 00:29:53 INFO mapred.JobClient:  map 62% reduce 0%
>    11/07/22 00:29:59 INFO mapred.JobClient:  map 63% reduce 0%
>    11/07/22 00:30:09 INFO mapred.JobClient:  map 64% reduce 0%
>    11/07/22 00:30:15 INFO mapred.JobClient:  map 65% reduce 0%
>    11/07/22 00:30:23 INFO mapred.JobClient:  map 66% reduce 0%
>    11/07/22 00:30:35 INFO mapred.JobClient:  map 67% reduce 0%
>    11/07/22 00:30:41 INFO mapred.JobClient:  map 68% reduce 0%
>    11/07/22 00:30:50 INFO mapred.JobClient:  map 69% reduce 0%
>    11/07/22 00:30:56 INFO mapred.JobClient:  map 70% reduce 0%
>    11/07/22 00:31:05 INFO mapred.JobClient:  map 71% reduce 0%
>    11/07/22 00:31:15 INFO mapred.JobClient:  map 72% reduce 0%
>    11/07/22 00:31:24 INFO mapred.JobClient:  map 73% reduce 0%
>    11/07/22 00:31:30 INFO mapred.JobClient:  map 74% reduce 0%
>    11/07/22 00:31:39 INFO mapred.JobClient:  map 75% reduce 0%
>    11/07/22 00:31:42 INFO mapred.JobClient:  map 76% reduce 0%
>    11/07/22 00:31:50 INFO mapred.JobClient:  map 77% reduce 0%
>    11/07/22 00:31:59 INFO mapred.JobClient:  map 78% reduce 0%
>    11/07/22 00:32:11 INFO mapred.JobClient:  map 79% reduce 0%
>    11/07/22 00:32:28 INFO mapred.JobClient:  map 80% reduce 0%
>    11/07/22 00:32:37 INFO mapred.JobClient:  map 81% reduce 0%
>    11/07/22 00:32:40 INFO mapred.JobClient:  map 82% reduce 0%
>    11/07/22 00:32:49 INFO mapred.JobClient:  map 83% reduce 0%
>    11/07/22 00:32:58 INFO mapred.JobClient:  map 84% reduce 0%
>    11/07/22 00:33:04 INFO mapred.JobClient:  map 85% reduce 0%
>    11/07/22 00:33:13 INFO mapred.JobClient:  map 86% reduce 0%
>    11/07/22 00:33:19 INFO mapred.JobClient:  map 87% reduce 0%
>    11/07/22 00:33:32 INFO mapred.JobClient:  map 88% reduce 0%
>    11/07/22 00:33:38 INFO mapred.JobClient:  map 89% reduce 0%
>    11/07/22 00:33:47 INFO mapred.JobClient:  map 90% reduce 0%
>    11/07/22 00:33:52 INFO mapred.JobClient:  map 91% reduce 0%
>    11/07/22 00:34:01 INFO mapred.JobClient:  map 92% reduce 0%
>    11/07/22 00:34:10 INFO mapred.JobClient:  map 93% reduce 0%
>    11/07/22 00:34:13 INFO mapred.JobClient:  map 94% reduce 0%
>    11/07/22 00:34:25 INFO mapred.JobClient:  map 95% reduce 0%
>    11/07/22 00:34:31 INFO mapred.JobClient:  map 96% reduce 0%
>    11/07/22 00:34:40 INFO mapred.JobClient:  map 97% reduce 0%
>    11/07/22 00:34:47 INFO mapred.JobClient:  map 98% reduce 0%
>    11/07/22 00:34:56 INFO mapred.JobClient:  map 99% reduce 0%
>    11/07/22 00:35:02 INFO mapred.JobClient:  map 100% reduce 0%
>    11/07/22 00:35:07 INFO mapred.JobClient: Task Id : attempt_201107211512_0029_m_000000_0, Status : FAILED
>    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>            at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>            at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>            at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>            at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>            at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>            at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>            at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>            at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>            at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>            at java.security.AccessController.doPrivileged(Native Method)
>            at javax.security.auth.Subject.doAs(Subject.java:416)
>            at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>            at org.apache.hadoop.mapred.Child.main(Child.java:253)
>
>
>    11/07/22 00:35:09 INFO mapred.JobClient:  map 0% reduce 0%
>    ...
>
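One commonly reported cause of a DiskErrorException from LocalDirAllocator is that no configured local directory has enough free space for the map output, so local disk space in the guest machine may be worth checking. As a sketch only: the k=10 run in the first message of this thread reported FILE_BYTES_WRITTEN=132572666, and if local writes grow linearly with the number of clusters (an assumption), the k=50 figure can be extrapolated and compared with the free space. The directory checked below is a placeholder, not the actual Hadoop local directory of this setup.

```python
import shutil

K10_FILE_BYTES_WRITTEN = 132572666  # counter from the k=10 run in this thread

def extrapolate(counter_bytes, k, base_k=10):
    """Scale a counter from base_k clusters to k clusters, assuming linear growth."""
    return counter_bytes * k // base_k

def has_room(path, needed_bytes):
    """True if the filesystem holding `path` has at least needed_bytes free."""
    return shutil.disk_usage(path).free >= needed_bytes

needed = extrapolate(K10_FILE_BYTES_WRITTEN, 50)
print(f"estimated local bytes written for k=50: {needed / 1e6:.0f} MB")
print("enough free space here:", has_room(".", needed))  # "." is a placeholder path
```

The linear scaling is a guess; the real requirement depends on how the combiner and spills behave, so the estimate is a lower-effort sanity check rather than a diagnosis.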
>
>Q: What are the Hadoop heap settings you are using for your job?
>I am new to Hadoop and not sure where to find those, but I got this from localhost:50070. Is it right?
>147 files and directories, 60 blocks = 207 total. Heap Size is 31.57 MB / 966.69 MB (3%)
>
>
>p/s: I keep forgetting to include my operating environment, sorry. I run this in a guest operating system (in a VirtualBox virtual machine), assigned 1 CPU core and 1.5GB of memory. The host operating system is OS X 10.6.8 running on an alubook (late 2008 MacBook) with 4GB of memory.
>
>
>    $ cat /etc/*-release
>    DISTRIB_ID=Ubuntu
>    DISTRIB_RELEASE=11.04
>    DISTRIB_CODENAME=natty
>    DISTRIB_DESCRIPTION="Ubuntu 11.04"
>    $ uname -a
>    Linux sensei 2.6.38-10-generic #46-Ubuntu SMP Tue Jun 28 15:05:41 UTC 2011 i686 i686 i386 GNU/Linux
>
>
>Best wishes,
>Jeffrey04
>
>>________________________________
>>From: Jeff Eastman <jeastman@Narus.com>
>>To: "user@mahout.apache.org" <user@mahout.apache.org>; Jeffrey <mycyberpet@yahoo.com>
>>Sent: Thursday, July 21, 2011 11:54 PM
>>Subject: RE: fkmeans or Cluster Dumper not working?
>>
>>Excellent, so this appears to be localized to fuzzyk. Unfortunately, the Apache mail server strips off attachments so you'd need another mechanism (a JIRA?) to upload your data if it is not too large. Some more questions in the interim:
>>
>>- What is the cardinality of your vector data?
>>- Is it sparse or dense?
>>- How many vectors are you trying to cluster?
>>- What is the exact error you see when fkmeans fails with k=10? With k=50?
>>- What are the Hadoop heap settings you are using for your job?
>>
>>-----Original Message-----
>>From: Jeffrey [mailto:mycyberpet@yahoo.com]
>>Sent: Thursday, July 21, 2011 11:21 AM
>>To: user@mahout.apache.org
>>Subject: Re: fkmeans or Cluster Dumper not working?
>>
>>Hi Jeff,
>>
>>Q: Did you change your invocation to specify a different -c directory (e.g. clusters-0)?
>>A: Yes :)
>>
>>Q: Did you add the -cl argument?
>>A: Yes :)
>>
>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 5 --maxIter 10 --m 5
>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>>
>>Q: What is the new CLI invocation for clusterdump?
>>A:
>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-4 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt
>>
>>
>>Q: Did this work for -k 10? What happens with -k 50?
>>A: Works for k=5 (but I don't see the points), but not for k=10. fkmeans fails when k=50, so I can't dump when k=50.
>>
>>Q: Have you tried kmeans?
>>A: Yes (all tested on 0.6-snapshot)
>>
>>k=5: no problem :)
>>k=10: no problem :)
>>k=50: no problem :)
>>
>>p/s: attached with the test data i used (in mvc format), let me know if you guys prefer raw data in arff format
>>
>>Best wishes,
>>Jeffrey04
>>
>>
>>
>>>________________________________
>>>From: Jeff Eastman <jeastman@Narus.com>
>>>To: "user@mahout.apache.org" <user@mahout.apache.org>; Jeffrey <mycyberpet@yahoo.com>
>>>Sent: Thursday, July 21, 2011 9:36 PM
>>>Subject: RE: fkmeans or Cluster Dumper not working?
>>>
>>>You are correct, the wiki for fkmeans did not mention the -cl argument. I've added that just now. I think this is what Frank means in his comment but you do *not* have to write any custom code to get the cluster dumper to do what you want, just use the -cl argument and specify clusteredPoints as the -p input to clusterdump.
>>>
>>>Check out TestClusterDumper.testKmeans and .testFuzzyKmeans. These show how to invoke the clustering and cluster dumper from Java at least.
>>>
>>>Did you change your invocation to specify a different -c directory (e.g. clusters-0)?
>>>Did you add the -cl argument?
>>>What is the new CLI invocation for clusterdump?
>>>Did this work for -k 10? What happens with -k 50?
>>>Have you tried kmeans?
>>>
>>>I can help you better if you will give me answers to my questions
>>>
>>>-----Original Message-----
>>>From: Jeffrey [mailto:mycyberpet@yahoo.com]
>>>Sent: Thursday, July 21, 2011 4:30 AM
>>>To: user@mahout.apache.org
>>>Subject: Re: fkmeans or Cluster Dumper not working?
>>>
>>>Hi again,
>>>
>>>Let me update on what's working and what's not working.
>>>
>>>Works:
>>>fkmeans clustering (10 clusters) - thanks Jeff for the --cl tip
>>>fkmeans clustering (5 clusters)
>>>clusterdump (5 clusters) - so points are not included in the clusterdump and I need to write a program for it?
>>>
>>>Not Working:
>>>fkmeans clustering (50 clusters) - same error
>>>clusterdump (10 clusters) - same error
>>>
>>>
>>>So it seems that to attach points to the cluster dumper output, as the synthetic control example does, I would have to write some code, as pointed out by @Frank_Scholten? https://twitter.com/#!/Frank_Scholten/status/93617269296472064
>>>
>>>Best wishes,
>>>Jeffrey04
>>>
>>>>________________________________
>>>>From: Jeff Eastman <jeastman@Narus.com>
>>>>To: "user@mahout.apache.org" <user@mahout.apache.org>; Jeffrey <mycyberpet@yahoo.com>
>>>>Sent: Wednesday, July 20, 2011 11:53 PM
>>>>Subject: RE: fkmeans or Cluster Dumper not working?
>>>>
>>>>Hi Jeffrey,
>>>>
>>>>It is always difficult to debug remotely, but here are some suggestions:
>>>>- First, you are specifying both an input clusters directory --clusters and --numClusters clusters so the job is sampling 10 points from your input data set and writing them to clusteredPoints as the prior clusters for the first iteration. You should pick a different name for this directory, as the clusteredPoints directory is used by the -cl (--clustering) option (which you did not supply) to write out the clustered (classified) input vectors. When you subsequently supplied clusteredPoints to the clusterdumper it was expecting a different format and that caused the exception you saw. Change your --clusters directory (clusters-0 is good) and add a -cl argument and things should go more smoothly. The -cl option is not the default and so no clustering of the input points is performed without this (Many people get caught by this and perhaps the default should be changed, but clustering can be expensive and so it is not performed without request).
>>>>- If you still have problems, try again with k-means. The similarity to fkmeans is good and it will eliminate fkmeans itself if you see the same problems with k-means
>>>>- I don't see why changing the -k argument from 10 to 50 should cause any problems, unless your vectors are very large and you are getting an OME in the reducer. Since the reducer is calculating centroid vectors for the next iteration these will become more dense and memory will increase substantially.
>>>>- I can't figure out what might be causing your second exception. It is bombing inside of Hadoop file IO and this causes me to suspect command argument problems.
>>>>
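The densification described in the third point can be illustrated with a toy sketch: averaging sparse vectors yields a centroid whose nonzero positions are the union of the inputs' nonzero positions. The dict-based representation and the sample points are illustrative only and have nothing to do with Mahout's own vector classes.

```python
def centroid(sparse_vectors):
    """Mean of sparse vectors given as {index: value} dicts."""
    total = {}
    for vec in sparse_vectors:
        for idx, val in vec.items():
            total[idx] = total.get(idx, 0.0) + val
    n = len(sparse_vectors)
    return {idx: val / n for idx, val in total.items()}

# Three hypothetical sparse points with at most two nonzero entries each.
points = [{0: 1.0, 7: 2.0}, {3: 1.0}, {7: 4.0, 9: 1.0}]
c = centroid(points)
print(len(c), "nonzeros in the centroid;",
      max(len(p) for p in points), "at most per input point")
```

In fuzzy k-means every point contributes to every cluster with some weight, so each centroid can pick up nonzeros from the whole data set, which is why centroid memory can grow well beyond that of any single input vector.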
>>>>Hope this helps,
>>>>Jeff
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Jeffrey [mailto:mycyberpet@yahoo.com]
>>>>Sent: Wednesday, July 20, 2011 2:41 AM
>>>>To: user@mahout.apache.org
>>>>Subject: fkmeans or Cluster Dumper not working?
>>>>
>>>>Hi,
>>>>
>>>>I am trying to generate clusters using the fkmeans command line tool from my test data. Not sure if this is correct, as it only runs one iteration (output from 0.6-snapshot, gotta use some workaround to some weird bug - http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans )
>>>>
>>>>$ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 --numClusters 10 --overwrite --m 5
>>>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>11/07/20 14:05:18 INFO common.AbstractJob: Command line arguments: {--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --emitMostLikely=true, --endPhase=2147483647, --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, --numClusters=10, --output=sensei/clusters, --overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
>>>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters
>>>>11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints
>>>>11/07/20 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>>11/07/20 14:05:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new compressor
>>>>11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new decompressor
>>>>11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to sensei/clusteredPoints/part-randomSeed
>>>>11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
>>>>11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process : 1
>>>>11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021
>>>>11/07/20 14:05:31 INFO mapred.JobClient:  map 0% reduce 0%
>>>>11/07/20 14:05:54 INFO mapred.JobClient:  map 2% reduce 0%
>>>>11/07/20 14:05:57 INFO mapred.JobClient:  map 5% reduce 0%
>>>>11/07/20 14:06:00 INFO mapred.JobClient:  map 6% reduce 0%
>>>>11/07/20 14:06:03 INFO mapred.JobClient:  map 7% reduce 0%
>>>>11/07/20 14:06:07 INFO mapred.JobClient:  map 10% reduce 0%
>>>>11/07/20 14:06:10 INFO mapred.JobClient:  map 13% reduce 0%
>>>>11/07/20 14:06:13 INFO mapred.JobClient:  map 15% reduce 0%
>>>>11/07/20 14:06:16 INFO mapred.JobClient:  map 17% reduce 0%
>>>>11/07/20 14:06:19 INFO mapred.JobClient:  map 19% reduce 0%
>>>>11/07/20 14:06:22 INFO mapred.JobClient:  map 23% reduce 0%
>>>>11/07/20 14:06:25 INFO mapred.JobClient:  map 25% reduce 0%
>>>>11/07/20 14:06:28 INFO mapred.JobClient:  map 27% reduce 0%
>>>>11/07/20 14:06:31 INFO mapred.JobClient:  map 30% reduce 0%
>>>>11/07/20 14:06:34 INFO mapred.JobClient:  map 33% reduce 0%
>>>>11/07/20 14:06:37 INFO mapred.JobClient:  map 36% reduce 0%
>>>>11/07/20 14:06:40 INFO mapred.JobClient:  map 37% reduce 0%
>>>>11/07/20 14:06:43 INFO mapred.JobClient:  map 40% reduce 0%
>>>>11/07/20 14:06:46 INFO mapred.JobClient:  map 43% reduce 0%
>>>>11/07/20 14:06:49 INFO mapred.JobClient:  map 46% reduce 0%
>>>>11/07/20 14:06:52 INFO mapred.JobClient:  map 48% reduce 0%
>>>>11/07/20 14:06:55 INFO mapred.JobClient:  map 50% reduce 0%
>>>>11/07/20 14:06:57 INFO mapred.JobClient:  map 53% reduce 0%
>>>>11/07/20 14:07:00 INFO mapred.JobClient:  map 56% reduce 0%
>>>>11/07/20 14:07:03 INFO mapred.JobClient:  map 58% reduce 0%
>>>>11/07/20 14:07:06 INFO mapred.JobClient:  map 60% reduce 0%
>>>>11/07/20 14:07:09 INFO mapred.JobClient:  map 63% reduce 0%
>>>>11/07/20 14:07:13 INFO mapred.JobClient:  map 65% reduce 0%
>>>>11/07/20 14:07:16 INFO mapred.JobClient:  map 67% reduce 0%
>>>>11/07/20 14:07:19 INFO mapred.JobClient:  map 70% reduce 0%
>>>>11/07/20 14:07:22 INFO mapred.JobClient:  map 73% reduce 0%
>>>>11/07/20 14:07:25 INFO mapred.JobClient:  map 75% reduce 0%
>>>>11/07/20 14:07:28 INFO mapred.JobClient:  map 77% reduce 0%
>>>>11/07/20 14:07:31 INFO mapred.JobClient:  map 80% reduce 0%
>>>>11/07/20 14:07:34 INFO mapred.JobClient:  map 83% reduce 0%
>>>>11/07/20 14:07:37 INFO mapred.JobClient:  map 85% reduce 0%
>>>>11/07/20 14:07:40 INFO mapred.JobClient:  map 87% reduce 0%
>>>>11/07/20 14:07:43 INFO mapred.JobClient:  map 89% reduce 0%
>>>>11/07/20 14:07:46 INFO mapred.JobClient:  map 92% reduce 0%
>>>>11/07/20 14:07:49 INFO mapred.JobClient:  map 95% reduce 0%
>>>>11/07/20 14:07:55 INFO mapred.JobClient:  map 98% reduce 0%
>>>>11/07/20 14:07:59 INFO mapred.JobClient:  map 99% reduce 0%
>>>>11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>>11/07/20 14:08:23 INFO mapred.JobClient:  map 100% reduce 100%
>>>>11/07/20 14:08:31 INFO mapred.JobClient: Job complete: job_201107201152_0021
>>>>11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
>>>>11/07/20 14:08:31 INFO mapred.JobClient:   Job Counters
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Launched reduce tasks=1
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=149314
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Launched map tasks=1
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Data-local map tasks=1
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=15618
>>>>11/07/20 14:08:31 INFO mapred.JobClient:   File Output Format Counters
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Written=2247222
>>>>11/07/20 14:08:31 INFO mapred.JobClient:   Clustering
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Converged Clusters=10
>>>>11/07/20 14:08:31 INFO mapred.JobClient:   FileSystemCounters
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_READ=130281382
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_READ=254494
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=132572666
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2247222
>>>>11/07/20 14:08:31 INFO mapred.JobClient:   File Input Format Counters
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Read=247443
>>>>11/07/20 14:08:31 INFO mapred.JobClient:   Map-Reduce Framework
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Reduce input groups=10
>>>>11/07/20 14:08:31 INFO mapred.JobClient:     Map output materialized bytes=2246233
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Combine output records=330
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Map input records=1113
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce shuffle bytes=2246233
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce output records=10
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Spilled Records=590
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Map output bytes=2499995001
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Combine input records=11450
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Map output records=11130
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
>>>>11/07/20 14:08:32 INFO mapred.JobClient:     Reduce input records=10
>>>>11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096 ms
>>>>
>>>>If I increase the --numClusters argument (e.g. to 50), it throws an exception after
>>>>11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>>
>>>>and would retry again (also reproducible using 0.6-snapshot)
>>>>
>>>>...
>>>>11/07/20 14:22:25 INFO mapred.JobClient:  map 100% reduce 0%
>>>>11/07/20 14:22:30 INFO mapred.JobClient: Task Id : attempt_201107201152_0022_m_000000_0, Status : FAILED
>>>>org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>>>>        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>>>        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>>>        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>>>        at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>>>        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>>>        at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>>>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>>>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>>>        at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>>>        at java.security.AccessController.doPrivileged(Native Method)
>>>>        at javax.security.auth.Subject.doAs(Subject.java:416)
>>>>        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>        at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>>>
>>>>11/07/20 14:22:32 INFO mapred.JobClient:  map 0% reduce 0%
>>>>...
>>>>
>>>>Then I ran cluster dumper to dump information about the clusters. This command works if I only care about the cluster centroids (both the 0.5 release and 0.6-snapshot):
>>>>
>>>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt
>>>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>>11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms
>>>>
>>>>But if I want to see the degree of membership of each point, I get another exception (yes, reproducible with both the 0.5 release and 0.6-snapshot):
>>>>
>>>>$ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt --pointsDir sensei/clusteredPoints
>>>>Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>>11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>>11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>>>11/07/20 14:35:10 INFO compress.CodecPool: Got brand-new decompressor
>>>>Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>>>>        at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
>>>>        at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>>>        at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>>>        at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>        at java.lang.reflect.Method.invoke(Method.java:616)
>>>>        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>        at java.lang.reflect.Method.invoke(Method.java:616)
>>>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>
>>>>Erm, would writing a short program to call the API be a better choice here (btw, I can't seem to find the latest API doc)? Or did I do something wrong (yes, Java is not my main language, and I am very new to Mahout.. and h)?
>>>>
>>>>The data is converted from an ARFF file with about 1000 rows (resources) and 14k columns (tags), and it is just a subset of my data. (I actually made a mistake, so it is now generating resource clusters instead of tag clusters, but I am just doing this as a proof of concept to see whether Mahout is good enough for the task.)
>>>>
>>>>Best wishes,
>>>>Jeffrey04
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>
>