mahout-user mailing list archives

From Jeffrey <mycyber...@yahoo.com>
Subject Re: fkmeans or Cluster Dumper not working?
Date Wed, 27 Jul 2011 07:18:19 GMT
Erm, is there any workaround to the problem?


----- Original Message -----
> From: Jeff Eastman <jeastman@Narus.com>
> To: "user@mahout.apache.org" <user@mahout.apache.org>
> Cc: 
> Sent: Tuesday, July 26, 2011 1:12 PM
> Subject: RE: fkmeans or Cluster Dumper not working?
> 
> Also makes sense that fuzzyk centroids would be completely dense, since every 
> point is a member of every cluster. My reducer heaps are 4G.
> 
> -----Original Message-----
> From: Jeff Eastman [mailto:jeastman@Narus.com]
> Sent: Monday, July 25, 2011 2:32 PM
> To: user@mahout.apache.org; Jeffrey
> Subject: RE: fkmeans or Cluster Dumper not working?
> 
> I'm able to run fuzzyk on your data set with k=10 and k=50 without problems.
> I also ran it fine with k=100 just to push it a bit harder. Runs took longer
> as k increased, as expected (39s, 2m50s, 5m57s), as did the clustering (11s,
> 45s, 1m11s). The cluster dumper is throwing an OOME with your data points,
> and probably also would with the larger cluster volumes, suggesting it needs
> a larger -Xmx value since it runs locally and is not influenced by the
> cluster VM parameters.
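> 
> If it helps, here is a minimal sketch of that workaround, assuming the
> bin/mahout launcher honors the MAHOUT_HEAPSIZE environment variable (worth
> verifying with a quick 'grep -n HEAP bin/mahout'; the 2048 MB value is just
> a starting point to try):
> 
>     # give the locally-run clusterdump driver JVM a bigger heap (value in MB)
>     $ export MAHOUT_HEAPSIZE=2048
>     $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt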
> 
> I will try some more and keep you updated.
> 
> 
> -----Original Message-----
> From: Jeffrey [mailto:mycyberpet@yahoo.com]
> Sent: Sunday, July 24, 2011 12:51 AM
> To: user@mahout.apache.org
> Subject: Re: fkmeans or Cluster Dumper not working?
> 
> Erm, is there any update? Is the problem reproducible?
> 
> Best wishes,
> Jeffrey04
> 
> 
> 
>> ________________________________
>> From: Jeffrey <mycyberpet@yahoo.com>
>> To: Jeff Eastman <jeastman@Narus.com>; "user@mahout.apache.org" <user@mahout.apache.org>
>> Sent: Friday, July 22, 2011 12:40 AM
>> Subject: Re: fkmeans or Cluster Dumper not working?
>> 
>> 
>> Hi Jeff,
>> 
>> 
>> lol, this is probably my last reply before I fall asleep (GMT+8 here).
>> 
>> 
>> First things first: the data file is here: http://coolsilon.com/image-tag.mvc
>> 
>> 
>> Q: What is the cardinality of your vector data?
>> A: About 1000+ rows (resources) x 14,000+ columns (tags).
>> Q: Is it sparse or dense?
>> A: Sparse (assuming sparse = each vector contains mostly 0s).
>> Q: How many vectors are you trying to cluster?
>> A: All of them (1000+ rows).
>> Q: What is the exact error you see when fkmeans fails with k=10? With k=50?
>> A: I think I posted the exception for k=50, but will post them again here.
>> 
>> 
>> k=10: fkmeans actually works, but cluster dumper returns an exception; however, if I take out --pointsDir, then it works (output looks OK, but without all the points)
>> 
>> 
>>     $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>     ...
>>     $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt
>>     Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>     HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>     MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>     11/07/22 00:14:50 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --pointsDir=sensei/clusters/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>     Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>>             at java.lang.Object.clone(Native Method)
>>             at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:44)
>>             at org.apache.mahout.math.DenseVector.<init>(DenseVector.java:39)
>>             at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:94)
>>             at org.apache.mahout.clustering.WeightedVectorWritable.readFields(WeightedVectorWritable.java:55)
>>             at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
>>             at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
>>             at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
>>             at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
>>             at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
>>             at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
>>             at com.google.common.collect.Iterators$5.hasNext(Iterators.java:525)
>>             at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
>>             at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:255)
>>             at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>             at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>             at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>             at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>             at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>             at java.lang.reflect.Method.invoke(Method.java:616)
>>             at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>             at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>             at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>             at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>             at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>             at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>             at java.lang.reflect.Method.invoke(Method.java:616)
>>             at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>     $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt
>>     Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>     HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>     MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>     11/07/22 00:19:04 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>     11/07/22 00:19:13 INFO driver.MahoutDriver: Program took 9504 ms
>> 
>> 
>> k=50: fkmeans shows an exception after map 100% reduce 0%, and would retry (map 0% reduce 0%) after the exception
>> 
>> 
>>     $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>>     Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>     HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>     MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>     11/07/22 00:21:07 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=sensei/clusters/clusters-0, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --emitMostLikely=false, --endPhase=2147483647, --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, --numClusters=50, --output=sensei/clusters, --overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
>>     11/07/22 00:21:09 INFO common.HadoopUtil: Deleting sensei/clusters
>>     11/07/22 00:21:09 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>     11/07/22 00:21:09 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>     11/07/22 00:21:09 INFO compress.CodecPool: Got brand-new compressor
>>     11/07/22 00:21:10 INFO compress.CodecPool: Got brand-new decompressor
>>     11/07/22 00:21:21 INFO kmeans.RandomSeedGenerator: Wrote 50 vectors to sensei/clusters/clusters-0/part-randomSeed
>>     11/07/22 00:21:24 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
>>     11/07/22 00:21:25 INFO input.FileInputFormat: Total input paths to process : 1
>>     11/07/22 00:21:26 INFO mapred.JobClient: Running job: job_201107211512_0029
>>     11/07/22 00:21:27 INFO mapred.JobClient:  map 0% reduce 0%
>>     11/07/22 00:22:08 INFO mapred.JobClient:  map 1% reduce 0%
>>     11/07/22 00:22:20 INFO mapred.JobClient:  map 2% reduce 0%
>>     11/07/22 00:22:33 INFO mapred.JobClient:  map 3% reduce 0%
>>     11/07/22 00:22:42 INFO mapred.JobClient:  map 4% reduce 0%
>>     11/07/22 00:22:50 INFO mapred.JobClient:  map 5% reduce 0%
>>     11/07/22 00:23:00 INFO mapred.JobClient:  map 6% reduce 0%
>>     11/07/22 00:23:09 INFO mapred.JobClient:  map 7% reduce 0%
>>     11/07/22 00:23:18 INFO mapred.JobClient:  map 8% reduce 0%
>>     11/07/22 00:23:27 INFO mapred.JobClient:  map 9% reduce 0%
>>     11/07/22 00:23:33 INFO mapred.JobClient:  map 10% reduce 0%
>>     11/07/22 00:23:42 INFO mapred.JobClient:  map 11% reduce 0%
>>     11/07/22 00:23:45 INFO mapred.JobClient:  map 12% reduce 0%
>>     11/07/22 00:23:54 INFO mapred.JobClient:  map 13% reduce 0%
>>     11/07/22 00:24:03 INFO mapred.JobClient:  map 14% reduce 0%
>>     11/07/22 00:24:09 INFO mapred.JobClient:  map 15% reduce 0%
>>     11/07/22 00:24:15 INFO mapred.JobClient:  map 16% reduce 0%
>>     11/07/22 00:24:24 INFO mapred.JobClient:  map 17% reduce 0%
>>     11/07/22 00:24:30 INFO mapred.JobClient:  map 18% reduce 0%
>>     11/07/22 00:24:42 INFO mapred.JobClient:  map 19% reduce 0%
>>     11/07/22 00:24:51 INFO mapred.JobClient:  map 20% reduce 0%
>>     11/07/22 00:24:57 INFO mapred.JobClient:  map 21% reduce 0%
>>     11/07/22 00:25:06 INFO mapred.JobClient:  map 22% reduce 0%
>>     11/07/22 00:25:09 INFO mapred.JobClient:  map 23% reduce 0%
>>     11/07/22 00:25:19 INFO mapred.JobClient:  map 24% reduce 0%
>>     11/07/22 00:25:25 INFO mapred.JobClient:  map 25% reduce 0%
>>     11/07/22 00:25:31 INFO mapred.JobClient:  map 26% reduce 0%
>>     11/07/22 00:25:37 INFO mapred.JobClient:  map 27% reduce 0%
>>     11/07/22 00:25:43 INFO mapred.JobClient:  map 28% reduce 0%
>>     11/07/22 00:25:51 INFO mapred.JobClient:  map 29% reduce 0%
>>     11/07/22 00:25:58 INFO mapred.JobClient:  map 30% reduce 0%
>>     11/07/22 00:26:04 INFO mapred.JobClient:  map 31% reduce 0%
>>     11/07/22 00:26:10 INFO mapred.JobClient:  map 32% reduce 0%
>>     11/07/22 00:26:19 INFO mapred.JobClient:  map 33% reduce 0%
>>     11/07/22 00:26:25 INFO mapred.JobClient:  map 34% reduce 0%
>>     11/07/22 00:26:34 INFO mapred.JobClient:  map 35% reduce 0%
>>     11/07/22 00:26:40 INFO mapred.JobClient:  map 36% reduce 0%
>>     11/07/22 00:26:49 INFO mapred.JobClient:  map 37% reduce 0%
>>     11/07/22 00:26:55 INFO mapred.JobClient:  map 38% reduce 0%
>>     11/07/22 00:27:04 INFO mapred.JobClient:  map 39% reduce 0%
>>     11/07/22 00:27:14 INFO mapred.JobClient:  map 40% reduce 0%
>>     11/07/22 00:27:23 INFO mapred.JobClient:  map 41% reduce 0%
>>     11/07/22 00:27:28 INFO mapred.JobClient:  map 42% reduce 0%
>>     11/07/22 00:27:34 INFO mapred.JobClient:  map 43% reduce 0%
>>     11/07/22 00:27:40 INFO mapred.JobClient:  map 44% reduce 0%
>>     11/07/22 00:27:49 INFO mapred.JobClient:  map 45% reduce 0%
>>     11/07/22 00:27:56 INFO mapred.JobClient:  map 46% reduce 0%
>>     11/07/22 00:28:05 INFO mapred.JobClient:  map 47% reduce 0%
>>     11/07/22 00:28:11 INFO mapred.JobClient:  map 48% reduce 0%
>>     11/07/22 00:28:20 INFO mapred.JobClient:  map 49% reduce 0%
>>     11/07/22 00:28:26 INFO mapred.JobClient:  map 50% reduce 0%
>>     11/07/22 00:28:35 INFO mapred.JobClient:  map 51% reduce 0%
>>     11/07/22 00:28:41 INFO mapred.JobClient:  map 52% reduce 0%
>>     11/07/22 00:28:47 INFO mapred.JobClient:  map 53% reduce 0%
>>     11/07/22 00:28:53 INFO mapred.JobClient:  map 54% reduce 0%
>>     11/07/22 00:29:02 INFO mapred.JobClient:  map 55% reduce 0%
>>     11/07/22 00:29:08 INFO mapred.JobClient:  map 56% reduce 0%
>>     11/07/22 00:29:17 INFO mapred.JobClient:  map 57% reduce 0%
>>     11/07/22 00:29:26 INFO mapred.JobClient:  map 58% reduce 0%
>>     11/07/22 00:29:32 INFO mapred.JobClient:  map 59% reduce 0%
>>     11/07/22 00:29:41 INFO mapred.JobClient:  map 60% reduce 0%
>>     11/07/22 00:29:50 INFO mapred.JobClient:  map 61% reduce 0%
>>     11/07/22 00:29:53 INFO mapred.JobClient:  map 62% reduce 0%
>>     11/07/22 00:29:59 INFO mapred.JobClient:  map 63% reduce 0%
>>     11/07/22 00:30:09 INFO mapred.JobClient:  map 64% reduce 0%
>>     11/07/22 00:30:15 INFO mapred.JobClient:  map 65% reduce 0%
>>     11/07/22 00:30:23 INFO mapred.JobClient:  map 66% reduce 0%
>>     11/07/22 00:30:35 INFO mapred.JobClient:  map 67% reduce 0%
>>     11/07/22 00:30:41 INFO mapred.JobClient:  map 68% reduce 0%
>>     11/07/22 00:30:50 INFO mapred.JobClient:  map 69% reduce 0%
>>     11/07/22 00:30:56 INFO mapred.JobClient:  map 70% reduce 0%
>>     11/07/22 00:31:05 INFO mapred.JobClient:  map 71% reduce 0%
>>     11/07/22 00:31:15 INFO mapred.JobClient:  map 72% reduce 0%
>>     11/07/22 00:31:24 INFO mapred.JobClient:  map 73% reduce 0%
>>     11/07/22 00:31:30 INFO mapred.JobClient:  map 74% reduce 0%
>>     11/07/22 00:31:39 INFO mapred.JobClient:  map 75% reduce 0%
>>     11/07/22 00:31:42 INFO mapred.JobClient:  map 76% reduce 0%
>>     11/07/22 00:31:50 INFO mapred.JobClient:  map 77% reduce 0%
>>     11/07/22 00:31:59 INFO mapred.JobClient:  map 78% reduce 0%
>>     11/07/22 00:32:11 INFO mapred.JobClient:  map 79% reduce 0%
>>     11/07/22 00:32:28 INFO mapred.JobClient:  map 80% reduce 0%
>>     11/07/22 00:32:37 INFO mapred.JobClient:  map 81% reduce 0%
>>     11/07/22 00:32:40 INFO mapred.JobClient:  map 82% reduce 0%
>>     11/07/22 00:32:49 INFO mapred.JobClient:  map 83% reduce 0%
>>     11/07/22 00:32:58 INFO mapred.JobClient:  map 84% reduce 0%
>>     11/07/22 00:33:04 INFO mapred.JobClient:  map 85% reduce 0%
>>     11/07/22 00:33:13 INFO mapred.JobClient:  map 86% reduce 0%
>>     11/07/22 00:33:19 INFO mapred.JobClient:  map 87% reduce 0%
>>     11/07/22 00:33:32 INFO mapred.JobClient:  map 88% reduce 0%
>>     11/07/22 00:33:38 INFO mapred.JobClient:  map 89% reduce 0%
>>     11/07/22 00:33:47 INFO mapred.JobClient:  map 90% reduce 0%
>>     11/07/22 00:33:52 INFO mapred.JobClient:  map 91% reduce 0%
>>     11/07/22 00:34:01 INFO mapred.JobClient:  map 92% reduce 0%
>>     11/07/22 00:34:10 INFO mapred.JobClient:  map 93% reduce 0%
>>     11/07/22 00:34:13 INFO mapred.JobClient:  map 94% reduce 0%
>>     11/07/22 00:34:25 INFO mapred.JobClient:  map 95% reduce 0%
>>     11/07/22 00:34:31 INFO mapred.JobClient:  map 96% reduce 0%
>>     11/07/22 00:34:40 INFO mapred.JobClient:  map 97% reduce 0%
>>     11/07/22 00:34:47 INFO mapred.JobClient:  map 98% reduce 0%
>>     11/07/22 00:34:56 INFO mapred.JobClient:  map 99% reduce 0%
>>     11/07/22 00:35:02 INFO mapred.JobClient:  map 100% reduce 0%
>>     11/07/22 00:35:07 INFO mapred.JobClient: Task Id : attempt_201107211512_0029_m_000000_0, Status : FAILED
>>     org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>>             at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>             at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>             at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>             at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>             at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>             at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>             at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>             at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>             at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>             at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>             at java.security.AccessController.doPrivileged(Native Method)
>>             at javax.security.auth.Subject.doAs(Subject.java:416)
>>             at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>             at org.apache.hadoop.mapred.Child.main(Child.java:253)
>> 
>> 
>>     11/07/22 00:35:09 INFO mapred.JobClient:  map 0% reduce 0%
>>     ...
>> 
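>> A side note on this DiskErrorException: LocalDirAllocator throws "Could not find any valid local directory" when none of the configured local directories has enough free space for the merged map output, and fuzzy k-means intermediate output grows with k (the k=10 counters further down the thread already show Map output bytes=2499995001, roughly 2.5 GB), so a nearly full VM disk is a plausible cause. A quick diagnostic sketch, assuming the stock Hadoop 0.20 conf layout where intermediate output lands under hadoop.tmp.dir (default /tmp/hadoop-<user>):
>> 
>>     # free space on the partition holding the task-local scratch dirs
>>     $ df -h /tmp
>>     # see which directories are actually configured
>>     $ grep -A1 -E 'mapred.local.dir|hadoop.tmp.dir' $HADOOP_CONF_DIR/*-site.xml
>> 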
>> 
>> Q: What are the Hadoop heap settings you are using for your job?
>> A: I am new to Hadoop and not sure where to find those, but I got this from localhost:50070; is it right?
>> 147 files and directories, 60 blocks = 207 total. Heap Size is 31.57 MB / 966.69 MB (3%)
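>> 
>> (For reference, on Hadoop 0.20.x the heap settings live in the conf files rather than that web UI; a sketch of where to look, assuming the stock layout and config names:)
>> 
>>     # daemon heap size in MB (NameNode, JobTracker, ...)
>>     $ grep HADOOP_HEAPSIZE $HADOOP_CONF_DIR/hadoop-env.sh
>>     # per-task child JVM options (the 0.20 default is -Xmx200m)
>>     $ grep -A1 mapred.child.java.opts $HADOOP_CONF_DIR/mapred-site.xml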
>> 
>> 
>> p/s: I keep forgetting to include my operating environment, sorry. I basically run this in a guest operating system (in a VirtualBox virtual machine) assigned 1 CPU core and 1.5 GB of memory. The host operating system is OS X 10.6.8 running on an alubook (MacBook, late-2008 model) with 4 GB of memory.
>> 
>> 
>>     $ cat /etc/*-release
>>     DISTRIB_ID=Ubuntu
>>     DISTRIB_RELEASE=11.04
>>     DISTRIB_CODENAME=natty
>>     DISTRIB_DESCRIPTION="Ubuntu 11.04"
>>     $ uname -a
>>     Linux sensei 2.6.38-10-generic #46-Ubuntu SMP Tue Jun 28 15:05:41 UTC 2011 i686 i686 i386 GNU/Linux
>> 
>> 
>> Best wishes,
>> Jeffrey04
>> 
>>> ________________________________
>>> From: Jeff Eastman <jeastman@Narus.com>
>>> To: "user@mahout.apache.org" <user@mahout.apache.org>; 
> Jeffrey <mycyberpet@yahoo.com>
>>> Sent: Thursday, July 21, 2011 11:54 PM
>>> Subject: RE: fkmeans or Cluster Dumper not working?
>>> 
>>> Excellent, so this appears to be localized to fuzzyk. Unfortunately, the Apache mail server strips off attachments, so you'd need another mechanism (a JIRA?) to upload your data if it is not too large. Some more questions in the interim:
>>> 
>>> - What is the cardinality of your vector data?
>>> - Is it sparse or dense?
>>> - How many vectors are you trying to cluster?
>>> - What is the exact error you see when fkmeans fails with k=10? With k=50?
>>> - What are the Hadoop heap settings you are using for your job?
>>> 
>>> -----Original Message-----
>>> From: Jeffrey [mailto:mycyberpet@yahoo.com]
>>> Sent: Thursday, July 21, 2011 11:21 AM
>>> To: user@mahout.apache.org
>>> Subject: Re: fkmeans or Cluster Dumper not working?
>>> 
>>> Hi Jeff,
>>> 
>>> Q: Did you change your invocation to specify a different -c directory (e.g. clusters-0)?
>>> A: Yes :)
>>> 
>>> Q: Did you add the -cl argument?
>>> A: Yes :)
>>> 
>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 5 --maxIter 10 --m 5
>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 50 --maxIter 10 --m 5
>>> 
>>> Q: What is the new CLI invocation for clusterdump?
>>> A:
>>> $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-4 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt
>>> 
>>> 
>>> Q: Did this work for -k 10? What happens with -k 50?
>>> A: Works for k=5 (but I don't see the points), but not k=10; fkmeans fails when k=50, so I can't dump when k=50.
>>> 
>>> Q: Have you tried kmeans?
>>> A: Yes (all tested on 0.6-snapshot)
>>> 
>>> k=5: no problem :)
>>> k=10: no problem :)
>>> k=50: no problem :)
>>> 
>>> p/s: attached is the test data I used (in mvc format); let me know if you guys prefer the raw data in arff format
>>> 
>>> Best wishes,
>>> Jeffrey04
>>> 
>>> 
>>> 
>>>> ________________________________
>>>> From: Jeff Eastman <jeastman@Narus.com>
>>>> To: "user@mahout.apache.org" 
> <user@mahout.apache.org>; Jeffrey <mycyberpet@yahoo.com>
>>>> Sent: Thursday, July 21, 2011 9:36 PM
>>>> Subject: RE: fkmeans or Cluster Dumper not working?
>>>> 
>>>> You are correct, the wiki for fkmeans did not mention the -cl argument. I've added that just now. I think this is what Frank means in his comment, but you do *not* have to write any custom code to get the cluster dumper to do what you want: just use the -cl argument and specify clusteredPoints as the -p input to clusterdump.
>>>> 
>>>> Check out TestClusterDumper.testKmeans and .testFuzzyKmeans. These show how to invoke the clustering and cluster dumper from Java, at least.
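>>>> 
>>>> (If you want to run those tests directly from a source checkout, something like the following should work; a sketch: -Dtest is the standard Maven Surefire switch, though the module containing TestClusterDumper may differ between trunk revisions.)
>>>> 
>>>>     $ mvn test -Dtest=TestClusterDumper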
>>>> 
>>>> Did you change your invocation to specify a different -c directory (e.g. clusters-0)?
>>>> Did you add the -cl argument?
>>>> What is the new CLI invocation for clusterdump?
>>>> Did this work for -k 10? What happens with -k 50?
>>>> Have you tried kmeans?
>>>> 
>>>> I can help you better if you will give me answers to my questions.
>>>> 
>>>> -----Original Message-----
>>>> From: Jeffrey [mailto:mycyberpet@yahoo.com]
>>>> Sent: Thursday, July 21, 2011 4:30 AM
>>>> To: user@mahout.apache.org
>>>> Subject: Re: fkmeans or Cluster Dumper not working?
>>>> 
>>>> Hi again,
>>>> 
>>>> Let me update on what's working and what's not working.
>>>> 
>>>> Works:
>>>> fkmeans clustering (10 clusters) - thanks Jeff for the --cl tip
>>>> fkmeans clustering (5 clusters)
>>>> clusterdump (5 clusters) - so points are not included in the clusterdump and I need to write a program for it?
>>>> 
>>>> Not Working:
>>>> fkmeans clustering (50 clusters) - same error
>>>> clusterdump (10 clusters) - same error
>>>> 
>>>> 
>>>> So it seems that to attach points to the cluster dumper output like the synthetic control example does, I would have to write some code, as pointed out by @Frank_Scholten? https://twitter.com/#!/Frank_Scholten/status/93617269296472064
>>>> 
>>>> Best wishes,
>>>> Jeffrey04
>>>> 
>>>>> ________________________________
>>>>> From: Jeff Eastman <jeastman@Narus.com>
>>>>> To: "user@mahout.apache.org" 
> <user@mahout.apache.org>; Jeffrey <mycyberpet@yahoo.com>
>>>>> Sent: Wednesday, July 20, 2011 11:53 PM
>>>>> Subject: RE: fkmeans or Cluster Dumper not working?
>>>>> 
>>>>> Hi Jeffrey,
>>>>> 
>>>>> It is always difficult to debug remotely, but here are some suggestions:
>>>>> - First, you are specifying both an input clusters directory (--clusters) and --numClusters, so the job is sampling 10 points from your input data set and writing them to clusteredPoints as the prior clusters for the first iteration. You should pick a different name for this directory, as the clusteredPoints directory is used by the -cl (--clustering) option (which you did not supply) to write out the clustered (classified) input vectors. When you subsequently supplied clusteredPoints to the clusterdumper, it was expecting a different format, and that caused the exception you saw. Change your --clusters directory (clusters-0 is good) and add a -cl argument, and things should go more smoothly (see the sketch after these suggestions). The -cl option is not the default, so no clustering of the input points is performed without it. (Many people get caught by this, and perhaps the default should be changed, but clustering can be expensive, so it is not performed without request.)
>>>>> - If you still have problems, try again with k-means. Its similarity to fkmeans is good, and it will eliminate fkmeans itself as the cause if you see the same problems with k-means.
>>>>> - I don't see why changing the -k argument from 10 to 50 should cause any problems, unless your vectors are very large and you are getting an OOME in the reducer. Since the reducer is calculating centroid vectors for the next iteration, these will become more dense and memory use will increase substantially.
>>>>> - I can't figure out what might be causing your second exception. It is bombing inside of Hadoop file IO, and that causes me to suspect command argument problems.
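>>>>> 
>>>>> Concretely, the corrected sequence would look something like this (a sketch reusing your paths from this thread; clusters-1 stands in for whatever final-iteration directory your run actually produces):
>>>>> 
>>>>>     $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusters/clusters-0 --clustering --overwrite --emitMostLikely false --numClusters 10 --maxIter 10 --m 5
>>>>>     $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --pointsDir sensei/clusters/clusteredPoints --output image-tag-clusters.txt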
>>>>> 
>>>>> Hope this helps,
>>>>> Jeff
>>>>> 
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Jeffrey [mailto:mycyberpet@yahoo.com]
>>>>> Sent: Wednesday, July 20, 2011 2:41 AM
>>>>> To: user@mahout.apache.org
>>>>> Subject: fkmeans or Cluster Dumper not working?
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I am trying to generate clusters using the fkmeans command line tool from my test data. Not sure if this is correct, as it only runs one iteration (output from 0.6-snapshot; gotta use a workaround for a weird bug - http://search.lucidimagination.com/search/document/d95ff0c29ac4a8a7/bug_in_fkmeans )
>>>>> 
>>>>> $ bin/mahout fkmeans --input sensei/image-tag.arff.mvc --output sensei/clusters --clusters sensei/clusteredPoints --maxIter 10 --numClusters 10 --overwrite --m 5
>>>>> Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>> MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>> 11/07/20 14:05:18 INFO common.AbstractJob: Command line arguments: {--clusters=sensei/clusteredPoints, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --emitMostLikely=true, --endPhase=2147483647, --input=sensei/image-tag.arff.mvc, --m=5, --maxIter=10, --method=mapreduce, --numClusters=10, --output=sensei/clusters, --overwrite=null, --startPhase=0, --tempDir=temp, --threshold=0}
>>>>> 11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusters
>>>>> 11/07/20 14:05:20 INFO common.HadoopUtil: Deleting sensei/clusteredPoints
>>>>> 11/07/20 14:05:20 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>>> 11/07/20 14:05:20 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>>>> 11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new compressor
>>>>> 11/07/20 14:05:20 INFO compress.CodecPool: Got brand-new decompressor
>>>>> 11/07/20 14:05:29 INFO kmeans.RandomSeedGenerator: Wrote 10 vectors to sensei/clusteredPoints/part-randomSeed
>>>>> 11/07/20 14:05:29 INFO fuzzykmeans.FuzzyKMeansDriver: Fuzzy K-Means Iteration 1
>>>>> 11/07/20 14:05:30 INFO input.FileInputFormat: Total input paths to process : 1
>>>>> 11/07/20 14:05:30 INFO mapred.JobClient: Running job: job_201107201152_0021
>>>>> 11/07/20 14:05:31 INFO mapred.JobClient:  map 0% reduce 0%
>>>>> 11/07/20 14:05:54 INFO mapred.JobClient:  map 2% reduce 0%
>>>>> 11/07/20 14:05:57 INFO mapred.JobClient:  map 5% reduce 0%
>>>>> 11/07/20 14:06:00 INFO mapred.JobClient:  map 6% reduce 0%
>>>>> 11/07/20 14:06:03 INFO mapred.JobClient:  map 7% reduce 0%
>>>>> 11/07/20 14:06:07 INFO mapred.JobClient:  map 10% reduce 0%
>>>>> 11/07/20 14:06:10 INFO mapred.JobClient:  map 13% reduce 0%
>>>>> 11/07/20 14:06:13 INFO mapred.JobClient:  map 15% reduce 0%
>>>>> 11/07/20 14:06:16 INFO mapred.JobClient:  map 17% reduce 0%
>>>>> 11/07/20 14:06:19 INFO mapred.JobClient:  map 19% reduce 0%
>>>>> 11/07/20 14:06:22 INFO mapred.JobClient:  map 23% reduce 0%
>>>>> 11/07/20 14:06:25 INFO mapred.JobClient:  map 25% reduce 0%
>>>>> 11/07/20 14:06:28 INFO mapred.JobClient:  map 27% reduce 0%
>>>>> 11/07/20 14:06:31 INFO mapred.JobClient:  map 30% reduce 0%
>>>>> 11/07/20 14:06:34 INFO mapred.JobClient:  map 33% reduce 0%
>>>>> 11/07/20 14:06:37 INFO mapred.JobClient:  map 36% reduce 0%
>>>>> 11/07/20 14:06:40 INFO mapred.JobClient:  map 37% reduce 0%
>>>>> 11/07/20 14:06:43 INFO mapred.JobClient:  map 40% reduce 0%
>>>>> 11/07/20 14:06:46 INFO mapred.JobClient:  map 43% reduce 0%
>>>>> 11/07/20 14:06:49 INFO mapred.JobClient:  map 46% reduce 0%
>>>>> 11/07/20 14:06:52 INFO mapred.JobClient:  map 48% reduce 0%
>>>>> 11/07/20 14:06:55 INFO mapred.JobClient:  map 50% reduce 0%
>>>>> 11/07/20 14:06:57 INFO mapred.JobClient:  map 53% reduce 0%
>>>>> 11/07/20 14:07:00 INFO mapred.JobClient:  map 56% reduce 0%
>>>>> 11/07/20 14:07:03 INFO mapred.JobClient:  map 58% reduce 0%
>>>>> 11/07/20 14:07:06 INFO mapred.JobClient:  map 60% reduce 0%
>>>>> 11/07/20 14:07:09 INFO mapred.JobClient:  map 63% reduce 0%
>>>>> 11/07/20 14:07:13 INFO mapred.JobClient:  map 65% reduce 0%
>>>>> 11/07/20 14:07:16 INFO mapred.JobClient:  map 67% reduce 0%
>>>>> 11/07/20 14:07:19 INFO mapred.JobClient:  map 70% reduce 0%
>>>>> 11/07/20 14:07:22 INFO mapred.JobClient:  map 73% reduce 0%
>>>>> 11/07/20 14:07:25 INFO mapred.JobClient:  map 75% reduce 0%
>>>>> 11/07/20 14:07:28 INFO mapred.JobClient:  map 77% reduce 0%
>>>>> 11/07/20 14:07:31 INFO mapred.JobClient:  map 80% reduce 0%
>>>>> 11/07/20 14:07:34 INFO mapred.JobClient:  map 83% reduce 0%
>>>>> 11/07/20 14:07:37 INFO mapred.JobClient:  map 85% reduce 0%
>>>>> 11/07/20 14:07:40 INFO mapred.JobClient:  map 87% reduce 0%
>>>>> 11/07/20 14:07:43 INFO mapred.JobClient:  map 89% reduce 0%
>>>>> 11/07/20 14:07:46 INFO mapred.JobClient:  map 92% reduce 0%
>>>>> 11/07/20 14:07:49 INFO mapred.JobClient:  map 95% reduce 0%
>>>>> 11/07/20 14:07:55 INFO mapred.JobClient:  map 98% reduce 0%
>>>>> 11/07/20 14:07:59 INFO mapred.JobClient:  map 99% reduce 0%
>>>>> 11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>>> 11/07/20 14:08:23 INFO mapred.JobClient:  map 100% reduce 100%
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Job complete: job_201107201152_0021
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient: Counters: 26
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:   Job Counters
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Launched reduce tasks=1
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=149314
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Launched map tasks=1
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Data-local map tasks=1
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=15618
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:   File Output Format Counters
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Written=2247222
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:   Clustering
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Converged Clusters=10
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:   FileSystemCounters
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_READ=130281382
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_READ=254494
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=132572666
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2247222
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:   File Input Format Counters
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Bytes Read=247443
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:   Map-Reduce Framework
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Reduce input groups=10
>>>>> 11/07/20 14:08:31 INFO mapred.JobClient:     Map output materialized bytes=2246233
>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Combine output records=330
>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Map input records=1113
>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Reduce shuffle bytes=2246233
>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Reduce output records=10
>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Spilled Records=590
>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Map output bytes=2499995001
>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Combine input records=11450
>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Map output records=11130
>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     SPLIT_RAW_BYTES=127
>>>>> 11/07/20 14:08:32 INFO mapred.JobClient:     Reduce input records=10
>>>>> 11/07/20 14:08:32 INFO driver.MahoutDriver: Program took 194096 ms
>>>>> 
>>>>> If I increase the --numClusters argument (e.g. 50), then it will return an exception after
>>>>> 11/07/20 14:08:02 INFO mapred.JobClient:  map 100% reduce 0%
>>>>> 
>>>>> and would retry again (also reproducible using 0.6-snapshot)
>>>>> 
>>>>> ...
>>>>> 11/07/20 14:22:25 INFO mapred.JobClient:  map 100% reduce 0%
>>>>> 11/07/20 14:22:30 INFO mapred.JobClient: Task Id : attempt_201107201152_0022_m_000000_0, Status : FAILED
>>>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out
>>>>>         at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:381)
>>>>>         at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146)
>>>>>         at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127)
>>>>>         at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69)
>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1639)
>>>>>         at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1322)
>>>>>         at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:698)
>>>>>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>>>>>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
>>>>>         at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>         at javax.security.auth.Subject.doAs(Subject.java:416)
>>>>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
>>>>>         at org.apache.hadoop.mapred.Child.main(Child.java:253)
>>>>> 
>>>>> 11/07/20 14:22:32 INFO mapred.JobClient:  map 0% reduce 0%
>>>>> ...
>>>>> 
>>>>> Then I ran cluster dumper to dump information about the clusters; this command works if I only care about the cluster centroids (both 0.5 release and 0.6-snapshot)
>>>>> 
>>>>> $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt
>>>>> Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>> MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>> 11/07/20 14:33:45 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>>> 11/07/20 14:33:56 INFO driver.MahoutDriver: Program took 11761 ms
>>>>> 
>>>>> but if I want to see the degree of membership of each point, I get another exception (yes, reproducible for both 0.5 release and 0.6-snapshot)
>>>>> 
>>>>> $ bin/mahout clusterdump --seqFileDir sensei/clusters/clusters-1 --output image-tag-clusters.txt --pointsDir sensei/clusteredPoints
>>>>> Running on hadoop, using HADOOP_HOME=/home/jeffrey04/Applications/hadoop-0.20.203.0
>>>>> HADOOP_CONF_DIR=/home/jeffrey04/Applications/hadoop-0.20.203.0/conf
>>>>> MAHOUT-JOB: /home/jeffrey04/Applications/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>>> 11/07/20 14:35:08 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=image-tag-clusters.txt, --pointsDir=sensei/clusteredPoints, --seqFileDir=sensei/clusters/clusters-1, --startPhase=0, --tempDir=temp}
>>>>> 11/07/20 14:35:10 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>>> 11/07/20 14:35:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>>>> 11/07/20 14:35:10 INFO compress.CodecPool: Got brand-new decompressor
>>>>> Exception in thread "main" java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
>>>>>         at org.apache.mahout.utils.clustering.ClusterDumper.readPoints(ClusterDumper.java:261)
>>>>>         at org.apache.mahout.utils.clustering.ClusterDumper.init(ClusterDumper.java:209)
>>>>>         at org.apache.mahout.utils.clustering.ClusterDumper.run(ClusterDumper.java:123)
>>>>>         at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:89)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>         at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
>>>>>         at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
>>>>>         at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:616)
>>>>>         at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>> 
>>>>> Erm, would writing a short program to call the API (btw, I can't seem to find the latest API doc?) be a better choice here? Or did I do anything wrong (yes, Java is not my main language, and I am very new to Mahout... and Hadoop)?
>>>>> 
>>>>> The data is converted from an arff file with about 1000 rows (resources) and 14k columns (tags), and it is just a subset of my data. (I actually made a mistake, so it is now generating resource clusters instead of tag clusters, but I am just doing this as a proof of concept to see whether Mahout is good enough for the task.)
>>>>> 
>>>>> Best wishes,
>>>>> Jeffrey04