mahout-user mailing list archives

From "Periya.Data" <periya.d...@gmail.com>
Subject Re: number of clusters (Canopy Clustering)
Date Thu, 05 Jan 2012 04:44:55 GMT
Hi Paritosh,
    Thanks for your suggestions. I am currently trying to use Canopy
Clustering to guess the number of clusters. I have tried various values
(between 0 and 1) for t1 and t2 (with t1 > t2), but I still get only one
cluster. I tried (0.9, 0.2), (0.05, 0.001), (0.005, 0.00001), etc. I thought
that making t2 very close to 0 would give me a lot of clusters, but,
strangely, I get only one cluster across a wide range of t1/t2 values.
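
For reference, here is a minimal Python sketch of canopy generation as I understand it. It is not Mahout's code: the function name canopies, the toy points, and the plain Euclidean distance are assumptions made up for illustration. It runs the same (t1, t2) pairs listed above against a handful of separated points and against a single point.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopies(points, t1, t2):
    """Return a list of (center, members) pairs using two thresholds, t1 > t2."""
    if not t1 > t2:
        raise ValueError("expected t1 > t2")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)
        # Loose threshold: every point within t1 of the center joins this canopy.
        members = [p for p in points if euclidean(center, p) < t1]
        result.append((center, members))
        # Tight threshold: points within t2 can no longer start a new canopy.
        remaining = [p for p in remaining if euclidean(center, p) >= t2]
    return result

# Hypothetical toy data, not taken from the real tfidf-vectors.
toy_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
single_point = [(0.3, 0.7)]
thresholds = [(0.9, 0.2), (0.05, 0.001), (0.005, 0.00001)]

for t1, t2 in thresholds:
    many = len(canopies(toy_points, t1, t2))
    one = len(canopies(single_point, t1, t2))
    print(f"t1={t1} t2={t2}: {many} canopies for 5 points, {one} for 1 point")
```

In this sketch, shrinking t2 increases the number of canopies only when there are several input points. With a single input point the result is always one canopy, whatever the thresholds.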

Is this because I am using just one text file for my analysis?

I have only one large text file and want to cluster the words and see how
they are clustered. I thought this would be a simple way to begin exploring
clustering/mahout.
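
To make the question concrete, here is a rough Python sketch of what I suspect may be happening. It assumes that each input file becomes one document and therefore one vector, which would be consistent with the "Map input records=1" counter in the k-means log quoted below. The helper names to_documents and term_count_vectors, the blank-line splitting rule, and the sample text are hypothetical. Simple term counts stand in for the real TF-IDF weights.

```python
from collections import Counter

def to_documents(text, split_paragraphs):
    """Treat the text as one document, or split it on blank lines."""
    if split_paragraphs:
        return [part for part in text.split("\n\n") if part.strip()]
    return [text]

def term_count_vectors(documents):
    """Build one sparse term-count vector (a Counter) per document."""
    return [Counter(doc.lower().split()) for doc in documents]

# Hypothetical sample standing in for the large text file.
sample = "apples and oranges\n\nclusters of vectors\n\nhadoop runs jobs"

whole_file = term_count_vectors(to_documents(sample, split_paragraphs=False))
paragraphs = term_count_vectors(to_documents(sample, split_paragraphs=True))

print("vectors when the file is one document:", len(whole_file))
print("vectors when split on blank lines:", len(paragraphs))
```

If the assumption holds, the clustering step only ever sees one vector for my file, and splitting the file into smaller documents before vectorizing would give it more than one vector to work with.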

Your suggestions are appreciated,
PD.

On Sat, Dec 31, 2011 at 2:48 AM, Paritosh Ranjan <pranjan@xebia.com> wrote:

> There can be two reasons for only one cluster being found.
>
> 1) The vectors are really close to each other and the clusters converge.
> 2) The distance measure you are using is not appropriate with your vector
> values.
>
> Try the following:
> 1) Analyze the vectors and the distance between them. Are they good
> candidates to be inside different clusters?
> 2) Use CanopyClustering first to guess the number of clusters
> (experiment a bit by changing the values of t1 and t2).
> 3) Then provide the clusters returned by CanopyClustering to KMeans.
> 4) Use EuclideanDistance instead of Squared...
>
> Paritosh
>
> ________________________________________
> From: Periya.Data [periya.data@gmail.com]
> Sent: Saturday, December 31, 2011 1:07 AM
> To: user@mahout.apache.org
> Subject: number of clusters
>
> Hi all,
>    I am a newbie to Mahout. I am running a basic k-means clustering on a
> sample txt file. No matter what number I give to the --numClusters
> parameter, I always get only one cluster (VL-0). Can someone please point
> out any mistake and suggest what I should do to see a decent number of
> clusters?
>
> I successfully converted the txt file into a seq-file and then into
> vectorized format.
>
> The command I use is the following:
>
> $MAHOUT_HOME/bin/mahout kmeans \
>     --input       /input/mahout/vectorized/tfidf-vectors \
>     --output      $HDFS_OUTPUT_DIR/clusters \
>     --clusters    $HDFS_OUTPUT_DIR/initialclusters \
>     --maxIter     10 \
>     --numClusters 20 \
>     --clustering \
>     --overwrite
>
>
> Here is the console output:
> =====================
>
> pd@PeriyaData:~/Mahout/examples/bin$ ./bigdata_kmeans.sh
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
> HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
> MAHOUT-JOB:
> /home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
> 11/12/30 15:59:23 INFO common.AbstractJob: Command line arguments:
> {--clustering=null, --clusters=/output/mahout/kmeans/initialclusters,
> --convergenceDelta=0.5,
>
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> --endPhase=2147483647, --input=/input/mahout/vectorized/tfidf-vectors,
> --maxIter=10, --method=mapreduce, --numClusters=20,
> --output=/output/mahout/kmeans/clusters, --overwrite=null, --startPhase=0,
> --tempDir=temp}
> 11/12/30 15:59:23 INFO common.HadoopUtil: Deleting
> /output/mahout/kmeans/clusters
> 11/12/30 15:59:24 INFO common.HadoopUtil: Deleting
> /output/mahout/kmeans/initialclusters
> 11/12/30 15:59:25 INFO util.NativeCodeLoader: Loaded the native-hadoop
> library
> 11/12/30 15:59:25 INFO zlib.ZlibFactory: Successfully loaded & initialized
> native-zlib library
> 11/12/30 15:59:25 INFO compress.CodecPool: Got brand-new compressor
> 11/12/30 15:59:25 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to
> /output/mahout/kmeans/initialclusters/part-randomSeed
> 11/12/30 15:59:25 INFO kmeans.KMeansDriver: Input:
> /input/mahout/vectorized/tfidf-vectors Clusters In:
> /output/mahout/kmeans/initialclusters/part-randomSeed Out:
> /output/mahout/kmeans/clusters Distance:
> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
> 11/12/30 15:59:25 INFO kmeans.KMeansDriver: convergence: 0.5 max
> Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable
> Input Vectors: {}
> 11/12/30 15:59:25 INFO kmeans.KMeansDriver: K-Means Iteration 1
> 11/12/30 15:59:25 INFO input.FileInputFormat: Total input paths to process
> : 1
> 11/12/30 15:59:26 INFO mapred.JobClient: Running job: job_201112301129_0029
> 11/12/30 15:59:27 INFO mapred.JobClient:  map 0% reduce 0%
> 11/12/30 15:59:30 INFO mapred.JobClient:  map 100% reduce 0%
> 11/12/30 15:59:39 INFO mapred.JobClient:  map 100% reduce 100%
> 11/12/30 15:59:39 INFO mapred.JobClient: Job complete:
> job_201112301129_0029
> 11/12/30 15:59:39 INFO mapred.JobClient: Counters: 23
> 11/12/30 15:59:39 INFO mapred.JobClient:   Job Counters
> 11/12/30 15:59:39 INFO mapred.JobClient:     Launched reduce tasks=1
> 11/12/30 15:59:39 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3074
> 11/12/30 15:59:39 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 11/12/30 15:59:39 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 11/12/30 15:59:39 INFO mapred.JobClient:     Launched map tasks=1
> 11/12/30 15:59:39 INFO mapred.JobClient:     Data-local map tasks=1
> 11/12/30 15:59:39 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8299
> 11/12/30 15:59:39 INFO mapred.JobClient:   Clustering
> 11/12/30 15:59:39 INFO mapred.JobClient:     Converged Clusters=1
> 11/12/30 15:59:39 INFO mapred.JobClient:   FileSystemCounters
> 11/12/30 15:59:39 INFO mapred.JobClient:     FILE_BYTES_READ=185593
> 11/12/30 15:59:39 INFO mapred.JobClient:     HDFS_BYTES_READ=139801
> 11/12/30 15:59:39 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=477505
> 11/12/30 15:59:39 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=92991
> 11/12/30 15:59:39 INFO mapred.JobClient:   Map-Reduce Framework
> 11/12/30 15:59:39 INFO mapred.JobClient:     Reduce input groups=1
> 11/12/30 15:59:39 INFO mapred.JobClient:     Combine output records=1
> 11/12/30 15:59:39 INFO mapred.JobClient:     Map input records=1
> 11/12/30 15:59:39 INFO mapred.JobClient:     Reduce shuffle bytes=0
> 11/12/30 15:59:39 INFO mapred.JobClient:     Reduce output records=1
> 11/12/30 15:59:39 INFO mapred.JobClient:     Spilled Records=2
> 11/12/30 15:59:39 INFO mapred.JobClient:     Map output bytes=185582
> 11/12/30 15:59:39 INFO mapred.JobClient:     Combine input records=1
> 11/12/30 15:59:39 INFO mapred.JobClient:     Map output records=1
> 11/12/30 15:59:39 INFO mapred.JobClient:     SPLIT_RAW_BYTES=137
> 11/12/30 15:59:39 INFO mapred.JobClient:     Reduce input records=1
> 11/12/30 15:59:39 INFO kmeans.KMeansDriver: Clustering data
> 11/12/30 15:59:39 INFO kmeans.KMeansDriver: Running Clustering
> 11/12/30 15:59:39 INFO kmeans.KMeansDriver: Input:
> /input/mahout/vectorized/tfidf-vectors Clusters In:
> /output/mahout/kmeans/clusters/clusters-1-final Out:
> /output/mahout/kmeans/clusters/clusteredPoints Distance:
> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@14e4e31
> 11/12/30 15:59:39 INFO kmeans.KMeansDriver: convergence: 0.5 Input Vectors:
> org.apache.mahout.math.VectorWritable
> 11/12/30 15:59:40 INFO input.FileInputFormat: Total input paths to process
> : 1
> 11/12/30 15:59:40 INFO mapred.JobClient: Running job: job_201112301129_0030
> 11/12/30 15:59:41 INFO mapred.JobClient:  map 0% reduce 0%
> 11/12/30 15:59:45 INFO mapred.JobClient:  map 100% reduce 0%
> 11/12/30 15:59:45 INFO mapred.JobClient: Job complete:
> job_201112301129_0030
> 11/12/30 15:59:45 INFO mapred.JobClient: Counters: 13
> 11/12/30 15:59:45 INFO mapred.JobClient:   Job Counters
> 11/12/30 15:59:45 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3815
> 11/12/30 15:59:45 INFO mapred.JobClient:     Total time spent by all
> reduces waiting after reserving slots (ms)=0
> 11/12/30 15:59:45 INFO mapred.JobClient:     Total time spent by all maps
> waiting after reserving slots (ms)=0
> 11/12/30 15:59:45 INFO mapred.JobClient:     Launched map tasks=1
> 11/12/30 15:59:45 INFO mapred.JobClient:     Data-local map tasks=1
> 11/12/30 15:59:45 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
> 11/12/30 15:59:45 INFO mapred.JobClient:   FileSystemCounters
> 11/12/30 15:59:45 INFO mapred.JobClient:     HDFS_BYTES_READ=186054
> 11/12/30 15:59:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=52059
> 11/12/30 15:59:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=92956
> 11/12/30 15:59:45 INFO mapred.JobClient:   Map-Reduce Framework
> 11/12/30 15:59:45 INFO mapred.JobClient:     Map input records=1
> 11/12/30 15:59:45 INFO mapred.JobClient:     Spilled Records=0
> 11/12/30 15:59:45 INFO mapred.JobClient:     Map output records=1
> 11/12/30 15:59:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=137
> 11/12/30 15:59:45 INFO driver.MahoutDriver: Program took 21888 ms (Minutes:
> 0.3648)
> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
> Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
> HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
> MAHOUT-JOB:
> /home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
> 11/12/30 15:59:48 INFO common.AbstractJob: Command line arguments:
> {--dictionary=/input/mahout/vectorized/dictionary.file-0,
> --dictionaryType=sequencefile,
>
> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
> --endPhase=2147483647, --numWords=30,
> --output=/home/pd/Mahout/examples/output/clusteranalyze.txt,
> --outputFormat=TEXT,
> --pointsDir=/output/mahout/kmeans/clusters/clusteredPoints,
> --seqFileDir=/output/mahout/kmeans/clusters/clusters-*-final,
> --startPhase=0, --tempDir=temp}
> *11/12/30 15:59:49 INFO clustering.ClusterDumper: Wrote 1 clusters*
> 11/12/30 15:59:49 INFO driver.MahoutDriver: Program took 1171 ms (Minutes:
> 0.01951666666666667)
> pd@PeriyaData:~/Mahout/examples/bin$
>
>
> pd@PeriyaData:~/Mahout/examples/bin$ hadoop fs -ls
> /output/mahout/kmeans/clusters
> Found 2 items
> drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59
> /output/mahout/kmeans/clusters/clusteredPoints
> drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59
> /output/mahout/kmeans/clusters/clusters-1-final
> pd@PeriyaData:~/Mahout/rabi/examples/bin$ hadoop fs -ls
> /output/mahout/kmeans/clusters/clusters-1-final
> Found 3 items
> -rw-r--r--   1 pd supergroup          0 2011-12-30 15:59
> /output/mahout/kmeans/clusters/clusters-1-final/_SUCCESS
> drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59
> /output/mahout/kmeans/clusters/clusters-1-final/_logs
> -rw-r--r--   1 pd supergroup      92991 2011-12-30 15:59
> /output/mahout/kmeans/clusters/clusters-1-final/part-r-00000
> pd@PeriyaData:~/Mahout/examples/bin$
>
>
> Thanks,
> PD
>
