mahout-user mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: number of clusters (Canopy Clustering)
Date Sun, 08 Jan 2012 15:48:03 GMT
I'm almost certain there is currently no way to do this from the command
line. You could write a small utility to do this (see
CanopyClusterer.buildClustersSeq() for a simple skeleton you could use).
But I would suggest trying CosineDistanceMeasure instead of Euclidean
for text. If you have a small number of input files, you could run in
sequential mode (-xm sequential) in the debugger and set a breakpoint in,
or just add some printouts to, CanopyClusterer.addPointToCanopies(...).
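
A minimal sketch of such a utility follows, in plain Python rather than
against Mahout's Java API. The names cosine_distance and build_canopies are
hypothetical, vectors are assumed to be sparse {index: weight} dicts, and the
single-pass rule used here (a point joins every canopy closer than t1, and
starts a new canopy unless it is closer than t2 to an existing one) is an
assumption about the canopy scheme, not code copied from CanopyClusterer.
The trace flag plays the role of the printouts suggested above.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for sparse {index: weight} vectors."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: an empty vector is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def build_canopies(points, distance, t1, t2, trace=False):
    """Single-pass canopy assignment; returns a list of (center, member_indices)."""
    canopies = []
    for p_idx, point in enumerate(points):
        strongly_bound = False
        for c_idx, (center, members) in enumerate(canopies):
            d = distance(center, point)
            if trace:
                print(f"point {p_idx} -> canopy {c_idx}: distance {d:.4f}")
            if d < t1:
                members.append(p_idx)  # loosely bound: counted in this canopy
            if d < t2:
                strongly_bound = True  # close enough that no new canopy is needed
        if not strongly_bound:
            canopies.append((point, [p_idx]))  # the point seeds a new canopy
    return canopies


# Three toy "documents": the first two point in the same direction,
# the third shares no terms with them.
docs = [{0: 1.0, 1: 2.0}, {0: 2.0, 1: 4.0}, {2: 3.0}]
canopies = build_canopies(docs, cosine_distance, t1=0.9, t2=0.2, trace=True)
print([members for _, members in canopies])  # [[0, 1], [2]]
```

With non-negative TF-IDF weights the cosine distance stays between 0 and 1,
so t1 and t2 values in that range are directly comparable to it.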

On 1/7/12 5:08 AM, Paritosh Ranjan wrote:
> "Is there a way for me to determine the distance from command line? "
>
> I am not aware of any. If anyone else is, then please suggest.
> ________________________________________
> From: Periya.Data [periya.data@gmail.com]
> Sent: Saturday, January 07, 2012 6:31 AM
> To: user@mahout.apache.org
> Subject: Re: number of clusters (Canopy Clustering)
>
> I agree that if all the distances are < t2, I will get only one cluster. I
> was just "hoping" that they do fall within that range and was basically
> shooting in the dark when twiddling with various t1 and t2 values.
>
> Is there an easy way to determine the distance between vectors? In the
> CanopyCluster shell script, I use EuclideanDistanceMeasure. The TFIDF
> vectors are in binary and I have no idea how to proceed.
>
> Is there a way for me to determine the distance from command line? So far,
> I am not using any Java program to do my experiments. As a beginner, I am
> running shell scripts and learning.
>
> $MAHOUT_HOME/bin/mahout canopy \
>     -i /input/mahout/vectorized/tfidf-vectors \
>     -o $HDFS_OUTPUT_DIR/bigdata-canopy-centroids \
>     -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
>     -t1 0.9 \
>     -t2 0.2 \
>     --overwrite
>
>
>
> Thanks for your suggestions,
> PD.
>
>
> On Thu, Jan 5, 2012 at 8:47 PM, Paritosh Ranjan <pranjan@xebia.com> wrote:
>
>> What is the distance between vectors with the Distance measure you are
>> using?
>> If all the vectors lie within the range of t2, then you will get only 1
>> cluster.
>>
>> Write a piece of test code that creates vectors from the data you are
>> using, and then find the distance between the vectors (using the same
>> distance measure you are using while clustering). If all distances are
>> within t2, then you will get only one cluster.
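
The check described above can be sketched as follows. This is plain Python,
not Mahout code: euclidean_distance and distance_summary are hypothetical
names, the vectors are assumed to be sparse {index: weight} dicts, and the
distance is computed directly rather than through Mahout's
EuclideanDistanceMeasure.

```python
import itertools
import math


def euclidean_distance(a, b):
    """Euclidean distance between two sparse {index: weight} vectors."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def distance_summary(vectors, distance, t2):
    """Return (min, max, fraction of pairs closer than t2) over all pairs."""
    dists = [distance(a, b) for a, b in itertools.combinations(vectors, 2)]
    if not dists:
        return None  # fewer than two vectors: there are no pairs to compare
    within = sum(1 for d in dists if d < t2)
    return min(dists), max(dists), within / len(dists)


vectors = [{0: 3.0}, {0: 3.0, 1: 4.0}, {1: 4.0}]
print(distance_summary(vectors, euclidean_distance, t2=0.2))  # (3.0, 5.0, 0.0)
```

If the reported fraction is 1.0, every pair lies within t2 and a single
cluster is the expected outcome. If it is 0.0, as in the toy data here, t2 is
smaller than every pairwise distance and needs to be rescaled to the range
the distances actually cover.
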
>>
>>
>> On 05-01-2012 10:14, Periya.Data wrote:
>>
>>> Hi Paritosh,
>>>      Thanks for your suggestions. I am currently trying to use Canopy
>>> Clustering to guess the number of clusters. I have tried various values
>>> (between 0 and 1) for t1 and t2 (t1 > t2). Still I get only one cluster. I
>>> tried (0.9, 0.2), (0.05, 0.001), (0.005, 0.00001), etc. I thought that if I
>>> made t2 very close to 0, I would get a lot of clusters, but strangely I
>>> am getting only one cluster for a wide range of t1/t2 values.
>>>
>>> Is this because I am using just one text file for my analysis?
>>>
>>> I have only one large text file and want to cluster the words and see how
>>> they are clustered. I thought this would be a simple way to begin
>>> exploring clustering/mahout.
>>>
>>> Your suggestions are appreciated,
>>> PD.
>>>
>>> On Sat, Dec 31, 2011 at 2:48 AM, Paritosh Ranjan <pranjan@xebia.com> wrote:
>>>
>>>> There can be two reasons for only one cluster being found.
>>>> 1) The vectors are really close to each other and the clusters converge.
>>>> 2) The distance measure you are using is not appropriate with your vector
>>>> values.
>>>>
>>>> Try the following:
>>>> 1) Analyze the vectors and the distance between them. Are they good
>>>> candidates to be inside different clusters?
>>>> 2) Use CanopyClustering first to guess the number of clusters
>>>> (experiment a bit by changing the values of t1 and t2).
>>>> 3) Then provide the clusters returned by CanopyClustering to KMeans.
>>>> 4) Use EuclideanDistance instead of Squared...
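
Step 3 can be sketched as a k-means run that starts from centers supplied by
an earlier pass instead of random seeds. This is a plain-Python illustration
under stated assumptions, not Mahout's KMeansDriver: the names
squared_distance and kmeans are hypothetical, points are dense tuples, and
the seeds are picked by hand here where a canopy pass would normally provide
them.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two dense vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centers, max_iter=10):
    """Lloyd iterations seeded with the given centers.

    Returns (centers, assignment), where assignment[i] is the index of the
    center nearest to points[i].
    """
    centers = [list(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centers are stable
        assignment = new_assignment
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # a center that attracts no points is left unchanged
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = [points[0], points[2]]  # stand-ins for centers found by a canopy pass
centers, assignment = kmeans(points, seeds)
print(assignment)  # [0, 0, 1, 1]
print(centers)     # [[0.0, 0.5], [10.0, 10.5]]
```

In this sketch the number of clusters can never exceed the number of seeds,
which is why a canopy pass is a convenient way to choose it.
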
>>>>
>>>> Paritosh
>>>>
>>>> ________________________________________
>>>> From: Periya.Data [periya.data@gmail.com]
>>>> Sent: Saturday, December 31, 2011 1:07 AM
>>>> To: user@mahout.apache.org
>>>> Subject: number of clusters
>>>>
>>>> Hi all,
>>>>     I am a newbie to Mahout. I am running a basic k-means clustering on a
>>>> sample txt file. No matter what number I give to the --numClusters
>>>> parameter, I always get only one cluster (VL-0). Can someone please point
>>>> out any mistake and suggest what I should do to see a decent number of
>>>> clusters?
>>>>
>>>> I successfully convert the txt file into seq-file and then to vectorized
>>>> format.
>>>>
>>>> The command I use is the following:
>>>>
>>>> $MAHOUT_HOME/bin/mahout kmeans \
>>>>     --input /input/mahout/vectorized/tfidf-vectors \
>>>>     --output $HDFS_OUTPUT_DIR/clusters \
>>>>     --clusters $HDFS_OUTPUT_DIR/initialclusters \
>>>>     --maxIter 10 \
>>>>     --numClusters 20 \
>>>>     --clustering \
>>>>     --overwrite
>>>>
>>>>
>>>> Here is the console output:
>>>> =====================
>>>>
>>>> pd@PeriyaData:~/Mahout/examples/bin$ ./bigdata_kmeans.sh
>>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>>> Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
>>>> HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
>>>> MAHOUT-JOB: /home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>> 11/12/30 15:59:23 INFO common.AbstractJob: Command line arguments:
>>>> {--clustering=null, --clusters=/output/mahout/kmeans/initialclusters,
>>>> --convergenceDelta=0.5,
>>>> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>>>> --endPhase=2147483647, --input=/input/mahout/vectorized/tfidf-vectors,
>>>> --maxIter=10, --method=mapreduce, --numClusters=20,
>>>> --output=/output/mahout/kmeans/clusters, --overwrite=null,
>>>> --startPhase=0, --tempDir=temp}
>>>> 11/12/30 15:59:23 INFO common.HadoopUtil: Deleting /output/mahout/kmeans/clusters
>>>> 11/12/30 15:59:24 INFO common.HadoopUtil: Deleting /output/mahout/kmeans/initialclusters
>>>> 11/12/30 15:59:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>>> 11/12/30 15:59:25 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>>> 11/12/30 15:59:25 INFO compress.CodecPool: Got brand-new compressor
>>>> 11/12/30 15:59:25 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to /output/mahout/kmeans/initialclusters/part-randomSeed
>>>> 11/12/30 15:59:25 INFO kmeans.KMeansDriver: Input: /input/mahout/vectorized/tfidf-vectors Clusters In: /output/mahout/kmeans/initialclusters/part-randomSeed Out: /output/mahout/kmeans/clusters Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
>>>> 11/12/30 15:59:25 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
>>>> 11/12/30 15:59:25 INFO kmeans.KMeansDriver: K-Means Iteration 1
>>>> 11/12/30 15:59:25 INFO input.FileInputFormat: Total input paths to process : 1
>>>> 11/12/30 15:59:26 INFO mapred.JobClient: Running job: job_201112301129_0029
>>>> 11/12/30 15:59:27 INFO mapred.JobClient:  map 0% reduce 0%
>>>> 11/12/30 15:59:30 INFO mapred.JobClient:  map 100% reduce 0%
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:  map 100% reduce 100%
>>>> 11/12/30 15:59:39 INFO mapred.JobClient: Job complete: job_201112301129_0029
>>>> 11/12/30 15:59:39 INFO mapred.JobClient: Counters: 23
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:   Job Counters
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Launched reduce tasks=1
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3074
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Launched map tasks=1
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Data-local map tasks=1
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8299
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:   Clustering
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Converged Clusters=1
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:   FileSystemCounters
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     FILE_BYTES_READ=185593
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     HDFS_BYTES_READ=139801
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=477505
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=92991
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:   Map-Reduce Framework
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Reduce input groups=1
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Combine output records=1
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Map input records=1
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Reduce shuffle bytes=0
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Reduce output records=1
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Spilled Records=2
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Map output bytes=185582
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Combine input records=1
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Map output records=1
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     SPLIT_RAW_BYTES=137
>>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Reduce input records=1
>>>> 11/12/30 15:59:39 INFO kmeans.KMeansDriver: Clustering data
>>>> 11/12/30 15:59:39 INFO kmeans.KMeansDriver: Running Clustering
>>>> 11/12/30 15:59:39 INFO kmeans.KMeansDriver: Input: /input/mahout/vectorized/tfidf-vectors Clusters In: /output/mahout/kmeans/clusters/clusters-1-final Out: /output/mahout/kmeans/clusters/clusteredPoints Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@14e4e31
>>>> 11/12/30 15:59:39 INFO kmeans.KMeansDriver: convergence: 0.5 Input Vectors: org.apache.mahout.math.VectorWritable
>>>> 11/12/30 15:59:40 INFO input.FileInputFormat: Total input paths to process : 1
>>>> 11/12/30 15:59:40 INFO mapred.JobClient: Running job: job_201112301129_0030
>>>> 11/12/30 15:59:41 INFO mapred.JobClient:  map 0% reduce 0%
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:  map 100% reduce 0%
>>>> 11/12/30 15:59:45 INFO mapred.JobClient: Job complete: job_201112301129_0030
>>>> 11/12/30 15:59:45 INFO mapred.JobClient: Counters: 13
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:   Job Counters
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3815
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Launched map tasks=1
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Data-local map tasks=1
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:   FileSystemCounters
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:     HDFS_BYTES_READ=186054
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=52059
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=92956
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:   Map-Reduce Framework
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Map input records=1
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Spilled Records=0
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Map output records=1
>>>> 11/12/30 15:59:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=137
>>>> 11/12/30 15:59:45 INFO driver.MahoutDriver: Program took 21888 ms (Minutes: 0.3648)
>>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>>> Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
>>>> HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
>>>> MAHOUT-JOB: /home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>>> 11/12/30 15:59:48 INFO common.AbstractJob: Command line arguments:
>>>> {--dictionary=/input/mahout/vectorized/dictionary.file-0,
>>>> --dictionaryType=sequencefile,
>>>> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>>>> --endPhase=2147483647, --numWords=30,
>>>> --output=/home/pd/Mahout/examples/output/clusteranalyze.txt,
>>>> --outputFormat=TEXT,
>>>> --pointsDir=/output/mahout/kmeans/clusters/clusteredPoints,
>>>> --seqFileDir=/output/mahout/kmeans/clusters/clusters-*-final,
>>>> --startPhase=0, --tempDir=temp}
>>>> *11/12/30 15:59:49 INFO clustering.ClusterDumper: Wrote 1 clusters*
>>>> 11/12/30 15:59:49 INFO driver.MahoutDriver: Program took 1171 ms (Minutes: 0.01951666666666667)
>>>> pd@PeriyaData:~/Mahout/examples/bin$
>>>>
>>>>
>>>> pd@PeriyaData:~/Mahout/examples/bin$ hadoop fs -ls /output/mahout/kmeans/clusters
>>>> Found 2 items
>>>> drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusteredPoints
>>>> drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final
>>>> pd@PeriyaData:~/Mahout/rabi/examples/bin$ hadoop fs -ls /output/mahout/kmeans/clusters/clusters-1-final
>>>> Found 3 items
>>>> -rw-r--r--   1 pd supergroup          0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/_SUCCESS
>>>> drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/_logs
>>>> -rw-r--r--   1 pd supergroup      92991 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/part-r-00000
>>>> pd@PeriyaData:~/Mahout/examples/bin$
>>>>
>>>>
>>>> Thanks,
>>>> PD
>>>>
>>>>
>>
>

