mahout-user mailing list archives

From "Periya.Data" <periya.d...@gmail.com>
Subject Re: number of clusters (Canopy Clustering)
Date Sat, 07 Jan 2012 05:31:15 GMT
I agree that if all the distances are < t2, I will get only one cluster. I
was just "hoping" that they would not all fall within that range, and was
basically shooting in the dark when twiddling with various t1 and t2 values.
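
To make the t2 argument concrete, here is a minimal single-pass sketch of the canopy idea in plain Python. It is a simplified illustration, not Mahout's implementation: the names `euclidean` and `canopies` and the dict-based sparse vectors are assumptions made for this example. A point opens a new canopy only if it is not within t2 of any existing canopy center, so when every pairwise distance is below t2 (or when the input holds a single vector) exactly one canopy results, whatever t1 is.

```python
import math


def euclidean(a, b):
    """Euclidean distance between sparse vectors stored as {index: weight} dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def canopies(points, t1, t2, distance=euclidean):
    """Group points into canopies; returns a list of (center, members) pairs.

    t1 is the loose threshold (membership), t2 the tight one (decides
    whether a point may start a new canopy). Requires t1 > t2.
    """
    if not t1 > t2:
        raise ValueError("expected t1 > t2")
    result = []
    for p in points:
        strongly_bound = False
        for center, members in result:
            d = distance(center, p)
            if d < t1:
                members.append(p)      # loosely covered by this canopy
            if d < t2:
                strongly_bound = True  # too close to start a canopy of its own
        if not strongly_bound:
            result.append((p, [p]))    # p becomes the center of a new canopy
    return result


if __name__ == "__main__":
    # Hypothetical one-dimensional data: two tight groups far apart.
    pts = [{0: 0.0}, {0: 0.1}, {0: 5.0}, {0: 5.05}]
    print(len(canopies(pts, t1=0.9, t2=0.2)))
    print(len(canopies(pts, t1=20.0, t2=10.0)))
```

With thresholds far larger than every pairwise distance the second call collapses everything into one canopy, which is the behaviour described above.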

Is there an easy way to determine the distance between vectors? In the
CanopyCluster shell script, I use EuclideanDistanceMeasure. The TF-IDF
vectors are stored in a binary format and I have no idea how to inspect them.

Is there a way for me to determine the distance from the command line? So
far, I am not using any Java program to do my experiments. As a beginner, I
am running shell scripts and learning.
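
One possible way to get a feel for the distances without writing Java is sketched below. It assumes the vectors have already been converted to text by some means (Mahout ships `seqdumper` and `vectordump` utilities for printing sequence files; their options differ between versions, so check their help output) and loaded as `{term_index: weight}` dicts. The function names and the sample data are hypothetical. The script reports the minimum, mean and maximum pairwise Euclidean distance, which gives a range from which to pick t1 and t2.

```python
import itertools
import math


def euclidean(a, b):
    """Euclidean distance between sparse vectors stored as {index: weight} dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def distance_summary(vectors):
    """Return (min, mean, max) of all pairwise distances.

    Returns None when there are fewer than two vectors, because then
    there is no pair to measure and clustering cannot split anything.
    """
    dists = [euclidean(a, b) for a, b in itertools.combinations(vectors, 2)]
    if not dists:
        return None
    return min(dists), sum(dists) / len(dists), max(dists)


if __name__ == "__main__":
    # Hypothetical TF-IDF vectors; replace with values dumped from your data.
    sample = [
        {0: 1.2, 3: 0.4},
        {0: 1.0, 2: 2.5},
        {1: 3.1, 3: 0.4},
    ]
    summary = distance_summary(sample)
    if summary is None:
        print("fewer than two vectors: nothing to compare")
    else:
        print("min=%.3f mean=%.3f max=%.3f" % summary)
```

If the summary comes back as `None`, the input holds at most one vector. The k-means log quoted further down reports `Map input records=1`, which would be consistent with that situation.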

$MAHOUT_HOME/bin/mahout canopy \
    -i  /input/mahout/vectorized/tfidf-vectors \
    -o  $HDFS_OUTPUT_DIR/bigdata-canopy-centroids \
    -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
    -t1 0.9 \
    -t2 0.2 \
    --overwrite



Thanks for your suggestions,
PD.


On Thu, Jan 5, 2012 at 8:47 PM, Paritosh Ranjan <pranjan@xebia.com> wrote:

> What is the distance between vectors with the Distance measure you are
> using?
> If all the vectors lie within the range of t2, then you will get only 1
> cluster.
>
> Write a small piece of test code which creates vectors from the data you are
> using, and then find the distance between the vectors (using the same
> distance measure you are using while clustering). If all distances are
> within t2, then you will get only one cluster.
>
>
> On 05-01-2012 10:14, Periya.Data wrote:
>
>> Hi Paritosh,
>>     Thanks for your suggestions. I am currently trying to use Canopy
>> Clustering to guess the number of clusters. I have tried various values
>> (between 0 and 1) for t1 and t2 (t1 > t2). Still I get only one cluster. I
>> tried (0.9, 0.2), (0.05, 0.001), (0.005, 0.00001) etc. I thought that if I
>> made t2 very close to 0, I would get a lot of clusters, but strangely I am
>> getting only one cluster for a wide range of t1/t2 values.
>>
>> Is this because I am using just one text file for my analysis?
>>
>> I have only one large text file and want to cluster the words and see how
>> they are clustered. I thought this would be a simple way to begin exploring
>> clustering/Mahout.
>>
>> Your suggestions are appreciated,
>> PD.
>>
>> On Sat, Dec 31, 2011 at 2:48 AM, Paritosh Ranjan <pranjan@xebia.com> wrote:
>>
>>> There can be two reasons for only one cluster being found.
>>>
>>> 1) The vectors are really close to each other and the clusters converge.
>>> 2) The distance measure you are using is not appropriate for your vector
>>> values.
>>>
>>> Try to
>>> 1) Analyze the vectors and the distance between them. Are they good
>>> candidates to be inside different clusters?
>>> 2) Try to use CanopyClustering first to guess the number of clusters
>>> (experiment a bit by changing the values of t1 and t2).
>>> 3) Then provide the clusters returned by CanopyClustering to KMeans.
>>> 4) Use EuclideanDistance instead of Squared...
>>>
>>> Paritosh
>>>
>>> ________________________________________
>>> From: Periya.Data [periya.data@gmail.com]
>>> Sent: Saturday, December 31, 2011 1:07 AM
>>> To: user@mahout.apache.org
>>> Subject: number of clusters
>>>
>>> Hi all,
>>>    I am a newbie to Mahout. I am running a basic k-means clustering on a
>>> sample txt file. No matter what number I give to the --numClusters
>>> parameter, I always get only one cluster (VL-0). Can someone please point
>>> out any mistake and suggest what I should do to see a decent number of
>>> clusters?
>>>
>>> I successfully convert the txt file into seq-file and then to vectorized
>>> format.
>>>
>>> The command I use is the following:
>>>
>>> $MAHOUT_HOME/bin/mahout kmeans \
>>>     --input       /input/mahout/vectorized/tfidf-vectors \
>>>     --output      $HDFS_OUTPUT_DIR/clusters \
>>>     --clusters    $HDFS_OUTPUT_DIR/initialclusters \
>>>     --maxIter     10 \
>>>     --numClusters 20 \
>>>     --clustering \
>>>     --overwrite
>>>
>>>
>>> Here is the console output:
>>> =====================
>>>
>>> pd@PeriyaData:~/Mahout/examples/bin$ ./bigdata_kmeans.sh
>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>> Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
>>> HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
>>> MAHOUT-JOB: /home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>> 11/12/30 15:59:23 INFO common.AbstractJob: Command line arguments:
>>> {--clustering=null, --clusters=/output/mahout/kmeans/initialclusters,
>>> --convergenceDelta=0.5,
>>> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>>> --endPhase=2147483647, --input=/input/mahout/vectorized/tfidf-vectors,
>>> --maxIter=10, --method=mapreduce, --numClusters=20,
>>> --output=/output/mahout/kmeans/clusters, --overwrite=null,
>>> --startPhase=0, --tempDir=temp}
>>> 11/12/30 15:59:23 INFO common.HadoopUtil: Deleting /output/mahout/kmeans/clusters
>>> 11/12/30 15:59:24 INFO common.HadoopUtil: Deleting /output/mahout/kmeans/initialclusters
>>> 11/12/30 15:59:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>>> 11/12/30 15:59:25 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>>> 11/12/30 15:59:25 INFO compress.CodecPool: Got brand-new compressor
>>> 11/12/30 15:59:25 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to /output/mahout/kmeans/initialclusters/part-randomSeed
>>> 11/12/30 15:59:25 INFO kmeans.KMeansDriver: Input: /input/mahout/vectorized/tfidf-vectors Clusters In: /output/mahout/kmeans/initialclusters/part-randomSeed Out: /output/mahout/kmeans/clusters Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
>>> 11/12/30 15:59:25 INFO kmeans.KMeansDriver: convergence: 0.5 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
>>> 11/12/30 15:59:25 INFO kmeans.KMeansDriver: K-Means Iteration 1
>>> 11/12/30 15:59:25 INFO input.FileInputFormat: Total input paths to process : 1
>>> 11/12/30 15:59:26 INFO mapred.JobClient: Running job: job_201112301129_0029
>>> 11/12/30 15:59:27 INFO mapred.JobClient:  map 0% reduce 0%
>>> 11/12/30 15:59:30 INFO mapred.JobClient:  map 100% reduce 0%
>>> 11/12/30 15:59:39 INFO mapred.JobClient:  map 100% reduce 100%
>>> 11/12/30 15:59:39 INFO mapred.JobClient: Job complete: job_201112301129_0029
>>> 11/12/30 15:59:39 INFO mapred.JobClient: Counters: 23
>>> 11/12/30 15:59:39 INFO mapred.JobClient:   Job Counters
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Launched reduce tasks=1
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3074
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Launched map tasks=1
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Data-local map tasks=1
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8299
>>> 11/12/30 15:59:39 INFO mapred.JobClient:   Clustering
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Converged Clusters=1
>>> 11/12/30 15:59:39 INFO mapred.JobClient:   FileSystemCounters
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     FILE_BYTES_READ=185593
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     HDFS_BYTES_READ=139801
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=477505
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=92991
>>> 11/12/30 15:59:39 INFO mapred.JobClient:   Map-Reduce Framework
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Reduce input groups=1
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Combine output records=1
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Map input records=1
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Reduce shuffle bytes=0
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Reduce output records=1
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Spilled Records=2
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Map output bytes=185582
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Combine input records=1
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Map output records=1
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     SPLIT_RAW_BYTES=137
>>> 11/12/30 15:59:39 INFO mapred.JobClient:     Reduce input records=1
>>> 11/12/30 15:59:39 INFO kmeans.KMeansDriver: Clustering data
>>> 11/12/30 15:59:39 INFO kmeans.KMeansDriver: Running Clustering
>>> 11/12/30 15:59:39 INFO kmeans.KMeansDriver: Input: /input/mahout/vectorized/tfidf-vectors Clusters In: /output/mahout/kmeans/clusters/clusters-1-final Out: /output/mahout/kmeans/clusters/clusteredPoints Distance: org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure@14e4e31
>>> 11/12/30 15:59:39 INFO kmeans.KMeansDriver: convergence: 0.5 Input Vectors: org.apache.mahout.math.VectorWritable
>>> 11/12/30 15:59:40 INFO input.FileInputFormat: Total input paths to process : 1
>>> 11/12/30 15:59:40 INFO mapred.JobClient: Running job: job_201112301129_0030
>>> 11/12/30 15:59:41 INFO mapred.JobClient:  map 0% reduce 0%
>>> 11/12/30 15:59:45 INFO mapred.JobClient:  map 100% reduce 0%
>>> 11/12/30 15:59:45 INFO mapred.JobClient: Job complete: job_201112301129_0030
>>> 11/12/30 15:59:45 INFO mapred.JobClient: Counters: 13
>>> 11/12/30 15:59:45 INFO mapred.JobClient:   Job Counters
>>> 11/12/30 15:59:45 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3815
>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Launched map tasks=1
>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Data-local map tasks=1
>>> 11/12/30 15:59:45 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
>>> 11/12/30 15:59:45 INFO mapred.JobClient:   FileSystemCounters
>>> 11/12/30 15:59:45 INFO mapred.JobClient:     HDFS_BYTES_READ=186054
>>> 11/12/30 15:59:45 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=52059
>>> 11/12/30 15:59:45 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=92956
>>> 11/12/30 15:59:45 INFO mapred.JobClient:   Map-Reduce Framework
>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Map input records=1
>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Spilled Records=0
>>> 11/12/30 15:59:45 INFO mapred.JobClient:     Map output records=1
>>> 11/12/30 15:59:45 INFO mapred.JobClient:     SPLIT_RAW_BYTES=137
>>> 11/12/30 15:59:45 INFO driver.MahoutDriver: Program took 21888 ms (Minutes: 0.3648)
>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>> Running on hadoop, using HADOOP_HOME=/home/pd/CDH3/hadoop
>>> HADOOP_CONF_DIR=/home/pd/CDH3/hadoop/conf
>>> MAHOUT-JOB: /home/pd/CDH3/mahout/examples/target/mahout-examples-0.6-SNAPSHOT-job.jar
>>> 11/12/30 15:59:48 INFO common.AbstractJob: Command line arguments:
>>> {--dictionary=/input/mahout/vectorized/dictionary.file-0,
>>> --dictionaryType=sequencefile,
>>> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>>> --endPhase=2147483647, --numWords=30,
>>> --output=/home/pd/Mahout/examples/output/clusteranalyze.txt,
>>> --outputFormat=TEXT,
>>> --pointsDir=/output/mahout/kmeans/clusters/clusteredPoints,
>>> --seqFileDir=/output/mahout/kmeans/clusters/clusters-*-final,
>>> --startPhase=0, --tempDir=temp}
>>> *11/12/30 15:59:49 INFO clustering.ClusterDumper: Wrote 1 clusters*
>>> 11/12/30 15:59:49 INFO driver.MahoutDriver: Program took 1171 ms (Minutes: 0.01951666666666667)
>>> pd@PeriyaData:~/Mahout/examples/bin$
>>>
>>>
>>> pd@PeriyaData:~/Mahout/examples/bin$ hadoop fs -ls /output/mahout/kmeans/clusters
>>> Found 2 items
>>> drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusteredPoints
>>> drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final
>>> pd@PeriyaData:~/Mahout/rabi/examples/bin$ hadoop fs -ls /output/mahout/kmeans/clusters/clusters-1-final
>>> Found 3 items
>>> -rw-r--r--   1 pd supergroup          0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/_SUCCESS
>>> drwxr-xr-x   - pd supergroup          0 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/_logs
>>> -rw-r--r--   1 pd supergroup      92991 2011-12-30 15:59 /output/mahout/kmeans/clusters/clusters-1-final/part-r-00000
>>> pd@PeriyaData:~/Mahout/examples/bin$
>>>
>>>
>>> Thanks,
>>> PD
>>>
>>>
>>
>
>
