mahout-user mailing list archives

From Jeff Eastman <jeast...@Narus.com>
Subject RE: Analyzing the clusterdump output - kmeans clustering
Date Mon, 01 Aug 2011 16:15:35 GMT


-----Original Message-----
From: Abhik Banerjee [mailto:banerjee.abhik.hcl@gmail.com] 
Sent: Friday, July 29, 2011 3:48 PM
To: user@mahout.apache.org
Subject: Analyzing the clusterdump output - kmeans clustering

Hi,

I managed to run the kmeans algorithm on a Cloudera VM, using the
help provided on the wiki and on the forum. I got my output and
am trying to use clusterdump to analyze my result.

(I asked for 5 iterations, but it seems to have formed only 4
clusters; I am curious about that as well. I ran the command below.)

mahout kmeans -i hdfs://localhost/mahout_input/ip -o
hdfs://localhost/mahout_output/output_kmeans_07_29_1/ -dm
org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 1.0 -c
hdfs://localhost/mahout_input/centroids_07_29_1 -k 5 -x 5 -cl

After kmeans completed on the Hadoop Cloudera VM, I ran this command:

mahout clusterdump --seqFileDir
hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusters-5/part-r-00000
--pointsDir hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusteredPoints
--output kmeans_07_29_1_cl5.tx

When I look at the text file, I see a structure like this:

CL-99871{n=10157 c=[186:12.229, 189:9.343, 212:2.716] r=[186:7.803, 189:8.054, 212:4.686]}
	Weight:  Point:
	1.0: 1.161.199.19 = [186:22.000, 189:32.000]
	1.0: 1.161.204.226 = [186:9.000, 189:11.000]
	1.0: 1.170.149.79 = [186:18.000, 189:10.000]
	1.0: 1.175.137.84 = [186:23.000, 189:8.000]
	1.0: 1.176.27.109 = [186:7.000, 189:9.000, 212:3.000]
	1.0: 1.177.175.26 = [186:12.000, 189:12.000]
	1.0: 1.197.208.25 = [186:26.000]
	1.0: 1.212.176.27 = [186:11.000, 189:1.000]
	1.0: 1.212.176.28 = [186:11.000, 189:6.000]
	1.0: 1.22.160.35 = [186:17.000, 189:6.000]
	1.0: 1.230.123.81 = [186:18.000, 189:4.000]

I can figure out the first part, as explained in the wiki: the name
is CL-99871, the number of points is 10157, the cluster center is
c=[ ] in vector form, and the radius is r=[ ].
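
To make that reading of the header line concrete, here is a minimal Python sketch that splits it into name, n, c and r. It is not part of Mahout; the regular expression and the helper names (parse_sparse, parse_cluster_header) are hypothetical and assume only the layout shown in the dump above.

```python
import re

# Hypothetical pattern, based only on the header line shown in the dump:
# NAME{n=COUNT c=[idx:val, ...] r=[idx:val, ...]}
HEADER = re.compile(
    r"^(?P<name>\S+?)\{n=(?P<n>\d+) c=\[(?P<c>[^\]]*)\] r=\[(?P<r>[^\]]*)\]\}$"
)


def parse_sparse(text):
    """Turn 'idx:val, idx:val' into a dict mapping int index to float value."""
    entries = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        idx, val = item.split(":")
        entries[int(idx)] = float(val)
    return entries


def parse_cluster_header(line):
    """Split one cluster header line into name, point count, center, radius."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError("line does not look like a cluster header: %r" % line)
    return {
        "name": match["name"],
        "n": int(match["n"]),
        "center": parse_sparse(match["c"]),
        "radius": parse_sparse(match["r"]),
    }


header = parse_cluster_header(
    "CL-99871{n=10157 c=[186:12.229, 189:9.343, 212:2.716] "
    "r=[186:7.803, 189:8.054, 212:4.686]}"
)
print(header["name"], header["n"], header["center"], header["radius"])
```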

I don't understand how the later part of it is structured. The IP
addresses are the names of the data points I wanted to cluster, but
what do those vector values mean? If they are the vectors of those
points, I am not sure why they have only 2 dimensions, as my original
data points consisted of 288 dimensions for each IP address.
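
One possible reading, which is an assumption on my part and not something the dump itself states, is that each bracketed list is a sparse vector written as index:value pairs, with zero entries left out. Under that assumption, a point line could be expanded back to the full 288 dimensions with a sketch like the one below. The helper names (parse_point_line, to_dense) are hypothetical, and the indices are assumed to be 0-based.

```python
def parse_point_line(line):
    """Split 'WEIGHT: NAME = [idx:val, ...]' into (weight, name, entries)."""
    left, vector_text = line.strip().split(" = ", 1)
    weight_text, name = left.split(": ", 1)
    entries = {}
    for item in vector_text.strip("[]").split(","):
        item = item.strip()
        if item:
            idx, val = item.split(":")
            entries[int(idx)] = float(val)
    return float(weight_text), name, entries


def to_dense(entries, cardinality=288):
    """Expand sparse entries into a dense list; missing indices become 0.0.

    Assumes 0-based indices and a known cardinality (288 in my data).
    """
    dense = [0.0] * cardinality
    for idx, val in entries.items():
        dense[idx] = val
    return dense


weight, name, entries = parse_point_line(
    "\t1.0: 1.176.27.109 = [186:7.000, 189:9.000, 212:3.000]"
)
dense = to_dense(entries)
print(name, weight, len(dense), sum(1 for v in dense if v != 0.0))
```

If this reading is right, a line such as [186:22.000, 189:32.000] would be a 288-dimensional point with only two non-zero entries, not a 2-dimensional point.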

Thanks for all the help,
Abhik
