mahout-user mailing list archives

From 熊田 聖也 <seiya.kum...@cct-inc.co.jp>
Subject how to interpret the result of the clustering by “mahout kmeans”
Date Tue, 14 Jul 2015 08:20:35 GMT

Glad to meet you.

This is my first question on the Mahout mailing list.


I am currently running a clustering with “mahout kmeans.”

My data is as follows:


@RELATION rfm

@ATTRIBUTE recency NUMERIC

@ATTRIBUTE frequency NUMERIC

@ATTRIBUTE money NUMERIC

@ATTRIBUTE location NUMERIC

@ATTRIBUTE position NUMERIC

@DATA

0.472,0.275,0.099,0.952,0.047,

0.000,0.824,0.936,0.214,0.000,

0.000,0.537,0.656,0.591,0.000,

....

0.908,0.000,0.000,0.078,0.136,

0.134,0.000,0.000,0.781,0.160,

0.302,0.000,0.000,0.513,0.715,

0.472,0.000,0.000,0.749,0.047,


The file is in ARFF format.

Each row is a 5-dimensional vector, and most rows contain zero values.

I converted the ARFF file to the Vector format for use with "mahout kmeans."

The resulting file is as follows:


Key: 0: Value: {0:0.472,1:0.275,2:0.099,3:0.952,4:0.047}

Key: 1: Value: {1:0.824,2:0.936,3:0.214}

Key: 2: Value: {1:0.537,2:0.656,3:0.591}

Key: 3: Value: {1:0.954,2:0.253,3:0.721}

Key: 4: Value: {1:0.187,2:0.735,3:0.782}

Key: 5: Value: {1:0.517,2:0.276,3:0.096}

Key: 6: Value: {1:0.189,2:0.127,3:0.517}

...

Key: 993: Value: {0:0.662,3:0.218,4:0.69}

Key: 994: Value: {0:0.56,3:0.682,4:0.153}

Key: 995: Value: {0:0.788,3:0.929,4:0.967}

Key: 996: Value: {0:0.908,3:0.078,4:0.136}

Key: 997: Value: {0:0.134,3:0.781,4:0.16}

Key: 998: Value: {0:0.302,3:0.513,4:0.715}

Key: 999: Value: {0:0.472,3:0.749,4:0.047}


In the above result, each vector is written in a dictionary format (index:value, with zero entries omitted), e.g.

{0:0.472,1:0.275,2:0.099,3:0.952,4:0.047}.
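
To make the dictionary format concrete, here is a minimal sketch that expands one such value back into a dense 5-dimensional row. The function name `sparse_to_dense` and the `dim` parameter are only illustrative and are not part of Mahout.

```python
def sparse_to_dense(text, dim=5):
    """Expand '{index:value,...}' into a dense list of length dim.

    Indices that do not appear in the text are treated as zero.
    """
    dense = [0.0] * dim
    body = text.strip().strip("{}")
    if body:
        for pair in body.split(","):
            index, value = pair.split(":")
            dense[int(index)] = float(value)
    return dense


# Key 1 from the converted file, which corresponds to the second ARFF row.
print(sparse_to_dense("{1:0.824,2:0.936,3:0.214}"))
# [0.0, 0.824, 0.936, 0.214, 0.0]
```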


Using this file, I ran "mahout kmeans."

(The Mahout version is 0.9.)

After the calculation, I ran “mahout clusterdump”

and got the following result:


VL-648{n=172 c=[0.733, 0.608, 0.563] r=[0.168, 0.221, 0.235]}

VL-677{n=57 c=[0.445, 0.145, 0.839] r=[0.271, 0.099, 0.097]}

VL-429{n=40 c=[0.117, 0.768, 0.674] r=[0.078, 0.156, 0.159]}

VL-801{n=92 c=[0.318, 0.016, 0.007, 0.810, 0.191] r=[0.238, 0.060, 0.023, 0.137, 0.155]}

VL-322{n=55 c=[0.605, 0.872, 0.380] r=[0.217, 0.083, 0.204]}

VL-725{n=88 c=[0.351, 0.559, 0.760] r=[0.197, 0.206, 0.153]}

VL-197{n=176 c=[0.500, 0.482, 0.774] r=[0.264, 0.260, 0.141]}

VL-438{n=159 c=[0.618, 0.351, 0.288] r=[0.215, 0.203, 0.163]}

VL-58{n=54 c=[0.157, 0.515, 0.211] r=[0.102, 0.229, 0.143]}

VL-971{n=117 c=[0.339, 0.014, 0.007, 0.195, 0.282] r=[0.252, 0.052, 0.025, 0.133, 0.192]}


On the other hand, when the same calculation is run with Mahout version 0.7, the result
is as follows:


VL-982{n=82 c=[0.124, 0.120, 0.108, 0.168, 0.150] r=[0.140, 0.177, 0.157, 0.115, 0.168]}

VL-989{n=72 c=[0:0.687, 3:0.185, 4:0.463] r=[0:0.145, 3:0.122, 4:0.207]}

VL-990{n=25 c=[0:0.808, 3:0.868, 4:0.320] r=[0:0.130, 3:0.103, 4:0.158]}

VL-992{n=45 c=[0:0.276, 3:0.821, 4:0.753] r=[0:0.135, 3:0.104, 4:0.165]}

VL-994{n=49 c=[0:0.630, 3:0.618, 4:0.336] r=[0:0.153, 3:0.130, 4:0.146]}

VL-995{n=74 c=[0:0.782, 3:0.673, 4:0.771] r=[0:0.127, 3:0.179, 4:0.136]}

VL-996{n=14 c=[0:0.842, 3:0.142, 4:0.147] r=[0:0.082, 3:0.140, 4:0.115]}

VL-997{n=452 c=[1:0.494, 2:0.521, 3:0.528] r=[1:0.280, 2:0.277, 3:0.275]}

VL-998{n=110 c=[0:0.354, 3:0.304, 4:0.764] r=[0:0.216, 3:0.178, 4:0.142]}

VL-999{n=77 c=[0.232, 0.012, 0.008, 0.732, 0.157] r=[0.169, 0.040, 0.026, 0.170, 0.135]}


In the version 0.7 result, the centroid coordinates are given in the dictionary
format, e.g.

c=[0:0.687, 3:0.185, 4:0.463], which means [0.687, 0, 0, 0.185, 0.463].
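
As a rough sketch of that reading, the following parses one clusterdump line of the form shown above and expands the indexed entries into dense vectors. The helper names `parse_cluster_line` and `expand` are hypothetical, and the regular expression only assumes the line layout visible in this message. Entries printed without an index are returned as they are, with no guess about their positions.

```python
import re


def expand(text, dim):
    """Turn 'i:v, j:w' into a dense list; leave index-free entries untouched."""
    entries = [e.strip() for e in text.split(",") if e.strip()]
    if all(":" in e for e in entries):
        dense = [0.0] * dim
        for entry in entries:
            index, value = entry.split(":")
            dense[int(index)] = float(value)
        return dense
    # No indices were printed, so the positions cannot be recovered here.
    return [float(e) for e in entries]


def parse_cluster_line(line, dim=5):
    """Split 'NAME{n=.. c=[..] r=[..]}' into name, count, centroid, radius."""
    match = re.match(r"(\S+?)\{n=(\d+) c=\[(.*?)\] r=\[(.*?)\]\}", line.strip())
    if match is None:
        raise ValueError("unrecognized clusterdump line")
    name, n, c_text, r_text = match.groups()
    return name, int(n), expand(c_text, dim), expand(r_text, dim)


line = "VL-989{n=72 c=[0:0.687, 3:0.185, 4:0.463] r=[0:0.145, 3:0.122, 4:0.207]}"
name, n, centroid, radius = parse_cluster_line(line)
print(name, n, centroid)
# VL-989 72 [0.687, 0.0, 0.0, 0.185, 0.463]
```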

However, in the version 0.9 result, we cannot determine the centroid coordinates correctly,

because we cannot tell which positions are zero.
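
The ambiguity can be illustrated with a small sketch that lists every way a three-value centroid could be placed into five dimensions. It assumes the printed values keep their ascending index order, which I have not confirmed for version 0.9; the function name `candidate_expansions` is only illustrative.

```python
from itertools import combinations


def candidate_expansions(values, dim=5):
    """List every dense vector that places `values`, in order, into `dim` slots."""
    candidates = []
    for positions in combinations(range(dim), len(values)):
        dense = [0.0] * dim
        for position, value in zip(positions, values):
            dense[position] = value
        candidates.append(dense)
    return candidates


centroid = [0.733, 0.608, 0.563]  # c=[...] of VL-648 in the 0.9 output
candidates = candidate_expansions(centroid)
print(len(candidates))  # 10
# Both sparsity patterns that occur in the input data are possible readings.
print([0.0, 0.733, 0.608, 0.563, 0.0] in candidates)  # True
print([0.733, 0.0, 0.0, 0.608, 0.563] in candidates)  # True
```

Both the {1, 2, 3} pattern and the {0, 3, 4} pattern from the input vectors fit this centroid, so the 0.9 output alone does not tell them apart.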


Could you tell me how to interpret the version 0.9 result?

