# mahout-dev mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: kmeans does not calculate distance from the centroid in 0.7 or 0.8
Date Fri, 29 Jun 2012 19:48:40 GMT
```
+dev@m.a.o  Let's have this conversation for everybody on the list too.

The pdf() of all DistanceMeasureClusters is:

public double pdf(VectorWritable vw) {
  return 1 / (1 + measure.distance(vw.get(), getCenter()));
}

For CosineDistance on non-negative vectors, 1 + distance lies in 1..2, so
the pdf values should be distributed on 0.5..1. Aha! If you look at
AbstractClusteringPolicy.classify(), what is happening is that the pdf
vector is being normalized:

public Vector classify(Vector data, ClusterClassifier prior) {
  List<Cluster> models = prior.getModels();
  int i = 0;
  Vector pdfs = new DenseVector(models.size());
  for (Cluster model : models) {
    pdfs.set(i++, model.pdf(new VectorWritable(data)));
  }
  return pdfs.assign(new TimesFunction(), 1.0 / pdfs.zSum());
}

... and that will surely mess up the reverse distance calculation. Is
there a way around this? Let me stew about it some...

Jeff

On 6/29/12 3:27 PM, Pat Ferrel wrote:
> Whoa, the 0.7 snapshot message below gave me an idea that I had some
> old artifacts in the path. Took them out and it IS working.
>
> However, sorry if I'm being dense, but the formula given for pdf is
> pdf = 1/(1+distance). Unless I messed up my algebra, that means
> distance = (1/pdf) - 1, which gives values impossible with cosine.
>
> It almost looks like the weights below are 1 - distance, so distance =
> 1 - pdf?
>
> Maclaurin:big-data pat$ mahout seqdumper -i
> b2/kmeans-clusters/clusteredPoints/part-m-00000 | more
> MAHOUT_LOCAL is set, running locally
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 12/06/29 11:58:55 INFO common.AbstractJob: Command line arguments:
> {--endPhase=[2147483647],
> --input=[b2/kmeans-clusters/clusteredPoints/part-m-00000],
> --startPhase=[0], --tempDir=[temp]}
> 2012-06-29 11:58:55.449 java[27127:1903] Unable to load realm info
> from SCDynamicStore
> Input Path: b2/kmeans-clusters/clusteredPoints/part-m-00000
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.clustering.classify.WeightedVectorWritable
> Key: 832: Value: 0.02946182601035338: http://farfetchers.com/ =
> [2223:0.729, 2862:0.501, 3573:0.467]
> Key: 819: Value: 0.03323576094647134: http://farfetchers.com/blog =
> [1:0.034, 9:0.021, 27:0.039, 28:0.026, 31:0.022, 33:0.032, 37:0.034,
> 38:0.022, 39:0.043, 44:0.029, 49:0.022, 51:0.025, 56:0.024, 60:0.029,
> 72:0.038, 74:0.020, 81:0.035, 82:0.037, 87:0.041, 89:0.033, 91:0.032,
> 104:0.034, 107:0.039, 112:0.034, 116:0.043, 121:0.017, 129:0.034,
> 136:0.035, 147:0.035, 148:0.031, 161:0.035,
> On 6/29/12 11:08 AM, Pat Ferrel wrote:
>> Hmm, still the data in kmeans-clusters/clusteredPoints/part-m-00000
>> has all weights of 1.0
>>
>> I checked to make sure the data was created with rebuilt code and
>> that git knew the patched files were changed so the patch was
>> included. I see the code in the IDE but I build with maven skipping
>> tests. I looked through quite a few so can assume all are 1.0.
>>
>> Maclaurin:big-data pat$ mahout seqdumper -i
>> b2/kmeans-clusters/clusteredPoints/part-m-00000 | more
>> MAHOUT_LOCAL is set, running locally
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in
>> [jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.7-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>> explanation.
>> 12/06/29 10:58:16 INFO common.AbstractJob: Command line arguments:
>> {--endPhase=[2147483647],
>> --input=[b2/kmeans-clusters/clusteredPoints/part-m-00000],
>> --startPhase=[0], --tempDir=[temp]}
>> 2012-06-29 10:58:16.587 java[25768:1903] Unable to load realm info
>> from SCDynamicStore
>> Input Path: b2/kmeans-clusters/clusteredPoints/part-m-00000
>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
>> org.apache.mahout.clustering.classify.WeightedVectorWritable
>> Key: 792: Value: 1.0: http://farfetchers.com/ = [2223:0.729,
>> 2862:0.501, 3573:0.467]
>> Key: 791: Value: 1.0: http://farfetchers.com/blog = [1:0.034,
>> 9:0.021, 27:0.039, 28:0.026, 31:0.022, 33:0.032, 37:0.034, 38:0.022,
>> 39:0.043, 44:0.029, 49:0.022, 51:0.025, 56:0.024, 60:0.029, 72:0.038,
>> 74:0.020, 81:0.035, 82:0.037, 87:0.041, 89:0.033, 91:0.032,
>> 104:0.034, 107:0.039, 112:0.034, 1
>>
>>
>> On 6/29/12 10:06 AM, Jeff Eastman wrote:
>>> You were correct, the documented weights were not being set. I just
>>> uploaded a much smaller patch that fixes that. Please let me know if
>>> that works for you.
>>>
>>> Jeff
>>>
>>> On 6/29/12 12:27 PM, Pat Ferrel wrote:
>>>> OK. It's actually in the docs, MiA at least, that it will be 1 or 0
>>>> (never 0 in kmeans since the 0 docs are dropped from clusteredPoints).
>>>>
>>>> I mention the patch only because it would be easy enough to put the
>>>> pdf in the properties there if I knew where to look for it.
>>>>
>>>> On 6/29/12 9:21 AM, Jeff Eastman wrote:
>>>>> Hmm, let me investigate this.
>>>>>
>>>>>
>>>>> On 6/29/12 12:01 PM, Pat Ferrel wrote:
>>>>>>
>>>>>> What is returned as the weight in the WeightedVectorWritable is
>>>>>> pdfPerCluster.maxValue(), which is 1.0 for kmeans and so you
>>>>>> cannot calculate the distance from this.
>>>>>>
>>>>>> I'd fix this in the patch but I don't know where to find the
>>>>>> actual pdf for kmeans since the one returned is rounded to 1 or
>>>>>> 0.
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
>

```
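The effect discussed in the thread can be reproduced with a few lines of plain Java. This is only a sketch under stated assumptions: it uses no Mahout classes, the class name `PdfNormalizationSketch` and the 30 distances are hypothetical, and `normalize` merely imitates the division by `zSum()` in the quoted `classify()`. It shows that `distance = 1/pdf - 1` recovers the distance from the raw pdf, but yields a value far outside 0..1 once the pdf vector has been scaled to sum to 1, which is consistent with the weight of about 0.0295 in the dump implying a "distance" of roughly 33.

```java
import java.util.Arrays;

/** Plain-Java sketch of the pdf normalization issue; no Mahout classes are used. */
final class PdfNormalizationSketch {

  /** Same formula as the quoted pdf(): 1 / (1 + distance). */
  static double pdf(double distance) {
    return 1.0 / (1.0 + distance);
  }

  /** Algebraic inverse of pdf(): distance = 1/pdf - 1. Only valid for an unnormalized pdf. */
  static double distanceFromPdf(double pdf) {
    return 1.0 / pdf - 1.0;
  }

  /** Imitates the effect of classify(): scale the pdfs so that they sum to 1. */
  static double[] normalize(double[] pdfs) {
    double sum = Arrays.stream(pdfs).sum();
    return Arrays.stream(pdfs).map(p -> p / sum).toArray();
  }

  public static void main(String[] args) {
    // Hypothetical distances from one point to 30 clusters, all within 0..1.
    double[] distances = new double[30];
    Arrays.fill(distances, 0.9);
    distances[0] = 0.1; // the closest cluster

    double[] raw = Arrays.stream(distances).map(PdfNormalizationSketch::pdf).toArray();
    double[] normalized = normalize(raw);

    System.out.println("raw pdf of closest cluster:         " + raw[0]);
    System.out.println("normalized pdf of closest cluster:  " + normalized[0]);
    System.out.println("distance recovered from raw:        " + distanceFromPdf(raw[0]));
    System.out.println("distance recovered from normalized: " + distanceFromPdf(normalized[0]));

    // Undoing the normalization needs the sum of the raw pdfs as well,
    // which the stored weight alone does not carry.
    double rawSum = Arrays.stream(raw).sum();
    System.out.println("recovered using the raw sum:        "
        + distanceFromPdf(normalized[0] * rawSum));
  }
}
```

The last line of `main` hints at one possible way around the problem: if the sum of the raw pdfs (or the raw pdf itself) were kept alongside the normalized weight, the distance could be recovered. Whether Mahout should store that is a design question the thread leaves open.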