mahout-dev mailing list archives

From Jeff Eastman <j...@windwardsolutions.com>
Subject Re: kmeans does not calculate distance from the centroid in 0.7 or 0.8
Date Fri, 29 Jun 2012 20:05:36 GMT
I just tried removing the normalization step and DisplayKMeans produces 
exactly the same result. Since the pdfs vector is just an accumulation 
of pdf values I think perhaps the normalization isn't necessary. The 
only gotcha would be if a ClusterClassifier were ever used as an 
AbstractVectorClassifier, since that API implies normalization (the 
final value is 1-sum_of_scores). But ClusterClassifier returns the whole 
vector so it really doesn't satisfy that API.

Does anybody care?

Can you try that? Does it give you realistic distances now?

e.g. return pdfs;
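
To make the effect of the normalization concrete, here is a minimal standalone sketch in plain Java. It does not use the Mahout classes: the class name PdfDistanceSketch, its helper methods, and the sample distances are all hypothetical and chosen only for illustration. It assumes the pdf has the form 1 / (1 + distance) quoted below, and compares recovering a distance from the raw pdf with recovering it from the pdf after dividing by the sum.

```java
// Hypothetical illustration only; not part of Mahout.
final class PdfDistanceSketch {

    // Same form as the pdf() quoted in this thread: 1 / (1 + distance).
    static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // Algebraic inverse of pdf(): distance = (1 / pdf) - 1.
    static double distanceFromPdf(double pdf) {
        return 1.0 / pdf - 1.0;
    }

    // Unnormalized scores, i.e. the "return pdfs;" variant.
    static double[] rawPdfs(double[] distances) {
        double[] pdfs = new double[distances.length];
        for (int i = 0; i < distances.length; i++) {
            pdfs[i] = pdf(distances[i]);
        }
        return pdfs;
    }

    // Normalized scores: every pdf divided by the sum of all pdfs.
    static double[] normalizedPdfs(double[] distances) {
        double[] pdfs = rawPdfs(distances);
        double sum = 0.0;
        for (double p : pdfs) {
            sum += p;
        }
        for (int i = 0; i < pdfs.length; i++) {
            pdfs[i] /= sum;
        }
        return pdfs;
    }

    // Index of the largest score, i.e. the cluster that would be chosen.
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up distances from one point to three cluster centers.
        double[] distances = {0.25, 1.0, 3.0};
        double[] raw = rawPdfs(distances);
        double[] norm = normalizedPdfs(distances);

        System.out.println("chosen cluster (raw):        " + argMax(raw));
        System.out.println("chosen cluster (normalized): " + argMax(norm));
        System.out.println("distance from raw pdf:        " + distanceFromPdf(raw[0]));
        System.out.println("distance from normalized pdf: " + distanceFromPdf(norm[0]));
    }
}
```

Dividing every score by the same positive sum cannot change which score is largest, which would be consistent with DisplayKMeans looking the same either way. The inverse formula, however, only returns the original distance when it is applied to the raw pdf.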



On 6/29/12 3:48 PM, Jeff Eastman wrote:
> +dev@m.a.o  Let's have this conversation for everybody on the list too
>
> The pdf() of all DistanceMeasureClusters is:
>
>   public double pdf(VectorWritable vw) {
>     return 1 / (1 + measure.distance(vw.get(), getCenter()));
>   }
>
> for CosineDistance, the pdf values should be distributed on 1..2. Aha! 
> if you look at AbstractClusteringPolicy.classify() what is happening 
> is the pdf vector is being normalized:
>
>   public Vector classify(Vector data, ClusterClassifier prior) {
>     List<Cluster> models = prior.getModels();
>     int i = 0;
>     Vector pdfs = new DenseVector(models.size());
>     for (Cluster model : models) {
>       pdfs.set(i++, model.pdf(new VectorWritable(data)));
>     }
>     return pdfs.assign(new TimesFunction(), 1.0 / pdfs.zSum());
>   }
>
> ... and that will surely mess up the reverse distance calculation. Is 
> there a way around this? Let me stew about it some...
>
> Jeff
>
> On 6/29/12 3:27 PM, Pat Ferrel wrote:
>> Whoa, the 0.7 snapshot message below gave me an idea that I had some 
>> old artifacts in the path. Took them out and it IS working.
>>
>> However, sorry if I'm being dense, but the formula given for pdf is 
>> pdf = 1/(1+distance). Unless I messed up my algebra, that means 
>> distance = (1/pdf) - 1, which gives values impossible with cosine.
>>
>> It almost looks like the weights below are 1-distance, so distance = 
>> 1-pdf?
>>
>> Maclaurin:big-data pat$ mahout seqdumper -i 
>> b2/kmeans-clusters/clusteredPoints/part-m-00000 | more
>> MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
>> MAHOUT_LOCAL is set, running locally
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in 
>> [jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in 
>> [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in 
>> [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in 
>> [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
>> explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> 12/06/29 11:58:55 INFO common.AbstractJob: Command line arguments: 
>> {--endPhase=[2147483647], 
>> --input=[b2/kmeans-clusters/clusteredPoints/part-m-00000], 
>> --startPhase=[0], --tempDir=[temp]}
>> 2012-06-29 11:58:55.449 java[27127:1903] Unable to load realm info 
>> from SCDynamicStore
>> Input Path: b2/kmeans-clusters/clusteredPoints/part-m-00000
>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
>> org.apache.mahout.clustering.classify.WeightedVectorWritable
>> Key: 832: Value: 0.02946182601035338: http://farfetchers.com/ = 
>> [2223:0.729, 2862:0.501, 3573:0.467]
>> Key: 819: Value: 0.03323576094647134: http://farfetchers.com/blog = 
>> [1:0.034, 9:0.021, 27:0.039, 28:0.026, 31:0.022, 33:0.032, 37:0.034, 
>> 38:0.022, 39:0.043, 44:0.029, 49:0.022, 51:0.025, 56:0.024, 60:0.029, 
>> 72:0.038, 74:0.020, 81:0.035, 82:0.037, 87:0.041, 89:0.033, 91:0.032, 
>> 104:0.034, 107:0.039, 112:0.034, 116:0.043, 121:0.017, 129:0.034, 
>> 136:0.035, 147:0.035, 148:0.031, 161:0.035,
>>
>> On 6/29/12 11:08 AM, Pat Ferrel wrote:
>>> Hmm, still the data in kmeans-clusters/clusteredPoints/part-m-00000 
>>> has all weights of 1.0
>>>
>>> I checked to make sure the data was created with rebuilt code and 
>>> that git knew the patched files were changed so the patch was 
>>> included. I see the code in the IDE but I build with maven skipping 
>>> tests. I looked through quite a few so can assume all are 1.0.
>>>
>>> Maclaurin:big-data pat$ mahout seqdumper -i 
>>> b2/kmeans-clusters/clusteredPoints/part-m-00000 | more
>>> MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
>>> MAHOUT_LOCAL is set, running locally
>>> SLF4J: Class path contains multiple SLF4J bindings.
>>> SLF4J: Found binding in 
>>> [jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.7-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in 
>>> [jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in 
>>> [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in 
>>> [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: Found binding in 
>>> [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
>>> explanation.
>>> 12/06/29 10:58:16 INFO common.AbstractJob: Command line arguments: 
>>> {--endPhase=[2147483647], 
>>> --input=[b2/kmeans-clusters/clusteredPoints/part-m-00000], 
>>> --startPhase=[0], --tempDir=[temp]}
>>> 2012-06-29 10:58:16.587 java[25768:1903] Unable to load realm info 
>>> from SCDynamicStore
>>> Input Path: b2/kmeans-clusters/clusteredPoints/part-m-00000
>>> Key class: class org.apache.hadoop.io.IntWritable Value Class: class 
>>> org.apache.mahout.clustering.classify.WeightedVectorWritable
>>> Key: 792: Value: 1.0: http://farfetchers.com/ = [2223:0.729, 
>>> 2862:0.501, 3573:0.467]
>>> Key: 791: Value: 1.0: http://farfetchers.com/blog = [1:0.034, 
>>> 9:0.021, 27:0.039, 28:0.026, 31:0.022, 33:0.032, 37:0.034, 38:0.022, 
>>> 39:0.043, 44:0.029, 49:0.022, 51:0.025, 56:0.024, 60:0.029, 
>>> 72:0.038, 74:0.020, 81:0.035, 82:0.037, 87:0.041, 89:0.033, 
>>> 91:0.032, 104:0.034, 107:0.039, 112:0.034, 1
>>>
>>>
>>> On 6/29/12 10:06 AM, Jeff Eastman wrote:
>>>> You were correct, the documented weights were not being set. I just 
>>>> uploaded a much smaller patch that fixes that. Please let me know 
>>>> if that works for you.
>>>>
>>>> Jeff
>>>>
>>>> On 6/29/12 12:27 PM, Pat Ferrel wrote:
>>>>> OK. It's actually in the docs, MiA at least, that it will be 1 or 
>>>>> 0 (never 0 in kmeans since the 0 docs are dropped from 
>>>>> clusteredPoints).
>>>>>
>>>>> I mention the patch only because it would be easy enough to put 
>>>>> the pdf in the properties there if I knew where to look for it.
>>>>>
>>>>> On 6/29/12 9:21 AM, Jeff Eastman wrote:
>>>>>> Hmm, let me investigate this.
>>>>>>
>>>>>>
>>>>>> On 6/29/12 12:01 PM, Pat Ferrel wrote:
>>>>>>>
>>>>>>> What is returned as the weight in the WeightedVectorWritable is 
>>>>>>> pdfPerCluster.maxValue(), which is 1.0 for kmeans and so you 
>>>>>>> cannot calculate the distance from this.
>>>>>>>
>>>>>>> I'd fix this in the patch but I don't know where to find the 
>>>>>>> actual pdf for kmeans since the one returned is rounded to 1 
>>>>>>> or 0.
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>

