From: Jeff Eastman
To: dev@mahout.apache.org
Date: Fri, 29 Jun 2012 16:05:36 -0400
Subject: Re: kmeans does not calculate distance from the centroid in 0.7 or 0.8

I just tried removing the normalization step and DisplayKMeans produces exactly the same result. Since the pdfs vector is just an accumulation of pdf values, I think perhaps the normalization isn't necessary. The only gotcha would be if a ClusterClassifier were ever used as an AbstractVectorClassifier, since that API implies normalization (the final value is 1-sum_of_scores). But ClusterClassifier returns the whole vector, so it really doesn't satisfy that API.

Does anybody care?

Can you try that? Does it give you realistic distances now?

E.g., have AbstractClusteringPolicy.classify() end with:

  return pdfs;



On 6/29/12 3:48 PM, Jeff Eastman wrote:
+dev@m.a.o  Let's have this conversation for everybody on the list too

The pdf() of all DistanceMeasureClusters is:

  public double pdf(VectorWritable vw) {
    return 1 / (1 + measure.distance(vw.get(), getCenter()));
  }

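For CosineDistance, the distance falls in 0..1 for non-negative vectors (0..2 in general), so the denominator is distributed on 1..2 and the pdf values should fall on 0.5..1. Aha! If you look at AbstractClusteringPolicy.classify(), what is happening is that the pdf vector is being normalized: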

  public Vector classify(Vector data, ClusterClassifier prior) {
    List<Cluster> models = prior.getModels();
    int i = 0;
    Vector pdfs = new DenseVector(models.size());
    for (Cluster model : models) {
      pdfs.set(i++, model.pdf(new VectorWritable(data)));
    }
    return pdfs.assign(new TimesFunction(), 1.0 / pdfs.zSum());
  }

... and that will surely mess up the reverse distance calculation. Is there a way around this? Let me stew about it some...
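
To make the problem concrete, here is a minimal plain-Java sketch (no Mahout classes) of how dividing by the sum breaks the inverse. The class name, the helper names, and the three distances are made up for illustration; only the pdf formula and the divide-by-sum step come from the code quoted above.

```java
import java.util.Arrays;

// Plain-Java stand-in for the pdf vector; this is not Mahout's API.
class NormalizationSketch {

  // Same formula as the quoted pdf(): 1 / (1 + distance).
  static double pdf(double distance) {
    return 1.0 / (1.0 + distance);
  }

  // Algebraic inverse of pdf(); only meaningful for raw (unnormalized) values.
  static double distanceFromPdf(double pdf) {
    return 1.0 / pdf - 1.0;
  }

  // Mirrors the last line of the quoted classify(): scale so the entries sum to 1.
  static double[] normalize(double[] pdfs) {
    double sum = Arrays.stream(pdfs).sum();
    return Arrays.stream(pdfs).map(p -> p / sum).toArray();
  }

  public static void main(String[] args) {
    // Hypothetical distances from one point to three centroids.
    double[] distances = {0.1, 0.5, 0.9};
    double[] raw = Arrays.stream(distances).map(NormalizationSketch::pdf).toArray();
    double[] normalized = normalize(raw);
    for (int i = 0; i < distances.length; i++) {
      System.out.printf("true d=%.2f  from raw pdf=%.4f  from normalized pdf=%.4f%n",
          distances[i], distanceFromPdf(raw[i]), distanceFromPdf(normalized[i]));
    }
  }
}
```

The raw values invert back to the original distances; the normalized ones do not, because the sum is discarded after the division. Each normalized entry also shrinks roughly in proportion to the number of clusters, which would be consistent with the small weights in the dump if there are a few dozen clusters, though the thread does not state the cluster count.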

Jeff

On 6/29/12 3:27 PM, Pat Ferrel wrote:
Whoa, the 0.7 snapshot message below gave me an idea that I had some old artifacts in the path. Took them out and it IS working.

However, sorry if I'm being dense, but the formula given for pdf is pdf = 1/(1+distance). Unless I messed up my algebra, that means distance = (1/pdf) - 1, which gives values impossible with cosine.

It almost looks like the weights below are 1 - distance, so distance = 1 - pdf?
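
The algebra can be checked against the two weights in the dump that follows. This is a small sketch, assuming cosine distance is computed as 1 - cos(theta) and is therefore at most 2; the class and helper names are hypothetical.

```java
// Checks whether a reported weight could be a raw pdf = 1 / (1 + distance).
class WeightInversionCheck {

  // Algebraic inverse of pdf = 1 / (1 + distance).
  static double distanceFromPdf(double pdf) {
    return 1.0 / pdf - 1.0;
  }

  // A raw pdf can only fall in [1 / (1 + maxDistance), 1].
  static boolean isPlausibleRawPdf(double weight, double maxDistance) {
    return weight >= 1.0 / (1.0 + maxDistance) && weight <= 1.0;
  }

  public static void main(String[] args) {
    // The two weights reported by seqdumper in this message.
    double[] weights = {0.02946182601035338, 0.03323576094647134};
    // Assumption: cosine distance = 1 - cos(theta), so it never exceeds 2.
    double maxCosineDistance = 2.0;
    for (double w : weights) {
      System.out.printf("weight=%.5f  (1/w)-1=%.2f  1-w=%.4f  plausibleRawPdf=%b%n",
          w, distanceFromPdf(w), 1.0 - w, isPlausibleRawPdf(w, maxCosineDistance));
    }
  }
}
```

Both weights invert to distances far above 2, so they cannot be raw pdf values under that assumption. The 1 - pdf reading gives numbers inside the legal range, but that alone does not show it is the right formula.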

Maclaurin:big-data pat$ mahout seqdumper -i b2/kmeans-clusters/clusteredPoints/part-m-00000 | more
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
12/06/29 11:58:55 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[b2/kmeans-clusters/clusteredPoints/part-m-00000], --startPhase=[0], --tempDir=[temp]}
2012-06-29 11:58:55.449 java[27127:1903] Unable to load realm info from SCDynamicStore
Input Path: b2/kmeans-clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 832: Value: 0.02946182601035338: http://farfetchers.com/ = [2223:0.729, 2862:0.501, 3573:0.467]
Key: 819: Value: 0.03323576094647134: http://farfetchers.com/blog = [1:0.034, 9:0.021, 27:0.039, 28:0.026, 31:0.022, 33:0.032, 37:0.034, 38:0.022, 39:0.043, 44:0.029, 49:0.022, 51:0.025, 56:0.024, 60:0.029, 72:0.038, 74:0.020, 81:0.035, 82:0.037, 87:0.041, 89:0.033, 91:0.032, 104:0.034, 107:0.039, 112:0.034, 116:0.043, 121:0.017, 129:0.034, 136:0.035, 147:0.035, 148:0.031, 161:0.035,
On 6/29/12 11:08 AM, Pat Ferrel wrote:
Hmm, still the data in kmeans-clusters/clusteredPoints/part-m-00000 has all weights of 1.0

I checked to make sure the data was created with rebuilt code and that git knew the patched files were changed so the patch was included. I see the code in the IDE but I build with maven skipping tests. I looked through quite a few so can assume all are 1.0.

Maclaurin:big-data pat$ mahout seqdumper -i b2/kmeans-clusters/clusteredPoints/part-m-00000 | more
MAHOUT_LOCAL is set, so we don't add HADOOP_CONF_DIR to classpath.
MAHOUT_LOCAL is set, running locally
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.7-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/mahout-examples-0.8-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-jcl-1.6.6.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/pat/Projects/mahout/examples/target/dependency/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
12/06/29 10:58:16 INFO common.AbstractJob: Command line arguments: {--endPhase=[2147483647], --input=[b2/kmeans-clusters/clusteredPoints/part-m-00000], --startPhase=[0], --tempDir=[temp]}
2012-06-29 10:58:16.587 java[25768:1903] Unable to load realm info from SCDynamicStore
Input Path: b2/kmeans-clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedVectorWritable
Key: 792: Value: 1.0: http://farfetchers.com/ = [2223:0.729, 2862:0.501, 3573:0.467]
Key: 791: Value: 1.0: http://farfetchers.com/blog = [1:0.034, 9:0.021, 27:0.039, 28:0.026, 31:0.022, 33:0.032, 37:0.034, 38:0.022, 39:0.043, 44:0.029, 49:0.022, 51:0.025, 56:0.024, 60:0.029, 72:0.038, 74:0.020, 81:0.035, 82:0.037, 87:0.041, 89:0.033, 91:0.032, 104:0.034, 107:0.039, 112:0.034, 1


On 6/29/12 10:06 AM, Jeff Eastman wrote:
You were correct, the documented weights were not being set. I just uploaded a much smaller patch that fixes that. Please let me know if that works for you.

Jeff

On 6/29/12 12:27 PM, Pat Ferrel wrote:
OK. It's actually in the docs, MiA at least, that it will be 1 or 0 (never 0 in kmeans since the 0 docs are dropped from clusteredPoints).

I mention the patch only because it would be easy enough to put the pdf in the properties there if I knew where to look for it.

On 6/29/12 9:21 AM, Jeff Eastman wrote:
Hmm, let me investigate this.


On 6/29/12 12:01 PM, Pat Ferrel wrote:

What is returned as the weight in the WeightedVectorWritable is pdfPerCluster.maxValue(), which is 1.0 for kmeans and so you cannot calculate the distance from this.

I'd fix this in the patch but I don't know where to find the actual pdf for kmeans, since the one returned is rounded to 1 or 0.
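
A sketch of why a 1-or-0 weight vector carries no distance information, assuming (as described above) that the selected cluster gets 1 and every other cluster gets 0. selectOneHot and maxValue are hypothetical plain-Java helpers for illustration, not Mahout's API.

```java
// Illustrates that the max of a one-hot weight vector is 1.0 whatever the distances were.
class HardAssignmentSketch {

  // Hypothetical hard assignment: 1.0 at the largest pdf, 0.0 elsewhere.
  // Expects a non-empty array.
  static double[] selectOneHot(double[] pdfs) {
    int best = 0;
    for (int i = 1; i < pdfs.length; i++) {
      if (pdfs[i] > pdfs[best]) {
        best = i;
      }
    }
    double[] weights = new double[pdfs.length];
    weights[best] = 1.0;
    return weights;
  }

  // Largest entry of a non-empty array.
  static double maxValue(double[] values) {
    double max = values[0];
    for (double v : values) {
      max = Math.max(max, v);
    }
    return max;
  }

  public static void main(String[] args) {
    // Two made-up points: one close to its best centroid, one far from it.
    double[] nearPoint = {0.95, 0.60, 0.55};
    double[] farPoint = {0.52, 0.51, 0.50};
    System.out.println(maxValue(selectOneHot(nearPoint)));
    System.out.println(maxValue(selectOneHot(farPoint)));
  }
}
```

Both points report the same weight, so the pdf has to be captured before the selection step if a distance is to be recovered from it.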










