Date: Thu, 28 Jun 2012 23:16:44 +0000 (UTC)
From: "Pat Ferrel (JIRA)"
To: dev@mahout.apache.org
Message-ID: <847011859.69593.1340925404187.JavaMail.jiratomcat@issues-vm>
In-Reply-To: <1766791136.451.1339258183466.JavaMail.jiratomcat@issues-vm>
Subject: [jira] [Commented] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable

    [ https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403570#comment-13403570 ]

Pat Ferrel commented on MAHOUT-1030:
------------------------------------

Jeff said: "It is trivial to back-calculate the distance from the pdf value that is already written in the WeightedVectorWritable."

Hmm, I don't see how the pdf is included in the WeightedVectorWritable, at least for kmeans. The weight stored there is the probability that the point belongs to the cluster, which for kmeans is 1 or 0, right? clusteredPoints/part-m-00000 is made up of IntWritable clusterIDs and WeightedVectorWritable values, so how do you find a non-binary pdf for a specific doc?

> Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1030
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1030
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering, Integration
>   Affects Versions: 0.7
>            Reporter: Jeff Eastman
>            Assignee: Jeff Eastman
>             Fix For: 0.8
>
>         Attachments: MAHOUT-1030.patch
>
>
> Looks like this won't make it into this build. Pretty widespread impact on code and tests and I don't know which properties were implemented in the old version. I will create a JIRA and post my interim results.
> On 6/8/12 12:21 PM, Jeff Eastman wrote:
> > That's a reversion that evidently got in when the new ClusterClassificationDriver was introduced. It should be a pretty easy fix and I will see if I can make the change before Paritosh cuts the release bits tonight.
> >
> > On 6/7/12 1:00 PM, Pat Ferrel wrote:
> >> It appears that in kmeans the clusteredPoints are now written as WeightedVectorWritable where in mahout 0.6 they were WeightedPropertyVectorWritable? This means that the distance from the centroid is no longer stored here? Why? I hope I'm wrong because that is not a welcome change. How is one to order clustered docs by distance from cluster centroid?
> >>
> >> I'm sure I could calculate the distance but that would mean looking up the centroid for the cluster id given in the above WeightedVectorWritable, which means iterating through all the clusters for each clustered doc. In my case the number of clusters could be fairly large.
> >>
> >> Am I missing something?
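
For illustration only, here is a minimal sketch of the workaround described in the quoted message: recomputing each point's distance from its centroid when that distance is not stored with the clustered point. It assumes the centroids can be loaded once into a map keyed by cluster ID, so that each clustered doc costs one hash lookup instead of a pass over all clusters. The class `CentroidDistance`, its nested `Point` type, and both methods are hypothetical names and not Mahout APIs. Plain `double[]` arrays stand in for Mahout vectors, and Euclidean distance is an assumption; the measure the clustering actually ran with should be substituted.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; none of these names are Mahout APIs.
final class CentroidDistance {

    // One clustered point: the assigned cluster ID, a document label, and its vector.
    static final class Point {
        final int clusterId;
        final String label;
        final double[] vector;

        Point(int clusterId, String label, double[] vector) {
            this.clusterId = clusterId;
            this.label = label;
            this.vector = vector;
        }
    }

    // Euclidean distance (an assumed measure, chosen only to keep the sketch short).
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the labels of the points assigned to one cluster, nearest to the centroid first.
    static List<String> labelsByDistance(Map<Integer, double[]> centroids,
                                         List<Point> points,
                                         int clusterId) {
        // Single map lookup for the centroid; no scan over all clusters.
        final double[] centroid = centroids.get(clusterId);
        if (centroid == null) {
            throw new IllegalArgumentException("unknown cluster " + clusterId);
        }
        List<Point> members = new ArrayList<>();
        for (Point p : points) {
            if (p.clusterId == clusterId) {
                members.add(p);
            }
        }
        // Order the cluster's members by their recomputed distance to the centroid.
        members.sort(Comparator.comparingDouble((Point p) -> euclidean(p.vector, centroid)));
        List<String> labels = new ArrayList<>();
        for (Point p : members) {
            labels.add(p.label);
        }
        return labels;
    }
}
```

With the centroids held in a `HashMap`, the per-doc cost no longer grows with the number of clusters, though every distance still has to be recomputed because it is not stored alongside the point.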