mahout-dev mailing list archives

From "Pat Ferrel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1030) Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable
Date Fri, 24 Jan 2014 18:09:40 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881226#comment-13881226 ]

Pat Ferrel commented on MAHOUT-1030:
------------------------------------

This fixes a very literal reading of the bug. The distance-squared is indeed included in
clusteredPoints, BUT there are no vector ids, so the distance can't actually be used. Without
a vector id in clusteredPoints, Mahout doesn't really perform unsupervised categorization.
I will now have to loop through all vectors, recalculate the distance, and categorize each
one according to the cluster centroid it is closest to.
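
The workaround described above can be sketched roughly as follows. This is a minimal
illustration in plain Python, not Mahout code: the names squared_distance and
assign_to_nearest, the dict-based sparse vectors, and the sample ids and values are all
assumptions made for the example, and squared Euclidean distance stands in for whatever
distance measure the clustering job actually used.

```python
from typing import Dict, Tuple

# Hypothetical representation: a sparse vector is a dict of index -> value.
SparseVector = Dict[int, float]


def squared_distance(a: SparseVector, b: SparseVector) -> float:
    """Squared Euclidean distance between two sparse vectors."""
    keys = a.keys() | b.keys()
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def assign_to_nearest(
    vectors: Dict[int, SparseVector],
    centroids: Dict[int, SparseVector],
) -> Dict[int, Tuple[int, float]]:
    """Map each vector id to (nearest cluster id, squared distance).

    Every vector is compared against every centroid, which is the
    brute-force loop the comment above is trying to avoid.
    """
    assignments = {}
    for vector_id, vec in vectors.items():
        dist_sq, cluster_id = min(
            (squared_distance(vec, centroid), cid)
            for cid, centroid in centroids.items()
        )
        assignments[vector_id] = (cluster_id, dist_sq)
    return assignments


# Made-up sample data: vector ids 101 and 102, cluster ids 39 and 48.
vectors = {101: {0: 1.0, 2: 1.0}, 102: {5: 1.0, 6: 1.0}}
centroids = {39: {0: 1.0, 2: 0.5}, 48: {5: 1.0, 6: 1.0}}
assignments = assign_to_nearest(vectors, centroids)
```

The cost is one distance computation per (vector, centroid) pair, which is why having the
vector id written next to the distance in clusteredPoints would be preferable.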

The clusteredPoints and distance-squared can't actually be used without knowing the vector
id. I think named vectors work here, but many cases, including mine, do not have names, only
Mahout integer ids.

Please correct me if I've missed something.

When I cluster the user preference data used in the Mahout recommender, I get clusteredPoints
something like this. The data from the vector is given, but not its id. The Key here is a
cluster id.

pat$ mahout seqdumper -i /Users/pat/big-data/temp/clusters/clusteredPoints/ | more
Jan 24, 2014 10:02:05 AM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--endPhase=[2147483647], --input=[/Users/pat/big-data/temp/clusters/clusteredPoints/], --startPhase=[0], --tempDir=[temp]}
2014-01-24 10:02:05.707 java[29221:1003] Unable to load realm info from SCDynamicStore
Input Path: file:/Users/pat/big-data/temp/clusters/clusteredPoints/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.classify.WeightedPropertyVectorWritable
Key: 39: Value: wt: 1.0 distance-squared: 9.656875  vec: [0:1.000, 2:1.000, 5:1.000, 9:1.000, 12:1.000, 13:1.000, 17:1.000, 18:1.000, 19:1.000, 20:1.000]
Key: 48: Value: wt: 1.0 distance-squared: 22.229166666666686  vec: [25:1.000, 26:1.000, 27:1.000, 28:1.000, 29:1.000, 30:1.000, 31:1.000, 36:1.000, 38:1.000, 39:1.000, 40:1.000, 41:1.000, 43:1.000, 44:1.000, 46:1.000, 48:1.000, 53:1.000, 54:1.000, 55:1.000, 56:1.000, 57:1.000, 58:1.000, 60:1.000, 63:1.000, 64:1.000, 66:1.000, 67:1.000, 68:1.000, 69:1.000, 70:1.000, 71:1.000, 72:1.000]
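
To make concrete what can and cannot be recovered from a dump like this, here is a small
sketch that parses records of the form shown above and orders each cluster's members by
distance-squared. It assumes each record sits on a single line once e-mail wrapping is
undone; the names RECORD, parse_record, and group_by_cluster and the sample lines are
hypothetical, and the regular expression is inferred only from the output shown here, not
from any documented seqdumper format.

```python
import re
from collections import defaultdict

# Pattern inferred from the dump above (an assumption, not a documented format).
RECORD = re.compile(
    r"Key: (\d+): Value: wt: (\S+) distance-squared: (\S+)\s+vec: \[(.*)\]"
)


def parse_record(line):
    """Return (cluster_id, weight, dist_sq, vector) or None for non-record lines."""
    match = RECORD.match(line.strip())
    if match is None:
        return None
    vector = {}
    for entry in match.group(4).split(","):
        entry = entry.strip()
        if entry:
            index, value = entry.split(":")
            vector[int(index)] = float(value)
    return int(match.group(1)), float(match.group(2)), float(match.group(3)), vector


def group_by_cluster(lines):
    """Group parsed records by cluster id, nearest to the centroid first."""
    clusters = defaultdict(list)
    for line in lines:
        record = parse_record(line)
        if record is not None:
            cluster_id, _weight, dist_sq, vector = record
            clusters[cluster_id].append((dist_sq, vector))
    for members in clusters.values():
        members.sort(key=lambda member: member[0])
    return dict(clusters)


# Illustrative lines only; shortened and partly made up for the example.
sample = [
    "Key: 39: Value: wt: 1.0 distance-squared: 9.656875  vec: [0:1.000, 2:1.000, 5:1.000]",
    "Key: 39: Value: wt: 1.0 distance-squared: 4.5  vec: [0:1.000, 9:1.000]",
    "Input Path: file:/tmp/example",
]
clusters = group_by_cluster(sample)
```

Each parsed record carries a cluster id, a weight, a distance, and the vector contents, but
nothing that identifies which input vector it was, so the ordering can only be tied back to
the original data by matching on vector contents.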


> Regression: Clustered Points Should be WeightedPropertyVectorWritable not WeightedVectorWritable
> ------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1030
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1030
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering, Integration
>    Affects Versions: 0.7
>            Reporter: Jeff Eastman
>            Assignee: Andrew Musselman
>             Fix For: 0.9
>
>         Attachments: MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch,
>                      MAHOUT-1030.patch, MAHOUT-1030.patch, MAHOUT-1030.patch
>
>
> Looks like this won't make it into this build. Pretty widespread impact on code and tests
> and I don't know which properties were implemented in the old version. I will create a JIRA
> and post my interim results.
> On 6/8/12 12:21 PM, Jeff Eastman wrote:
> > That's a reversion that evidently got in when the new ClusterClassificationDriver
> > was introduced. It should be a pretty easy fix and I will see if I can make the change before
> > Paritosh cuts the release bits tonight.
> >
> > On 6/7/12 1:00 PM, Pat Ferrel wrote:
> >> It appears that in kmeans the clusteredPoints are now written as WeightedVectorWritable
> >> where in mahout 0.6 they were WeightedPropertyVectorWritable? This means that the distance
> >> from the centroid is no longer stored here? Why? I hope I'm wrong because that is not a welcome
> >> change. How is one to order clustered docs by distance from cluster centroid?
> >>
> >> I'm sure I could calculate the distance but that would mean looking up the centroid
> >> for the cluster id given in the above WeightedVectorWritable, which means iterating through
> >> all the clusters for each clustered doc. In my case the number of clusters could be fairly
> >> large.
> >>
> >> Am I missing something?
> >>
> >>
> >



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
