mahout-dev mailing list archives

From Pat Ferrel <pat.fer...@gmail.com>
Subject Re: Need dev advice: SSVD - Clustering Pipeline
Date Wed, 12 Sep 2012 19:14:51 GMT
OK, I need to refresh trunk too, so it may take a few minutes.

On Sep 12, 2012, at 11:26 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

OK, I committed a first round at
https://github.com/dlyubimov/mahout-commits/tree/MAHOUT-1067

Could you perhaps test named vector propagation with it? I have not
written any unit tests for named vector propagation yet, and I need to
run now.

Note that the API has changed to accommodate USigma, so you need to set
it in the API and use getUSigmaPath() after completion.

This issue is now tracked through MAHOUT-1067.

-d

On Wed, Sep 12, 2012 at 9:55 AM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
> This is my personally favored solution. I wish NamedVectors were used in
> RowSimilarity too, and may submit a patch for it. If you output
> NamedVectors then they would enable the RowSimilarity patch too.
> 
> If you want someone to do some ad hoc testing with real-world data, I'm
> in. I'll follow your GitHub.
> 
> On Sep 12, 2012, at 9:42 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> 
> I will file and work on a patch for SSVD to propagate named vectors
> (if present). This is trivial. I will also add USigma output. I will
> publish it on my GitHub shortly.
> 
> On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>> There appears to be a gap in the SSVD-->Clustering pipeline. It can be
>> patched in a couple of ways, so can the devs please advise before we
>> make a patch?
>> 
>> The Issues:
>> * There is currently no output from clustering that maps input vectors
>>   to clusters, unless you input NamedVectors to clustering.
>> * SSVD does not output NamedVectors even if they are input.
>> 
>> Solutions:
>> 1. We could modify clustering to write ID-Vector pairs to
>> clusteredPoints/part-xxxx, where the IDs are the keys of the original
>> input vectors and the Vector is the original input VectorWritable. This
>> might be done by replacing WeightedVectorWritable with
>> WeightedPropertyVectorWritable and putting the ID in its properties. This
>> would require a change in the clustering classifier but no change to SSVD
>> or the rest of clustering. It would affect anyone using clusteredPoints,
>> since they would have to deal with a new output vector type (actually,
>> wasn't this file using WeightedPropertyVectorWritable before the Mahout
>> 0.7 refactoring?).
>> 2. We could alter SSVD to output NamedVectors, and Clustering would
>> simply pass them through without modification, as it does today. This
>> would require a change to SSVD but not to Clustering. Since NamedVectors
>> seem to be the only way to perform this mapping now, there would be very
>> little impact on current users.
>> 
>> Afaict one of these has to be done and they are not mutually exclusive. Any advice?
>> 
> 
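
To make option 2 and the USigma output concrete, here is a minimal,
language-neutral sketch in Python. It does not use Mahout's API. `Row`,
`scale_by_sigma`, and `assign_to_nearest` are hypothetical stand-ins
invented for illustration, and the nearest-centroid step is a toy
substitute for a real clustering job. It assumes USigma means each row of U
scaled by the singular values, and it shows only the idea under discussion:
if the name travels with the row through the decomposition, the clustering
output can be mapped back to the original inputs.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for a vector that may carry a name.

    This is NOT Mahout's NamedVector; it only models "values plus optional name".
    """
    values: Tuple[float, ...]
    name: Optional[str] = None


def scale_by_sigma(u_rows: Sequence[Row], sigma: Sequence[float]) -> List[Row]:
    """Compute the rows of U * diag(sigma), carrying each row's name through."""
    out = []
    for row in u_rows:
        if len(row.values) != len(sigma):
            raise ValueError("row width must match the number of singular values")
        scaled = tuple(v * s for v, s in zip(row.values, sigma))
        out.append(Row(scaled, row.name))  # the name is propagated unchanged
    return out


def assign_to_nearest(rows: Sequence[Row],
                      centroids: Sequence[Sequence[float]]) -> Dict[str, int]:
    """Map each named row to the index of its nearest centroid (squared Euclidean)."""
    assignment = {}
    for row in rows:
        if row.name is None:
            continue  # without a name there is nothing to map back to the input
        dists = [sum((a - b) ** 2 for a, b in zip(row.values, c)) for c in centroids]
        assignment[row.name] = dists.index(min(dists))
    return assignment


# Toy data: three named rows of U and two singular values (made up for illustration).
u = [Row((1.0, 0.0), "doc-a"), Row((0.0, 1.0), "doc-b"), Row((0.9, 0.1), "doc-c")]
sigma = [2.0, 1.0]

u_sigma = scale_by_sigma(u, sigma)
clusters = assign_to_nearest(u_sigma, centroids=[(2.0, 0.0), (0.0, 1.0)])
```

If `scale_by_sigma` dropped the names, `assign_to_nearest` would return an
empty mapping, which is the gap described in "The Issues" above.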

