mahout-dev mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Need dev advice: SSVD - Clustering Pipeline
Date Thu, 13 Sep 2012 03:59:03 GMT
Yeah, I see. I was under the impression that the general distributed matrix
contract was leaning towards identifying rows by sequence file keys.

The reasoning is that if rows were identified by named vector, the contract
would have had to require named vectors. But it doesn't.

On the other hand, a DRM always requires sequence file keys.

So relying on the named vector contract is not foolproof, as we have
discovered.
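
To make the key-based identification concrete, here is a minimal sketch in
plain Java with no Mahout dependency. KeyedClassifier, nearest and classify
are hypothetical names, and squared Euclidean distance is an assumption chosen
for brevity; this is not the clustering driver's actual code. It only shows
that if the sequence file key travels with each vector, the row-to-cluster
mapping can be recovered without named vectors.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not Mahout code: rows are (key, vector) pairs,
// as in a sequence-file-backed distributed row matrix.
final class KeyedClassifier {

    // Index of the centroid closest to v by squared Euclidean distance.
    static int nearest(double[] v, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0.0;
            for (int i = 0; i < v.length; i++) {
                double diff = v[i] - centroids[c][i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    // Maps each row key to its cluster index, so the key identifies the row
    // even when the vector itself is anonymous.
    static Map<String, Integer> classify(Map<String, double[]> rows,
                                         double[][] centroids) {
        Map<String, Integer> assignment = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> row : rows.entrySet()) {
            assignment.put(row.getKey(), nearest(row.getValue(), centroids));
        }
        return assignment;
    }
}
```

The output is keyed by whatever the input was keyed by, so a caller can join
it back to the original documents without any name stored in the vector.
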
On Sep 12, 2012 8:55 PM, "Pat Ferrel" <pat.ferrel@gmail.com> wrote:

> To be clear, this change only affects classification of the input vectors.
> Everything else in clustering works fine without it. I need to know which
> vectors are in which clusters; that is why I run clustering, for its
> classification function. There will be many who don't care about
> classification.
>
> On Sep 12, 2012, at 8:27 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
>
> Yes, you have output but it is only partly useful.
>
> There are two things created during clustering:
>  * Clusters, which are basically centroids and their vectors.
>  * clusteredPoints, which you get if you ask the driver to classify your
>    input into clusters.
> Both of these are created, even without NamedVectors. The cluster
> centroids are quite alright with non-NamedVectors as input. However, though
> clusteredPoints is created, there is no way to tell which vectors are
> classified into which cluster, since all you get is anonymous weights in
> the vectors. How can you tell which doc was in which cluster?
>
> Creating a new classifier that would attach vector IDs when there is no
> NamedVector is my #1 solution below.
>
> So yes, it still runs and produces clusters, but in my application, and I
> suspect quite a few others, the clusters are only of interest if the input
> is classified into them.
>
> On Sep 12, 2012, at 7:07 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
> I am curious though.
>
> Do you really have no cluster output unless named vectors are used?
>
> It is strange because even if I did not use named vectors, I would
> still expect clusters to form correctly, with the cluster ids
> and points and top terms. So cluster dumper should still produce
> document vectors (even if without original name) and top terms, i.e.
> clustered points should not be empty. After all, I am not obliged to
> follow the text analysis pipeline as in the MIA; I might as well come up
> with my own DRM I would like to find clusters for, and I might not
> have used text labels in that matrix.
>
> On Wed, Sep 12, 2012 at 9:24 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
> > There appears to be a gap in the pipeline SSVD-->Clustering. It can be
> > patched in a couple of ways, so can the devs please advise before we
> > make a patch:
> >
> > The Issues:
> >  * There is currently no output from clustering that maps input vectors
> >    to clusters, unless you input NamedVectors to clustering.
> >  * SSVD does not output NamedVectors even if they are input.
> >
> > Solutions:
> >  1. We could modify clustering to output ID-Vector pairs in the file
> >     clusteredPoints/part-xxxx, where the IDs are the keys of the
> >     original input vectors and the Vector would be the original input
> >     VectorWritable. This might be done by replacing the
> >     WeightedVectorWritable with a WeightedPropertyVectorWritable and
> >     putting the ID in properties. This would require a change in the
> >     clustering classifier but no change to SSVD or the rest of
> >     clustering. This would impact anyone using clusteredPoints, since
> >     they would have to deal with a new output vector type (actually,
> >     wasn't this file using WeightedPropertyVectorWritable before the
> >     Mahout 0.7 refactoring?)
> >  2. We could alter SSVD to output NamedVectors, and Clustering would
> >     simply pass them through without modification as it does today. This
> >     would require a change to SSVD but not to Clustering. Since
> >     NamedVectors seem to be the only way to perform this mapping now,
> >     there would be very little impact on current users.
> >
> > AFAICT one of these has to be done, and they are not mutually exclusive.
> > Any advice?
> >
>
>
>
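
Option 2 in the quoted message could look roughly like the sketch below, again
in plain Java with no Mahout dependency. NameRestorer, NamedRow and
restoreNames are hypothetical names. The sketch assumes that the SSVD output
rows keep the same sequence file keys as the input rows, so that names can be
re-attached by key; rows with no recorded name fall back to the key itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Mahout code.
final class NameRestorer {

    // Stand-in for a named vector: a label plus the vector values.
    static final class NamedRow {
        final String name;
        final double[] values;

        NamedRow(String name, double[] values) {
            this.name = name;
            this.values = values;
        }
    }

    // Re-attaches names to reduced rows by joining on the row key.
    // namesByKey holds the names seen on the input; reducedRows is the
    // (assumed key-preserving) output of the dimensionality reduction.
    static List<NamedRow> restoreNames(Map<String, String> namesByKey,
                                       Map<String, double[]> reducedRows) {
        List<NamedRow> out = new ArrayList<>();
        for (Map.Entry<String, double[]> row : reducedRows.entrySet()) {
            // Fall back to the key when the input row was anonymous.
            String name = namesByKey.getOrDefault(row.getKey(), row.getKey());
            out.add(new NamedRow(name, row.getValue()));
        }
        return out;
    }
}
```

With names restored this way, clustering could pass the vectors through
unchanged, which is the behaviour option 2 relies on.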
