mahout-dev mailing list archives

From Shannon Quinn <squ...@gatech.edu>
Subject Re: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans example fails
Date Tue, 24 May 2011 21:09:52 GMT
```You're right, that would give you the affinity matrix. However, the affinity
matrix is an easier beast to tame since the matrix is constructed with all
the points' orders preserved: aff[i][j] is the relationship between
original_point[i] and original_point[j], so for all practical purposes I
treat this as the "original data" (since it's easy to go back and forth
between the two).

Problem is, I'm not sure whether the Lanczos solver or K-means preserves this
ordering of indices. Does the nth point with label y from the result of
K-means correspond to the nth row of the column matrix of eigenvectors? If
so, then does that nth row from the eigenvector matrix also correspond to
the nth original data point (the one represented by proxy by row n and
column n of the affinity matrix)? If both these conditions are true, then
and only then can we say that original_point[n]'s cluster is y.

On Tue, May 24, 2011 at 4:39 PM, Jeff Eastman <jeastman@narus.com> wrote:

> Would that give you the original data matrix, the clustered data matrix, or
> the clustered affinity matrix? Even with the analogy in mind I'm having
> trouble connecting the dots. Seems like I lost the original data matrix in
> step 1 when I used a distance measure to produce A from it. If the returned
> eigenvectors define Q, then what is the significance of QAQ^-1? And, more
> importantly, if the Q eigenvectors define the clusters in eigenspace, what
> is the inverse transformation?
>
> -----Original Message-----
> From: squinn.squinn@gmail.com [mailto:squinn.squinn@gmail.com] On Behalf
> Of Shannon Quinn
> Sent: Tuesday, May 24, 2011 12:07 PM
> To: dev@mahout.apache.org
> Subject: Re: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans example
> fails
>
> That's an excellent analogy! Employing that strategy, would it be possible
> (and not too expensive) to do the QAQ^-1 operation to get the original data
> matrix, after we've clustered the points in eigenspace?
>
> On Tue, May 24, 2011 at 2:59 PM, Jeff Eastman <jeastman@narus.com> wrote:
>
> > For the display example, it is not necessary to cluster the original
> > points. The other clustering display examples only train the clusters and
> > do not classify the points. The points are drawn first and the cluster
> > centers & radii are superimposed afterwards. Thus I think it is only
> > necessary to back-transform the clusters.
> >
> > My EE gut tells me this is like Fourier transforms between time- and
> > frequency-domains. If this is true then what we need is the inverse
> > transform. Is this a correct analogy?
> >
> > -----Original Message-----
> > From: squinn.squinn@gmail.com [mailto:squinn.squinn@gmail.com] On Behalf
> > Of Shannon Quinn
> > Sent: Tuesday, May 24, 2011 11:39 AM
> > To: dev@mahout.apache.org
> > Subject: Re: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans example
> > fails
> >
> > This is actually something I could use a little expert Hadoop assistance
> > on. The general idea is that the points that are clustered in eigenspace
> > have a 1-to-1 correspondence with the original points (which is how you
> > get your cluster assignments), but this back-mapping after clustering
> > isn't explicitly implemented yet, since that's the core of the IO issue.
> >
> > My block on this is my lack of understanding of how the actual ordering
> > of the points changes (or not?) between when they are projected into
> > eigenspace (the Lanczos solver) and when K-means makes its cluster
> > assignments. On a one-node setup the original ordering appears to be
> > preserved through all the operations, so the labels of the original
> > points can be assigned by giving original_point[i] the label of
> > projected_point[i], hence the cluster assignments are easy to determine.
> > For multi-node setups, however, I simply don't know if this heuristic
> > holds.
> >
> > But I believe the immediate issue here is that we're feeding the
> > projected points to the display, when it should be the original points
> > *annotated* with the cluster assignments from the corresponding projected
> > points. The question is how to shift those assignments over robustly;
> > right now it's just a hack job in the SpectralKMeansDriver...or maybe
> > (hopefully!) it's just the version I have locally :o)
> >
> > On Tue, May 24, 2011 at 2:13 PM, Jeff Eastman <jeastman@narus.com> wrote:
> >
> > > Yes, I expect it is pilot error on my part. The original implementation
> > > was failing in this manner because I was requesting 5 eigenvectors
> > > (clusters). I changed it to 2 and now it displays something but it is
> > > not even close to correct. I think this is because I have not
> > > transformed back from eigenspace to vector space. This all relates to
> > > the IO issue for the spectral clustering code which I don't grok.
> > >
> > > The display driver begins with the sample points and generates the
> > > affinity matrix using a distance measure. Not clear this is even a
> > > correct interpretation of that matrix. Then spectral kmeans runs and
> > > produces 2 clusters which I display directly. Seems like this number
> > > should be more like the k in kmeans, and 5 was more realistic given the
> > > data. I believe there is a missing output transformation to recover the
> > > clusters from the eigenvectors but I don't know how to do that.
> > >
> > > I bet you do :)
> > >
> > > -----Original Message-----
> > > From: Shannon Quinn (JIRA) [mailto:jira@apache.org]
> > > Sent: Tuesday, May 24, 2011 8:07 AM
> > > To: dev@mahout.apache.org
> > > Subject: [jira] [Commented] (MAHOUT-524) DisplaySpectralKMeans example
> > > fails
> > >
> > >
> > >    [ https://issues.apache.org/jira/browse/MAHOUT-524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13038608#comment-13038608 ]
> > >
> > > Shannon Quinn commented on MAHOUT-524:
> > > --------------------------------------
> > >
> > > +1, I'm on it.
> > >
> > > I'm a little unclear as to the context of the initial Hudson comment:
> > > the display method is expecting 2D vectors, but getting 5D ones?
> > >
> > > > DisplaySpectralKMeans example fails
> > > > -----------------------------------
> > > >
> > > >                 Key: MAHOUT-524
> > > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-524
> > > >             Project: Mahout
> > > >          Issue Type: Bug
> > > >          Components: Clustering
> > > >    Affects Versions: 0.4, 0.5
> > > >            Reporter: Jeff Eastman
> > > >            Assignee: Jeff Eastman
> > > >              Labels: clustering, k-means, visualization
> > > >             Fix For: 0.6
> > > >
> > > >         Attachments: aff.txt, raw.txt, spectralkmeans.png
> > > >
> > > >
> > > > I've committed a new display example that attempts to push the
> > > > standard mixture of models data set through spectral k-means. After
> > > > some tweaking of configuration arguments and a bug fix in
> > > > EigenCleanupJob it runs spectral k-means to completion. The display
> > > > example is expecting 2-d clustered points and the example is producing
> > > > 5-d points. Additional I/O work is needed before this will play with
> > > > the rest of the clustering algorithms.
> > >
> > > --
> > > This message is automatically generated by JIRA.