mahout-dev mailing list archives

From Jake Mannix <>
Subject Re: Selectively discarding EigenVerification results and clustering assignments
Date Thu, 24 Jun 2010 17:35:35 GMT
Hey Shannon,

  I don't think that the EigenVerificationJob *modifies* any SequenceFiles -
that's a big no-no in Hadoop-land (data is write-once).  The output path for
the cleaned eigenvectors is "${mapred.output.dir}/largestCleanEigens/" -
look in EigenVerificationJob.saveCleanEigens().  It will give you as many
cleaned eigenvectors as it can get out of the ones that you gave it (i.e.,
every eigenvector which has error less than maxError and eigenvalue greater
than minEigenvalue will be kept).
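
  To make the keep rule above concrete, here is a minimal sketch in plain
Java. It is not the actual EigenVerificationJob code: the class
EigenFilterSketch, the Eigen holder, and the keepClean method are
hypothetical names for illustration, and only maxError and minEigenvalue
come from the job itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Mahout.
final class EigenFilterSketch {

    // Minimal stand-in for a verified eigenvector: its row index,
    // its eigenvalue, and the error measured during verification.
    static final class Eigen {
        final int index;
        final double eigenvalue;
        final double error;

        Eigen(int index, double eigenvalue, double error) {
            this.index = index;
            this.eigenvalue = eigenvalue;
            this.error = error;
        }
    }

    // Keep every eigenvector whose error is below maxError and whose
    // eigenvalue is above minEigenvalue. The input list is not modified;
    // a new list is returned, mirroring the write-once style.
    static List<Eigen> keepClean(List<Eigen> candidates,
                                 double maxError,
                                 double minEigenvalue) {
        List<Eigen> clean = new ArrayList<>();
        for (Eigen e : candidates) {
            if (e.error < maxError && e.eigenvalue > minEigenvalue) {
                clean.add(e);
            }
        }
        return clean;
    }
}
```

  Note that the number of survivors depends only on the two thresholds, so
it can be larger or smaller than the rank you actually want.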

  If you wanted to add a parameter to that job, "maxEigensToKeep", which
would prune off the eigenvectors with the smallest eigenvalues from the
remaining cleaned set and keep only that many, it would be a nice addition.
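
  A rough sketch of what that pruning step might look like, in plain Java.
This is an assumption about one way to do it, not existing Mahout code:
maxEigensToKeep is only the proposed parameter, and PruneEigensSketch,
CleanEigen, and keepLargest are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of Mahout.
final class PruneEigensSketch {

    // Minimal stand-in for an eigenvector that already passed verification.
    static final class CleanEigen {
        final int index;
        final double eigenvalue;

        CleanEigen(int index, double eigenvalue) {
            this.index = index;
            this.eigenvalue = eigenvalue;
        }
    }

    // Return at most maxEigensToKeep entries, dropping those with the
    // smallest eigenvalues. The result is ordered by descending eigenvalue
    // and the input list is left untouched.
    static List<CleanEigen> keepLargest(List<CleanEigen> cleaned,
                                        int maxEigensToKeep) {
        if (maxEigensToKeep < 0) {
            throw new IllegalArgumentException("maxEigensToKeep must be >= 0");
        }
        List<CleanEigen> sorted = new ArrayList<>(cleaned);
        sorted.sort(Comparator
                .comparingDouble((CleanEigen e) -> e.eigenvalue)
                .reversed());
        int n = Math.min(maxEigensToKeep, sorted.size());
        return new ArrayList<>(sorted.subList(0, n));
    }
}
```

  With something like this in place, you could ask the solver for 1.2-1.5
times the rank you want, verify, and then prune down to exactly the number
you need.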

  I'm not exactly sure what you're asking about the cluster dumping...


On Thu, Jun 24, 2010 at 5:03 PM, Shannon Quinn <> wrote:

> Hi all,
> Hopefully these two questions will be my last, at least until my next
> sprint... :)
> I've run the EigenVerification task, and from what I can tell it modifies
> the SequenceFiles themselves that contain the results of the LanczosSolver.
> My first question is fairly straightforward: since I need to do as Jake
> suggested earlier - set my desiredRank for the LanczosSolver as 1.2-1.5
> times what I actually want, then discard the highest-order eigenvectors down
> to exactly desiredRank - how do I actually perform the discard of the extra
> rows in the SequenceFiles? I tried making a DistributedRowMatrix out of the
> results and hard-setting the number of rows, but all the rows written by the
> LanczosSolver showed up.
> Part of this spectral clustering is to use the components of the
> eigenvectors as proxies for the real data, so after I've performed k-means
> clustering, I need to be able to read the cluster assignments
> programmatically, and transfer those assignments back to the original data.
> I know of the clusterdump tool, but to be honest I'm having trouble
> interpreting its output, plus I'm unsure of how I would output the cluster
> assignments from my program. It would seem, for compatibility purposes, that
> the format of clusterdump would be ideal, but I'm not sure how to do this
> when I'm proxying the cluster assignments. Any thoughts on this would be
> wonderful.
> Thank you!
> Shannon
