From: Jake Mannix
Date: Wed, 9 Jun 2010 12:02:48 -0700
Subject: Re: CLI input formats and calling other jobs
To: dev@mahout.apache.org

Hi again Shannon,

On Tue, Jun 8, 2010 at 10:31 PM, Shannon Quinn wrote:

> I'm working on the spectral clustering algorithm, and inherent to that
> strategy is the fact that the algorithm works on pairwise data point
> comparisons, not the raw data itself. Put another way, if I want to
> implement a clustering algorithm for Photoshop CS6's magic wand, the
> pixel intensities constitute the raw data, but what my algorithm actually
> works on is a pairwise comparison of all the pixels (often a normalized
> form of Euclidean distance with some KNN sprinkled in).

Yeah, the fact that you are really looking at a graph (aka matrix) in a
more Markovian form makes this interesting, and I wonder whether there's
something creative we can do other than just computing all the
similarities. But for now, that makes sense as a starting point.

> This preprocessing step - where n-dimensional data is converted to
> matrix-form pairwise comparisons, or "similarities" - there are a lot of
> options for how to implement this.
> My current strategy is for the algorithm to accept data that has already
> been processed into similarity matrix form, and thereby leave the
> preprocessing to the user, but also to provide an example using images
> and a script that converts the raw data to similarity scores. Would this
> be a good setup? Any other suggestions? (I'd thought of somehow adding a
> plugin architecture to allow for different implementations of this
> similarity preprocessing, but I'm not sure where that would mesh with the
> overall Mahout architecture, or that there would even be time this summer
> for it.)

I think in the long run, having the ability to take either symmetric input
(assume that it is already the matrix of similarities) or the "raw" input
would be nice. For now, whatever is easiest for you to work with should be
fine.

> Also, input data types: I've been using LDA and k-means as templates for
> my work, and they make heavy use of SequenceFiles as their input format.
> However, I'm unsure of how to further manipulate the data after using
> seq2sparse on a CSV input file representing an NxN symmetric similarity
> matrix, since k-means and LDA operate on text in all the available
> examples.

Sticking with SequenceFile is the way to go, with the Writable type
dictated by what it is you're doing: when they're vectors, use
VectorWritable, and when you just need to pass around some coefficients,
you can send IntWritable or your own custom Writable.

> My other questions revolve around clarifications on the codebase.
> Right now I'm implementing a canonical k-means spectral clustering
> algorithm; if you want greater detail on the theory, you can read the
> post on my blog (
> http://spectrallyclustered.wordpress.com/2010/06/05/sprint-1-k-means-spectral-clustering/
> ), but the basic steps I'll reproduce here:
>
> 1) Convert raw data to similarity, or "affinity", scores -> matrix A
> (done internally)
>
> 2) Build a diagonal matrix D of vertex degrees (basically the sums of the
> rows of A). Is there a more efficient way to do this step than the
> following?
>
>   Matrix D = new SparseMatrix(cardinality);
>   for (int i = 0; i < A.numRows(); i++) {
>     D.set(i, i, A.getRow(i).zSum());
>   }

So your matrix A is waaaaay too big to just fit in memory, right? So this
code won't simply work on Big (or even Medium) Data. You need to write a
MapReduce job which takes your SequenceFile input and does what your inner
loop effectively does. You can probably have a single Reducer which takes
in all of the outputs of your Map job and builds up a big Vector of
results. You don't need a SparseMatrix: because the matrix is diagonal, it
can be represented as a (Dense)Vector.

> 3) Construct a "normalized" Laplacian via the formula
> L = D^(-1/2) A D^(-1/2). I have this:
>
>   Matrix L = D.times(A.times(D));

Again, since A is a DistributedRowMatrix, there is a better way to compute
this. If you look at what happens when you pre-multiply a matrix A by a
diagonal matrix D, you're taking row i of A and multiplying it by d_ii;
when you post-multiply A by D, you're taking the j'th *column* of A and
multiplying it by d_jj. End result:

  L_ij = d_i * a_ij * d_j

where in your case d_i = 1/sqrt(D_ii). Since D is just a single
DenseVector, in the way we were representing it above, it can be held in
memory in all of the Mappers, and so a single MapReduce job can compute
the matrix L from A and D.
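The arithmetic of steps 2 and 3 can be sketched in-memory with plain Java
arrays. This is a toy illustration only - the class and method names are
invented, and the real job would do the same per-element work inside
Mappers with the degree vector held in memory, as described above:

```java
// Toy sketch of steps 2 and 3 (invented names, plain arrays, no Hadoop):
// degreeVector() is the diagonal of D; normalizedLaplacian() computes
// L_ij = a_ij / (sqrt(d_i) * sqrt(d_j)), i.e. L = D^(-1/2) A D^(-1/2).
public class LaplacianSketch {

  // Step 2: the "diagonal matrix D" is just the vector of row sums of A.
  static double[] degreeVector(double[][] a) {
    double[] d = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      for (double v : a[i]) {
        d[i] += v;                       // same idea as A.getRow(i).zSum()
      }
    }
    return d;
  }

  // Step 3: scale entry (i, j) by 1/sqrt(d_i) on the left and 1/sqrt(d_j)
  // on the right - no matrix-matrix multiply needed.
  static double[][] normalizedLaplacian(double[][] a, double[] d) {
    int n = a.length;
    double[][] l = new double[n][n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        l[i][j] = a[i][j] / (Math.sqrt(d[i]) * Math.sqrt(d[j]));
      }
    }
    return l;
  }

  public static void main(String[] args) {
    double[][] a = {{0, 1, 1}, {1, 0, 0}, {1, 0, 0}};  // tiny affinity matrix
    double[] d = degreeVector(a);                       // d = [2, 1, 1]
    double[][] l = normalizedLaplacian(a, d);
    System.out.println(l[0][1]);                        // 1/sqrt(2)
  }
}
```

In the MapReduce version each Mapper sees one row a_i of A plus the
in-memory degree vector, so it can emit the whole corresponding row of L
in a single pass.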
> Since D is a diagonal matrix, the exponent operation is simply raising
> each individual matrix element to the -0.5 power; I assume this involves
> implementing a new UnaryFunction? Or does a method already exist (aside
> from manually looping over the matrix diagonal)?

Look at Functions.java for all of your UnaryFunction needs:

  import static org.apache.mahout.math.function.Functions.*;

  Vector d_inv_sqrt = d.assign(chain(sqrt, inv));

will do 1/sqrt() of all elements in a vector.

> 4) Perform eigen-decomposition on L. This is where
> DistributedLanczosSolver comes in, but I'm not sure how to fire off the
> job from within another job, since the run() method requires command-line
> parameters.

Jeff is right - this should be refactored so that you can call it from
inside of Java. It can be called that way, really, but you need to do some
setup. The meat of the algorithm is just the solve() method, so if you're
already set up with a (configure()'ed!) DistributedRowMatrix A which you
want to decompose, then create a Matrix to store the eigenvectors, create
a List to store the eigenvalues, and do like Jeff said - just call:

  solver.solve(A, desiredRank, eVectors, eValues, true);

Just make sure that you've called configure() on A before doing this.

> 5) Perform k-means clustering on the rows of a matrix U consisting of
> column eigenvectors of L.

The rows of U will be the projections of your original similarity matrix
onto the reduced-dimensional space (they'll be Dense!), so yeah, this
makes sense, but I'm not sure whether you first want to normalize by the
inverse of the eigenvalues (do you want U, or S^-1 * U? By definition of
S, the dispersion of points in the reduced space is going to be highly
clustered along the first few eigenvector directions if you don't
normalize...).

> Same problem as #4 - how to invoke the existing k-means job from within
> my own job.
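One way to read the "U, or S^-1 * U?" question above - and this is an
interpretation for illustration, not anything taken from the Mahout code -
is that each eigenvector column of the n x k embedding gets divided by its
eigenvalue before k-means runs, so the dominant directions don't swamp the
distance computations. A toy sketch with invented names:

```java
import java.util.Arrays;

// Toy sketch of the optional eigenvalue rescaling (invented names; the
// real embedding would be rows of a DistributedRowMatrix). Column j of
// the n x k matrix u is divided by eigenvalue sigma[j].
public class EigenvalueRescale {

  static double[][] rescaleByEigenvalues(double[][] u, double[] sigma) {
    int n = u.length;
    int k = sigma.length;
    double[][] scaled = new double[n][k];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < k; j++) {
        scaled[i][j] = u[i][j] / sigma[j];   // shrink dominant directions
      }
    }
    return scaled;
  }

  public static void main(String[] args) {
    double[][] u = {{4.0, 1.0}, {2.0, 3.0}};  // rows = points, cols = eigenvectors
    double[] sigma = {2.0, 0.5};              // made-up eigenvalues
    System.out.println(Arrays.deepToString(rescaleByEigenvalues(u, sigma)));
    // [[2.0, 2.0], [1.0, 6.0]]
  }
}
```

Whether to rescale at all is exactly the judgment call raised above;
another common choice in the spectral clustering literature is simply
normalizing each row of U to unit length.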
One note to add to Jeff's comment: your eigenvectors will live as the
transpose of what you want for clustering, so you will need to instantiate
a DistributedRowMatrix based on them (or, if they are small enough, just
load the contents of the HDFS file into memory), and then call
transpose(). The result of this is what you want to push into the k-means
job as input.

Hope this helps more than confuses!

  -jake