From: Jake Mannix
Date: Wed, 9 Jun 2010 12:02:48 -0700
Subject: Re: CLI input formats and calling other jobs
To: dev@mahout.apache.org

Hi again Shannon,

On Tue, Jun 8, 2010 at 10:31 PM, Shannon Quinn wrote:

> I'm working on the spectral clustering algorithm, and inherent to that
> strategy is the fact that the algorithm works on pairwise data point
> comparisons, not the raw data itself. Put another way, if I want to
> implement a clustering algorithm for Photoshop CS6's magic wand, the
> pixel intensities constitute the raw data, but what my algorithm actually
> works on is a pairwise comparison of all the pixels (often a normalized
> form of Euclidean distance with some KNN sprinkled in).

Yeah, the fact that you are really looking at a graph (aka matrix) in a
more Markovian form makes this interesting, and I wonder whether there's
something creative we can do other than just computing all the
similarities. But for now, that makes sense as a starting point.

> This preprocessing step - where n-dimensional data is converted to
> matrix-form pairwise comparisons, or "similarities" - there are a lot of
> options for how to implement this.
> My current strategy is for the algorithm to accept data that has already
> been processed into similarity matrix form, and thereby leave the
> preprocessing to the user, but also to provide an example using images
> and a script that converts the raw data to similarity scores. Would this
> be a good setup? Any other suggestions? (I'd thought of somehow adding a
> plugin architecture to allow for different implementations of this
> similarity preprocessing, but I'm not sure where that would mesh with the
> overall Mahout architecture, or that there would even be time this summer
> for it.)

I think in the long run, having the ability to take either symmetric input
(assume that it is already the matrix of similarities) or the "raw" input
would be nice. For now, whatever is easiest for you to work with should be
fine.

> Also, input data types: I've been using LDA and k-means as templates for
> my work, and they make heavy use of SequenceFiles as their input format.
> However, I'm unsure of how to further manipulate the data after using
> seq2sparse on a CSV input file representing an NxN symmetric similarity
> matrix, since k-means and LDA operate on text in all the available
> examples.

Sticking with SequenceFile is the way to go, with the Writable type
dictated by what it is you're doing: when they're vectors, use
VectorWritable, and when you just need to pass around some coefficients,
you can send IntWritable or your own custom Writable.

> My other questions revolve around clarifications on the codebase.
> Right now I'm implementing a canonical k-means spectral clustering
> algorithm; if you want greater detail on the theory, you can read the
> post on my blog (
> http://spectrallyclustered.wordpress.com/2010/06/05/sprint-1-k-means-spectral-clustering/
> ), but the basic steps I'll reproduce here:
>
> 1) Convert raw data to similarity, or "affinity", scores -> matrix A
> (done internally)
>
> 2) Build a diagonal matrix D of vertex degrees (basically the sums of the
> rows of A). Is there a more efficient way to do this step than the
> following?
>
>   Matrix D = new SparseMatrix(cardinality);
>   for (int i = 0; i < A.numRows(); i++) {
>     D.set(i, i, A.getRow(i).zSum());
>   }

So your matrix A is waaaaay too big to just fit in memory, right? So this
code won't simply work on Big (or even Medium) Data. You need to write a
MapReduce job which takes your SequenceFile input and does what your inner
loop effectively does. You can probably have a single Reducer which takes
in all of the outputs of your Map job and builds up a big Vector of
results. You don't need a SparseMatrix: because the matrix is diagonal, it
can be represented as a (Dense)Vector.

> 3) Construct a "normalized" Laplacian via the formula
> L = D^(-1/2) A D^(-1/2). I have this:
>
>   Matrix L = D.times(A.times(D));

Again, since A is a DistributedRowMatrix, there is a better way to compute
this. If you look at what happens when you pre-multiply a matrix A by a
diagonal matrix D, you're taking row i of A and multiplying it by d_ii;
when you post-multiply A by D, you're taking the j'th *column* of A and
multiplying it by d_jj. End result:

  L_ij = d_i * a_ij * d_j

where in your case d_i = 1/sqrt(D_ii). Since D is just a single
DenseVector, in the way we were representing it above, it can be held in
memory in all of the Mappers, and so a single MapReduce job can compute
the matrix L from A and D.
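The arithmetic of steps 2 and 3 can be sketched in-memory with plain Java
arrays. This is a toy illustration only - the class and method names are
invented, and the real job would do the same per-element work inside
Mappers with the degree vector held in memory, as described above:

```java
// Toy sketch of steps 2 and 3 (invented names, plain arrays, no Hadoop):
// degreeVector() is the diagonal of D; normalizedLaplacian() computes
// L_ij = a_ij / (sqrt(d_i) * sqrt(d_j)), i.e. L = D^(-1/2) A D^(-1/2).
public class LaplacianSketch {

  // Step 2: the "diagonal matrix D" is just the vector of row sums of A.
  static double[] degreeVector(double[][] a) {
    double[] d = new double[a.length];
    for (int i = 0; i < a.length; i++) {
      for (double v : a[i]) {
        d[i] += v;                       // same idea as A.getRow(i).zSum()
      }
    }
    return d;
  }

  // Step 3: scale entry (i, j) by 1/sqrt(d_i) on the left and 1/sqrt(d_j)
  // on the right - no matrix-matrix multiply needed.
  static double[][] normalizedLaplacian(double[][] a, double[] d) {
    int n = a.length;
    double[][] l = new double[n][n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        l[i][j] = a[i][j] / (Math.sqrt(d[i]) * Math.sqrt(d[j]));
      }
    }
    return l;
  }

  public static void main(String[] args) {
    double[][] a = {{0, 1, 1}, {1, 0, 0}, {1, 0, 0}};  // tiny affinity matrix
    double[] d = degreeVector(a);                       // d = [2, 1, 1]
    double[][] l = normalizedLaplacian(a, d);
    System.out.println(l[0][1]);                        // 1/sqrt(2)
  }
}
```

In the MapReduce version each Mapper sees one row a_i of A plus the
in-memory degree vector, so it can emit the whole corresponding row of L
in a single pass.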
> Since D is a diagonal matrix, the exponent operation is simply raising
> each individual matrix element to the -0.5 power; I assume this involves
> implementing a new UnaryFunction? Or does a method already exist (aside
> from manually looping over the matrix diagonal)?

Look at Functions.java for all of your UnaryFunction needs:

  import static org.apache.mahout.math.function.Functions.*;

  Vector d_inv_sqrt = d.assign(chain(sqrt, inv));

will do 1/sqrt() of all elements in a vector.

> 4) Perform eigen-decomposition on L. This is where
> DistributedLanczosSolver comes in, but I'm not sure how to fire off the
> job from within another job, since the run() method requires command-line
> parameters.

Jeff is right - this should be refactored so that you can call it from
inside of Java. It can be called that way, really, but you need to do some
setup. The meat of the algorithm is just the solve() method, so if you're
already set up with a (configure()'ed!) DistributedRowMatrix A which you
want to decompose, then create a Matrix to store the eigenvectors, create
a List to store the eigenvalues, and do like Jeff said - just call:

  solver.solve(A, desiredRank, eVectors, eValues, true);

Just make sure that you've called configure() on A before doing this.

> 5) Perform k-means clustering on the rows of a matrix U consisting of
> column eigenvectors of L.

The rows of U will be the projections of your original similarity matrix
onto the reduced-dimensional space (they'll be Dense!), so yeah, this
makes sense, but I'm not sure whether you first want to normalize by the
inverse of the eigenvalues (do you want U, or S^-1 * U? By definition of
S, the dispersion of points in the reduced space is going to be highly
clustered along the first few eigenvector directions if you don't
normalize...).

> Same problem as #4 - how to invoke the existing k-means job from within
> my own job.
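One way to read the "U, or S^-1 * U?" question above - and this is an
interpretation for illustration, not anything taken from the Mahout code -
is that each eigenvector column of the n x k embedding gets divided by its
eigenvalue before k-means runs, so the dominant directions don't swamp the
distance computations. A toy sketch with invented names:

```java
import java.util.Arrays;

// Toy sketch of the optional eigenvalue rescaling (invented names; the
// real embedding would be rows of a DistributedRowMatrix). Column j of
// the n x k matrix u is divided by eigenvalue sigma[j].
public class EigenvalueRescale {

  static double[][] rescaleByEigenvalues(double[][] u, double[] sigma) {
    int n = u.length;
    int k = sigma.length;
    double[][] scaled = new double[n][k];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < k; j++) {
        scaled[i][j] = u[i][j] / sigma[j];   // shrink dominant directions
      }
    }
    return scaled;
  }

  public static void main(String[] args) {
    double[][] u = {{4.0, 1.0}, {2.0, 3.0}};  // rows = points, cols = eigenvectors
    double[] sigma = {2.0, 0.5};              // made-up eigenvalues
    System.out.println(Arrays.deepToString(rescaleByEigenvalues(u, sigma)));
    // [[2.0, 2.0], [1.0, 6.0]]
  }
}
```

Whether to rescale at all is exactly the judgment call raised above;
another common choice in the spectral clustering literature is simply
normalizing each row of U to unit length.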
One note to add to Jeff's comment: your eigenvectors will live as the
transpose of what you want for clustering, so you will need to instantiate
a DistributedRowMatrix based on them (or, if they are small enough, just
load the contents of the HDFS file into memory), and then call
transpose(). The result of this is what you want to push into the k-means
job as input.

Hope this helps more than confuses!

  -jake