mahout-user mailing list archives

From Vincent Xue <xue....@gmail.com>
Subject Re: Transposing a matrix is limited by how large a node is.
Date Fri, 13 May 2011 12:42:47 GMT
The Lanczos implementation of SVD worked very well with my dense
matrix. I ran several iterations to confirm that I had the top 3
eigenvectors of my matrix and used these vectors to visualize the top
principal components of my data.
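
For reference, this is roughly how I drive the solver (a sketch
against the 0.5-era Mahout API, not my exact code; the paths,
dimensions, and rank are placeholders for my real values):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.math.DenseMatrix;
    import org.apache.mahout.math.Matrix;
    import org.apache.mahout.math.hadoop.DistributedRowMatrix;
    import org.apache.mahout.math.hadoop.decomposer.DistributedLanczosSolver;

    int numRows = 1000000, numCols = 20000;  // placeholders for my dimensions

    // Wrap the SequenceFile of rows on HDFS as a distributed matrix.
    DistributedRowMatrix a = new DistributedRowMatrix(
        new Path("/data/A"), new Path("/tmp/A-work"), numRows, numCols);
    a.setConf(new Configuration());

    // Ask for a few more vectors than the 3 I need; Lanczos tends to
    // produce some spurious eigenvectors, and the extras absorb them.
    int desiredRank = 6;
    Matrix eigenVectors = new DenseMatrix(desiredRank, numCols);
    List<Double> eigenValues = new ArrayList<Double>();
    new DistributedLanczosSolver().solve(a, desiredRank, eigenVectors,
        eigenValues, false);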

As for the transpose code, I believe the last part could benefit from
some feedback. My implementation spawns multiple jobs, one for each
split that is needed, so that a single node will not run out of disk
space. The last step combines the pieces sequentially into one
sequence file, which is probably a bad approach. I combine them
sequentially because I want to use the output in other Mahout jobs.
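
Concretely, that merge step is a single-threaded copy along these
lines (a sketch; it assumes the chunks are SequenceFiles of
IntWritable row keys and VectorWritable rows, which is what Mahout's
matrix jobs read and write):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.VectorWritable;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        new Path("/data/A-t"), IntWritable.class, VectorWritable.class);
    IntWritable row = new IntWritable();
    VectorWritable vec = new VectorWritable();
    for (FileStatus chunk : fs.listStatus(new Path("/data/A-t-chunks"))) {
      SequenceFile.Reader reader =
          new SequenceFile.Reader(fs, chunk.getPath(), conf);
      while (reader.next(row, vec)) {  // copy one row at a time
        writer.append(row, vec);
      }
      reader.close();
    }
    writer.close();

Everything funnels through one process, which is why it is so slow.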

Instead of running this slow process, I was thinking it would be
better to keep the output in separate large chunks and perform further
jobs with Hadoop's MultiFileInputFormat. The problem with this,
however, is that once a matrix is split, I do not know of any way to
use the split sequence files in other Mahout jobs, other than writing
dedicated Java code that specifies the multiple input files to the
job.
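
The dedicated driver code I have in mind is at least short; something
like this, using the old mapred API (a sketch; MyTransposeConsumerJob,
numChunks, and the chunk layout are placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    JobConf job = new JobConf(MyTransposeConsumerJob.class);
    // Register every chunk as an input path. Passing the parent
    // directory would also work, since FileInputFormat picks up all
    // the files inside a directory given as an input path.
    for (int i = 0; i < numChunks; i++) {
      FileInputFormat.addInputPath(job, new Path("/data/A-t-chunks/" + i));
    }
    FileOutputFormat.setOutputPath(job, new Path("/data/out"));
    JobClient.runJob(job);

But it means every downstream Mahout job needs a wrapper like this.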

My questions are:
What would be the preferred way of storing large matrices, or even
plain files, on HDFS?
Is it efficient to perform many small mapred jobs on the same matrix
(given that the jobs move to the data rather than the data to the
jobs)?

-Vincent

On Fri, May 6, 2011 at 4:18 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> If you have the code and would like to contribute it, file a JIRA and attach
> a patch.
>
> It will be interesting to hear how the SVD proceeds.  Such a large dense
> matrix is an unusual target for SVD.
>
> Also, it is possible to adapt the R version of random projection to never
> keep all of the large matrix in memory.  Instead, only slices of the matrix
> are kept and the multiplications involved are done progressively.  The
> results are kept in memory, but not the large matrix.  This would probably
> make your sequential version fast enough to use.  R may not be usable unless
> it can read the portions of your large matrix quickly using binary I/O.
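>
> In Java, the slice-at-a-time product has this shape (an untested
> sketch using Mahout's in-memory types; readSlice is a placeholder for
> whatever binary reader you have, and omega is the k-column random
> matrix):
>
>     import org.apache.mahout.math.DenseMatrix;
>     import org.apache.mahout.math.Matrix;
>
>     Matrix y = new DenseMatrix(numRows, k);  // small result, kept
>     for (int start = 0; start < numRows; start += sliceRows) {
>       // Read one slice of A from disk; it is discarded each pass.
>       Matrix slice = readSlice(start, sliceRows);
>       Matrix product = slice.times(omega);   // (sliceRows x k), small
>       for (int i = 0; i < product.numRows(); i++) {
>         y.assignRow(start + i, product.getRow(i));
>       }
>     }
>     // Only y and a single slice of A are ever held in memory.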
>
> Also, I suspect that you are trying to get the transpose in order to
> decompose A' A.  This is not necessary as far as I can tell since you can
> simply decompose A and use that to compute the decomposition of A' A even
> faster than you can compute the decomposition of A itself.
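>
> Concretely: if A = U S V' is the thin SVD of A, then
>
>     A' A = (U S V')' (U S V') = V S U' U S V' = V S^2 V'
>
> because U' U = I. So the eigenvectors of A' A are exactly the right
> singular vectors of A, with eigenvalues S^2, and neither the
> transpose nor the product ever has to be materialized.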
>
> On Fri, May 6, 2011 at 7:36 AM, Vincent Xue <xue.vin@gmail.com> wrote:
>
> > Because I am limited by my resources, I coded up a slower but
> > effective implementation of the transpose job that I could share.
> > It avoids loading all the data onto one node by transposing the
> > matrix in pieces. The slowest part of this is combining the pieces
> > back into one matrix. :(
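> >
> > Each per-piece job only handles a column range [lo, hi); the mapper
> > is roughly this (a sketch with the old mapred API, not my exact
> > code), and a reducer then merges the single-entry partials that
> > share a column index into full rows of the transpose:
> >
> >     import java.io.IOException;
> >     import org.apache.hadoop.io.IntWritable;
> >     import org.apache.hadoop.mapred.JobConf;
> >     import org.apache.hadoop.mapred.MapReduceBase;
> >     import org.apache.hadoop.mapred.Mapper;
> >     import org.apache.hadoop.mapred.OutputCollector;
> >     import org.apache.hadoop.mapred.Reporter;
> >     import org.apache.mahout.math.RandomAccessSparseVector;
> >     import org.apache.mahout.math.Vector;
> >     import org.apache.mahout.math.VectorWritable;
> >
> >     public class PieceTransposeMapper extends MapReduceBase
> >         implements Mapper<IntWritable, VectorWritable,
> >                           IntWritable, VectorWritable> {
> >       private int lo, hi, numRows;
> >
> >       @Override
> >       public void configure(JobConf job) {
> >         lo = job.getInt("piece.lo", 0);     // first column of piece
> >         hi = job.getInt("piece.hi", 0);     // one past last column
> >         numRows = job.getInt("matrix.numRows", 0);
> >       }
> >
> >       @Override
> >       public void map(IntWritable row, VectorWritable value,
> >                       OutputCollector<IntWritable, VectorWritable> out,
> >                       Reporter reporter) throws IOException {
> >         for (int col = lo; col < hi; col++) {
> >           // One single-entry column vector per element of this row;
> >           // the entry lands at position `row` of column `col`.
> >           Vector partial = new RandomAccessSparseVector(numRows, 1);
> >           partial.setQuick(row.get(), value.get().getQuick(col));
> >           out.collect(new IntWritable(col), new VectorWritable(partial));
> >         }
> >       }
> >     }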
> >
