mahout-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jake Mannix <jake.man...@gmail.com>
Subject Re: Need a little help with using SVD
Date Tue, 22 Mar 2011 00:34:05 GMT
On Sun, Mar 20, 2011 at 5:27 PM, Timothy Potter <thelabdude@gmail.com>wrote:

> Hi Jake,
>
> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3
> reducers
> per node and mapred.child.java.opts=-Xmx4096m. The map tasks completed
> within a few minutes but then all of my 24 reducers failed near the 70%
> mark
> with error "Task attempt_201103201840_0004_r_000023_0 failed to report
> status for 601 seconds. Killing!" The data node servers stayed at a healthy
> load avg below 4 and never paged ...
>

Weird.   The transpose job is totally trivial, but it may be that the
vectors you
are producing are abnormally large.  In fact, if this came from text, this
is
probably the case: very common words (which aren't stop words) occur
in most of your documents, and so are very very large.  Might they be
larger than your memory? 6-million entries, each one, when represented
sparsely (sadly, in this case), have an int and a float, so still less than
100MB,
so that's not it.

BTW: you need to make sure you use the same number of reducers as when
you made your eigenvector matrix, as that's a requirement for a map-side
join.   That should probably be documented!


> So I increased the "mapred.task.timeout" Hadoop parameter to 20 minutes
> instead of the default 10 minutes and it failed again. The reduce code for
> the TransposeJob looks straight-forward, so I'm going to have to dig in
> deeper to figure out what's causing the problem.
>

Yeah, if you can see where it's timing out, that would be great.  Not sure
why that task should ever take very long.

  -jake



> Cheers,
> Tim
>
> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
>
> > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <thelabdude@gmail.com
> >wrote:
> >
> >> Regarding Jake's comment: " ... you need to run the RowIdJob on these
> >> tfidf-vectors first ..."
> >>
> >> I did this and now have an m x n matrix T (m=6076937, n=20444). My SVD
> >> eigenvector matrix E is p x q (p=87, q=20444).
> >
> >
> > Ok, so to help you understand what's going on here, I'm going to go into
> > a little of the inner details of what's going on here.
> >
> > You are right, you have a matrix T, with 6,076,937 rows, and each row has
> > 20,444 columns (most of which are zero, and it's represented sparsely,
> but
> > still, they live in a vector space of dimension 20,444).  Similarly,
> you've
> > made an eigenvector matrix, which has 87 rows (ie 87 eigenvectors) and
> > each of these rows has exactly 20,444 columns (and most likely, they'll
> > all be nonzero, because eigenvectors have no reason to be sparse).
> >
> > In particular, T and E are represented as *lists of rows*, each row is a
> > vector of dimension 20,444.  T has six million of these rows, and E has
> > only 87.
> >
> >
> >> So to multiply these two
> >> matrices, I need to transpose E so that the number of columns in T
> equals
> >> the number of rows in E (i.e. E^T is q x p) the result of the matrixmult
> >> would give me an m x p matrix (m=6076937, p=87).
> >>
> >
> > You're exactly right that you want to multiply T by E^T, because you
> can't
> > compute T * E.
> >
> > The way it turns out in practice, computing the matrix product of two
> > matrices as a map-reduce job is efficiently done as a map-side join on
> > two row-based matrices with the same number of rows, and the columns
> > are the ones which are different.  In particular, if you take a matrix X
> > which
> > is represented as a set of numRowsX rows, each of which has numColsX,
> > and another matrix with numRowsY == numRowsX, each of which has
> > numColsY (!= numColsX), then by summing the outer-products of each
> > of the numRowsX pairs of vectors, you get a matrix of with numRowsZ ==
> > numColsX, and numColsZ == numColsY (if you instead take the reverse
> > outer product of the vector pairs, you can end up with the transpose of
> > this
> > final result, with numRowsZ == numColsY, and numColsZ == numColsX).
> >
> > Unfortunately, you have a pair of matrices which have different numbers
> > of rows, and the same number of columns, but you want a pair of matrices
> > with the same number of rows and (possibly) different numbers of columns.
> >
> >
> >> So I tried to run matrixmult with:
> >
> > matrixmult --numRowsA 6076937 --numColsA 20444 --numRowsB 20444
> --numColsB
> >> 87 \
> >> --inputPathA
> >> /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
> >> --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244
> >
> >
> >> (--inputPathA points to the output of the rowid job)
> >>
> >> This results in:
> >> Exception in thread "main" org.apache.mahout.math.CardinalityException:
> >>
> >
> >
> >> In the code, I see the test that row counts must be identical for the
> two
> >>
> >> input matrices. Thus, it seems like the job requires me to transpose
> this
> >> large matrix, just to re-transpose it back to it's original form during
> >> the
> >> multiplication? Or have I missed something crucial again?
> >>
> >
> > You actually need to transpose the input matrix (T), and then re-run with
> > T^t
> > and E^t (the latter you apparently already have created).
> >
> > We should really rename the "matrixmultiply" job to be called
> > "transposematrixmultiply", because that's what it really does.
> >
> >   -jake
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message