Hi!
I got exactly the same error when running alternating least squares.
You need to set both the task timeout and the expiry interval timeout.
Exact details are found in my blog:
http://bickson.blogspot.com/2011/03/tunninghadoopconfigurationforhigh.html
Best,
 Danny Bickson
On Sun, Mar 20, 2011 at 8:27 PM, Timothy Potter <thelabdude@gmail.com> wrote:
> Hi Jake,
>
> Thank you for the detailed explanation; seems like a very clever way to
> distribute the matrix multiplication process.
>
> I tried running the transpose job on my T matrix using the following
> options:
>
> bin/mahout transpose -i
> /asfmailarchives/mahout0.4/sparse1gramstem/tfidfmatrix/matrix
> --numRows 6076937 --numCols 20444
>
> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3 reducers
> per node and mapred.child.java.opts=-Xmx4096m. The map tasks completed
> within a few minutes, but then all 24 of my reducers failed near the 70% mark
> with the error "Task attempt_201103201840_0004_r_000023_0 failed to report
> status for 601 seconds. Killing!" The data node servers stayed at a healthy
> load average below 4 and never paged ...
>
> So I increased the "mapred.task.timeout" Hadoop parameter to 20 minutes
> instead of the default 10 minutes and it failed again. The reduce code for
> the TransposeJob looks straightforward, so I'm going to have to dig in
> deeper to figure out what's causing the problem.
>
> Cheers,
> Tim
>
> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
>
> > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <thelabdude@gmail.com>
> > wrote:
> >
> >> Regarding Jake's comment: " ... you need to run the RowIdJob on these
> >> tfidfvectors first ..."
> >>
> >> I did this and now have an m x n matrix T (m=6076937, n=20444). My SVD
> >> eigenvector matrix E is p x q (p=87, q=20444).
> >
> >
> > Ok, so to help you understand what's going on here, I'm going to go into
> > a few of the inner details.
> >
> > You are right, you have a matrix T with 6,076,937 rows, and each row has
> > 20,444 columns (most of which are zero, and it's represented sparsely, but
> > still, they live in a vector space of dimension 20,444). Similarly, you've
> > made an eigenvector matrix, which has 87 rows (i.e. 87 eigenvectors), and
> > each of these rows has exactly 20,444 columns (and most likely, they'll
> > all be nonzero, because eigenvectors have no reason to be sparse).
> >
> > In particular, T and E are represented as *lists of rows*, each row is a
> > vector of dimension 20,444. T has six million of these rows, and E has
> > only 87.
> >
> >
> >> So to multiply these two
> >> matrices, I need to transpose E so that the number of columns in T equals
> >> the number of rows in E (i.e. E^T is q x p); the result of the matrixmult
> >> would give me an m x p matrix (m=6076937, p=87).
> >>
> >
> > You're exactly right that you want to multiply T by E^T, because you
> > can't compute T * E.
> >
> > The way it turns out in practice, computing the matrix product of two
> > matrices as a mapreduce job is done efficiently as a map-side join on
> > two row-based matrices with the same number of rows, where the columns
> > are what differ. In particular, if you take a matrix X which is
> > represented as a set of numRowsX rows, each of which has numColsX
> > columns, and another matrix Y with numRowsY == numRowsX rows, each of
> > which has numColsY (!= numColsX) columns, then by summing the outer
> > products of each of the numRowsX pairs of vectors, you get a matrix Z
> > with numRowsZ == numColsX and numColsZ == numColsY (if you instead take
> > the reverse outer product of the vector pairs, you end up with the
> > transpose of this final result, with numRowsZ == numColsY and
> > numColsZ == numColsX).
> >
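The outer-product summation described above can be sketched in a few lines
of plain Python. This is only an in-memory illustration, not Mahout's actual
mapreduce code; the function name outer_product_sum and the tiny matrices
are made up for the example. Each (X row, Y row) pair contributes one outer
product, and the running sum equals X^T * Y.

```python
def outer_product_sum(x_rows, y_rows):
    """Sum the outer products of paired rows; the result equals X^T * Y."""
    if len(x_rows) != len(y_rows):
        raise ValueError("X and Y must have the same number of rows")
    num_cols_x = len(x_rows[0])
    num_cols_y = len(y_rows[0])
    # Z has numColsX rows and numColsY columns.
    z = [[0.0] * num_cols_y for _ in range(num_cols_x)]
    for x_row, y_row in zip(x_rows, y_rows):
        for i, x_val in enumerate(x_row):
            if x_val == 0:
                continue  # zero entries of a sparse row add nothing
            for j, y_val in enumerate(y_row):
                z[i][j] += x_val * y_val
    return z


# Hypothetical toy data: X is 2 x 3, Y is 2 x 2, so Z is 3 x 2.
X = [[1, 2, 0],
     [0, 1, 3]]
Y = [[4, 5],
     [6, 7]]
Z = outer_product_sum(X, Y)
```

Because each row pair can be processed on its own, the work splits naturally
across mappers, with the partial outer products summed afterwards.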
> > Unfortunately, you have a pair of matrices which have different numbers
> > of rows, and the same number of columns, but you want a pair of matrices
> > with the same number of rows and (possibly) different numbers of columns.
> >
> >
> >> So I tried to run matrixmult with:
> >
> > matrixmult numRowsA 6076937 numColsA 20444 numRowsB 20444
> numColsB
> >> 87 \
> >> inputPathA
> >> /asfmailarchives/mahout0.4/sparse1gramstem/tfidfmatrix/matrix \
> >> inputPathB /asfmailarchives/mahout0.4/svd/transpose244
> >
> >
> >> (inputPathA points to the output of the rowid job)
> >>
> >> This results in:
> >> Exception in thread "main" org.apache.mahout.math.CardinalityException:
> >>
> >
> >
> >> In the code, I see the test that row counts must be identical for the two
> >> input matrices. Thus, it seems like the job requires me to transpose this
> >> large matrix, just to re-transpose it back to its original form during
> >> the multiplication? Or have I missed something crucial again?
> >>
> >
> > You actually need to transpose the input matrix (T), and then rerun with
> > T^t and E^t (the latter you apparently already have created).
> >
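A small sketch of why the inputs must be T^t and E^t, assuming the job
computes A^T * B for its two inputs as described in this thread. The helper
names transpose and transpose_times are hypothetical, the matrices are toy
data, and the ValueError merely stands in for the row-count check that
raised the CardinalityException.

```python
def transpose(rows):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*rows)]


def transpose_times(a_rows, b_rows):
    """Compute A^T * B by summing outer products of paired rows."""
    if len(a_rows) != len(b_rows):
        raise ValueError(
            "row counts differ: %d vs %d" % (len(a_rows), len(b_rows)))
    result = [[0] * len(b_rows[0]) for _ in a_rows[0]]
    for a_row, b_row in zip(a_rows, b_rows):
        for i, a_val in enumerate(a_row):
            for j, b_val in enumerate(b_row):
                result[i][j] += a_val * b_val
    return result


T = [[1, 2], [3, 4], [5, 6]]           # m=3 rows, n=2 columns
E = [[1, 0], [0, 1], [1, 1], [1, -1]]  # p=4 rows, n=2 columns

# Passing T and E^t fails the row-count check: 3 rows vs 2 rows.
try:
    transpose_times(T, transpose(E))
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

# Passing T^t and E^t works: both have n=2 rows, and
# (T^t)^t * E^t = T * E^t, an m x p (here 3 x 4) matrix.
product = transpose_times(transpose(T), transpose(E))
```

The first transpose is undone by the job's implicit one, so the result is
the desired m x p product.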
> > We should really rename the "matrixmultiply" job to be called
> > "transposematrixmultiply", because that's what it really does.
> >
> > jake
> >
>
