mahout-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Need a little help with using SVD
Date Tue, 29 Mar 2011 05:14:17 GMT
The issue here is almost certainly that the sort implied by this conversion
is only reasonable with a small number of non-zero elements.  Special-casing
a relatively dense case is probably worthwhile.
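
For illustration, one way to sidestep the per-element insertion cost
entirely (a rough, untested sketch against the 0.4 vector API; method
names are from memory, so double-check them):

    import java.util.Arrays;
    import java.util.Iterator;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.SequentialAccessSparseVector;
    import org.apache.mahout.math.Vector;

    // Collect the nonzero indices, sort them once, then set values in
    // ascending index order so each set() appends to the sorted arrays
    // instead of shifting them: O(n log n) overall instead of O(n^2).
    static SequentialAccessSparseVector toSequentialAccess(RandomAccessSparseVector in) {
      int n = in.getNumNondefaultElements();
      int[] indices = new int[n];
      Iterator<Vector.Element> it = in.iterateNonZero();
      for (int i = 0; it.hasNext(); i++) {
        indices[i] = it.next().index();
      }
      Arrays.sort(indices);
      SequentialAccessSparseVector out =
          new SequentialAccessSparseVector(in.size(), n);
      for (int index : indices) {
        out.set(index, in.getQuick(index));  // in-order => append, no shift
      }
      return out;
    }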


On Mon, Mar 28, 2011 at 10:04 PM, Timothy Potter <thelabdude@gmail.com> wrote:

> Hi Jake,
>
> I've tracked down the bottleneck in the TransposeJob's reducer to the line
> where Mahout creates a new SequentialAccessSparseVector from a
> RandomAccessSparseVector after the while loop completes:
>
>      SequentialAccessSparseVector outVector = new SequentialAccessSparseVector(tmp);
>
> For high-frequency terms (some of which occur over ~1M times in my data),
> the code to create a SequentialAccessSparseVector from a
> RandomAccessSparseVector bogs down completely. Of course, I should
> probably get rid of terms that occur this frequently, and it seems I
> could keep increasing mapred.task.timeout, but as you said, the
> "transpose job is totally trivial", so there may be a bigger issue here
> worth digging into further. I think it has something to do with
> OrderedIntDoubleMapping, but I haven't had enough time to devote to this
> issue to know for sure ... I'll keep at it though.
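>
> If it helps, a standalone snippet along these lines should reproduce the
> slowdown outside Hadoop (the sizes below are made up to mirror my data):
>
>     import java.util.Random;
>     import org.apache.mahout.math.RandomAccessSparseVector;
>     import org.apache.mahout.math.SequentialAccessSparseVector;
>
>     public class ConversionBench {
>       public static void main(String[] args) {
>         int cardinality = 6076937;   // rows in my matrix
>         int nonZeros = 1000000;      // a high-frequency term
>         Random rnd = new Random(42);
>         RandomAccessSparseVector tmp =
>             new RandomAccessSparseVector(cardinality, nonZeros);
>         for (int i = 0; i < nonZeros; i++) {
>           tmp.set(rnd.nextInt(cardinality), rnd.nextDouble());
>         }
>         // This is the conversion the TransposeJob reducer performs:
>         long start = System.currentTimeMillis();
>         SequentialAccessSparseVector outVector =
>             new SequentialAccessSparseVector(tmp);
>         System.out.println("conversion took "
>             + (System.currentTimeMillis() - start) + " ms");
>       }
>     }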
>
> Cheers,
> Tim
>
> On Mon, Mar 21, 2011 at 6:34 PM, Jake Mannix <jake.mannix@gmail.com>
> wrote:
>
> >
> >
> > On Sun, Mar 20, 2011 at 5:27 PM, Timothy Potter <thelabdude@gmail.com>
> > wrote:
> >
> >> Hi Jake,
> >>
> >>
> >> I'm using a cluster with 8 data nodes (EC2 xlarge instances) with 3
> >> reducers per node and mapred.child.java.opts=-Xmx4096m. The map tasks
> >> completed within a few minutes, but then all 24 of my reducers failed
> >> near the 70% mark with the error "Task
> >> attempt_201103201840_0004_r_000023_0 failed to report status for 601
> >> seconds. Killing!" The data node servers stayed at a healthy load avg
> >> below 4 and never paged ...
> >>
> >
> > Weird.  The transpose job is totally trivial, but it may be that the
> > vectors you are producing are abnormally large.  In fact, if this came
> > from text, this is probably the case: very common words (which aren't
> > stop words) occur in most of your documents, so those rows are very,
> > very large.  Might they be larger than your memory?  Probably not:
> > even at 6 million entries, each entry, represented sparsely (sadly, in
> > this case), holds an int index and a double value, i.e. about
> > 6,076,937 x 12 bytes = ~73MB, still less than 100MB, so that's not it.
> >
> > BTW: you need to make sure you use the same number of reducers as when
> > you made your eigenvector matrix, as that's a requirement for a map-side
> > join.   That should probably be documented!
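> > (E.g., if the eigenvector matrix was written with 24 reducers, pass the
> > same count when transposing, something like the stock Hadoop generic
> > option -Dmapred.reduce.tasks=24 -- adjust to however you actually
> > launch the jobs.)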
> >
> >
> >> So I increased the "mapred.task.timeout" Hadoop parameter to 20 minutes
> >> instead of the default 10 minutes and it failed again. The reduce code
> >> for the TransposeJob looks straightforward, so I'm going to have to dig
> >> in deeper to figure out what's causing the problem.
> >>
> >
> > Yeah, if you can see where it's timing out, that would be great.  Not
> > sure why that task should ever take very long.
> >
> >   -jake
> >
> >
> >
> >> Cheers,
> >> Tim
> >>
> >> On Sat, Mar 19, 2011 at 3:34 PM, Jake Mannix <jake.mannix@gmail.com>
> >> wrote:
> >>
> >> > On Sat, Mar 19, 2011 at 10:32 AM, Timothy Potter <thelabdude@gmail.com>
> >> > wrote:
> >> >
> >> >> Regarding Jake's comment: " ... you need to run the RowIdJob on these
> >> >> tfidf-vectors first ..."
> >> >>
> >> >> I did this and now have an m x n matrix T (m=6076937, n=20444). My SVD
> >> >> eigenvector matrix E is p x q (p=87, q=20444).
> >> >
> >> >
> >> > Ok, so to help you understand this, I'm going to go into a little of
> >> > the inner details of what's going on here.
> >> >
> >> > You are right: you have a matrix T with 6,076,937 rows, and each row
> >> > has 20,444 columns (most of which are zero, and it's represented
> >> > sparsely, but still, they live in a vector space of dimension 20,444).
> >> > Similarly, you've made an eigenvector matrix, which has 87 rows (i.e.
> >> > 87 eigenvectors), and each of these rows has exactly 20,444 columns
> >> > (and most likely they'll all be nonzero, because eigenvectors have no
> >> > reason to be sparse).
> >> >
> >> > In particular, T and E are represented as *lists of rows*, where each
> >> > row is a vector of dimension 20,444.  T has six million of these rows,
> >> > and E has only 87.
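> >> >
> >> > (On disk, each of these is a SequenceFile of (IntWritable row id,
> >> > VectorWritable row) pairs; a quick way to peek at one -- untested
> >> > sketch, imports omitted, and the path is a placeholder:)
> >> >
> >> >   Configuration conf = new Configuration();
> >> >   FileSystem fs = FileSystem.get(conf);
> >> >   Path path = new Path("/path/to/matrix/part-00000");  // placeholder
> >> >   SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
> >> >   IntWritable rowId = new IntWritable();
> >> >   VectorWritable row = new VectorWritable();
> >> >   while (reader.next(rowId, row)) {   // one row vector per record
> >> >     System.out.println(rowId.get() + ": "
> >> >         + row.get().getNumNondefaultElements() + " nonzeros");
> >> >   }
> >> >   reader.close();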
> >> >
> >> >
> >> >> So to multiply these two matrices, I need to transpose E so that the
> >> >> number of columns in T equals the number of rows in E^T (i.e. E^T is
> >> >> q x p); the result of the matrixmult would give me an m x p matrix
> >> >> (m=6076937, p=87).
> >> >>
> >> >
> >> > You're exactly right that you want to multiply T by E^T, because you
> >> > can't compute T * E.
> >> >
> >> > The way it turns out in practice, computing the matrix product of two
> >> > matrices as a map-reduce job is efficiently done as a map-side join on
> >> > two row-based matrices with the same number of rows, where only the
> >> > column counts differ.  In particular, if you take a matrix X which is
> >> > represented as a set of numRowsX rows, each of which has numColsX
> >> > columns, and another matrix with numRowsY == numRowsX rows, each of
> >> > which has numColsY columns (!= numColsX), then by summing the outer
> >> > products of each of the numRowsX pairs of vectors, you get a matrix
> >> > with numRowsZ == numColsX and numColsZ == numColsY (if you instead
> >> > take the reverse outer product of the vector pairs, you end up with
> >> > the transpose of this final result, with numRowsZ == numColsY and
> >> > numColsZ == numColsX).
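> >> >
> >> > In code, the summed outer products look like this (a toy in-memory
> >> > sketch, with dense arrays standing in for the distributed rows):
> >> >
> >> >   // Z = X^T * Y, built by summing the outer product of each pair of
> >> >   // rows; this is the contribution each map task computes per row.
> >> >   static double[][] transposeTimes(double[][] x, double[][] y) {
> >> >     int numColsX = x[0].length;
> >> >     int numColsY = y[0].length;
> >> >     double[][] z = new double[numColsX][numColsY];
> >> >     for (int row = 0; row < x.length; row++) {  // numRowsX == numRowsY
> >> >       for (int i = 0; i < numColsX; i++) {
> >> >         if (x[row][i] == 0.0) continue;         // exploit sparsity
> >> >         for (int j = 0; j < numColsY; j++) {
> >> >           z[i][j] += x[row][i] * y[row][j];
> >> >         }
> >> >       }
> >> >     }
> >> >     return z;  // numColsX x numColsY
> >> >   }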
> >> >
> >> > Unfortunately, you have a pair of matrices which have different
> >> > numbers of rows and the same number of columns, but you want a pair
> >> > of matrices with the same number of rows and (possibly) different
> >> > numbers of columns.
> >> >
> >> >
> >> >> So I tried to run matrixmult with:
> >> >>
> >> >> matrixmult --numRowsA 6076937 --numColsA 20444 \
> >> >>   --numRowsB 20444 --numColsB 87 \
> >> >>   --inputPathA /asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-matrix/matrix \
> >> >>   --inputPathB /asf-mail-archives/mahout-0.4/svd/transpose-244
> >> >>
> >> >> (--inputPathA points to the output of the rowid job)
> >> >>
> >> >> This results in:
> >> >>
> >> >> Exception in thread "main" org.apache.mahout.math.CardinalityException:
> >> >>
> >> >> In the code, I see the test that row counts must be identical for the
> >> >> two input matrices. Thus, it seems like the job requires me to
> >> >> transpose this large matrix, just to re-transpose it back to its
> >> >> original form during the multiplication? Or have I missed something
> >> >> crucial again?
> >> >>
> >> >
> >> > You actually need to transpose the input matrix (T), and then re-run
> >> > with T^t and E^t (the latter you apparently already have created).
> >> >
> >> > We should really rename the "matrixmultiply" job to be called
> >> > "transposematrixmultiply", because that's what it really does.
> >> >
> >> >   -jake
> >> >
> >>
> >
> >
>
