mahout-dev mailing list archives

From Pat Ferrel <pat.fer...@gmail.com>
Subject Re: Problem of dimensions
Date Fri, 18 Jul 2014 19:42:45 GMT
If you could comment on the PR that would be great. In this case I do know which code you are
talking about.

1) OK, this is a good catch. I didn’t know that CheckpointedDrmSpark, or really all Drms, are
meant to be immutable, which is actually documented in 2.4 of the DSL PDF. I think this is what
Ted was saying too, assuming I knew it was supposed to be immutable. Scala puts “immutable”
in the fully qualified class name to flag the fact. I wonder whether that’s a good idea here?

2) I’m talking about the R semantics for rbind. Out of the box R is only dense, so its semantics
are dense by definition. Putting in all-zero rows adds a bunch of 0.0 doubles to a matrix.
I’m saying you don’t even need or want the empty row keys. This is certainly not what
we want in a sparse vector or matrix unless it is needed. Please defer to Dmitriy, Sebastian, or
Ted on this; they may well contradict me.

3) If I did an rbind, do you want me to overload it to take an Int and only touch _nrow (I’m not
even sure this is possible; I haven’t looked)? Is this really what you want?


> On Jul 17, 2014, at 4:58 PM, Anand Avati <avati@gluster.org> wrote:
> 
> And I still really doubt if just fudging nrow is a "complete". For e.g, if
> after fixing up nrow (either mutating or by creating a new CheckpointedDrm
> as I described in my previous mail), if you were to do:
> 
>  drmA = ... // somehow nrow is fudged
> 
>  drmB = drmA + 1 // invoke OpAewScalar operator
> 
> I don't see how this would return the correct answer in drmB. mapBlock() on
> drmA is just not performed on those "invisible" rows for the "+ 1" to be
> applied on the cells.

Seems like a good test. It certainly can be done correctly given my understanding below, but I’m
not sure whether it currently is.

First, you are creating a dense matrix from a sparse one: drmB is really a non-sparse matrix
that is distributed. This requires that all non-existent rows and columns be created in the
new matrix. The map would be over all IDs from 0 to nrow, and each Vector element needs
to have 1 added, even the non-existent ones, so you need to use the right vector iterator.
There are several cases where dense matrices are created from sparse ones, such as factorization.
Assumptions about ordinality and the row or column IDs allow this to happen. So new dense
rows and elements may be created, assuming that key ordinality and nrow can be used to determine
the missing rows (or columns). The point would be not to force a dense anything unless needed,
as in your case above.
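
To make that concrete, here is a minimal sketch in plain Scala. It is a toy model, not
the Mahout DRM API: SparseRows, plusStoredOnly, and plusAllRows are names I made up for
illustration, and rows or cells missing from the map are assumed to be implicit zeros.
It only shows why a map over the stored rows misses the "invisible" ones, while a map
over every ID from 0 until nrow does not.

```scala
// Toy stand-in for a sparse, row-keyed matrix; NOT the Mahout DRM API.
// Rows and cells absent from `rows` are treated as implicit zeros.
final case class SparseRows(nrow: Int, ncol: Int, rows: Map[Int, Map[Int, Double]])

object ScalarAdd {
  // Naive version: visits only the rows that are physically stored,
  // so rows that exist only because of nrow never receive the scalar.
  def plusStoredOnly(m: SparseRows, s: Double): Map[Int, Vector[Double]] =
    m.rows.map { case (r, cells) =>
      r -> Vector.tabulate(m.ncol)(c => cells.getOrElse(c, 0.0) + s)
    }

  // Dense-aware version: visits every row ID in 0 until nrow and every
  // column, so implicit zeros also become `s`.
  def plusAllRows(m: SparseRows, s: Double): Vector[Vector[Double]] =
    Vector.tabulate(m.nrow, m.ncol) { (r, c) =>
      m.rows.get(r).flatMap(_.get(c)).getOrElse(0.0) + s
    }
}

object ScalarAddDemo {
  def main(args: Array[String]): Unit = {
    // 3 x 2 matrix with a single stored cell at (0, 1).
    val a = SparseRows(nrow = 3, ncol = 2, rows = Map(0 -> Map(1 -> 5.0)))
    println(ScalarAdd.plusStoredOnly(a, 1.0).size) // rows produced by the naive map
    println(ScalarAdd.plusAllRows(a, 1.0).size)    // rows produced by the full map
  }
}
```

In this toy the naive map yields one row and the full map yields three, which is the
difference the test above would be checking for.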

The question is good and I admit that my knowledge of this is not the best, so please refer
to the experts.

> 
> I think rbind() is the safest approach here. Also, I'm not sure why you
> feel rbind() is only for "dense" matrices. If the B matrix for rbind
> operator was created from drmParallelizeEmpty() (as shown in the example in
> the commit), the DrmRdd will be holding only empty RandomAccessSparseVectors
> and will be significantly less expensive than a dense operation.

I didn’t say that rbind is only for dense matrices; at least that isn’t what I meant.
If it requires me to calculate missing row IDs for no reason, it’s wrong. It violates my
understanding of sparse semantics: don’t mess with non-existent data in vectors or matrices
unless needed (as in your example of adding 1 to all elements, even non-existent ones). Also,
I meant that we shouldn’t be inflexibly bound to R, since there are a few sparse cases that
don’t fit and this seems like one.
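
As a rough illustration of what I mean by sparse semantics here, the sketch below is
again plain Scala and a toy model, not Mahout code: SparseMat, storedCells, rbind, and
empty are hypothetical names, and unlike a real DrmRdd it stores no keys at all for
empty rows. Under those assumptions, binding an all-zero block onto a matrix changes
only the row count and adds no stored data.

```scala
// Toy model, NOT Mahout's API: only non-empty rows are stored, and the
// logical shape is carried separately in nrow and ncol.
final case class SparseMat(nrow: Int, ncol: Int, rows: Map[Int, Map[Int, Double]]) {
  // Number of explicitly stored cells.
  def storedCells: Int = rows.valuesIterator.map(_.size).sum

  // Stack `other` below this matrix, shifting its row keys by this.nrow.
  def rbind(other: SparseMat): SparseMat = {
    require(ncol == other.ncol, "column counts must match")
    val shifted = other.rows.map { case (r, cells) => (r + nrow) -> cells }
    SparseMat(nrow + other.nrow, ncol, rows ++ shifted)
  }
}

object SparseMat {
  // An all-zero matrix: a shape and nothing else.
  def empty(nrow: Int, ncol: Int): SparseMat = SparseMat(nrow, ncol, Map.empty)
}

object RbindDemo {
  def main(args: Array[String]): Unit = {
    val a = SparseMat(nrow = 2, ncol = 3, rows = Map(0 -> Map(1 -> 2.0)))
    val c = a.rbind(SparseMat.empty(4, 3))
    println(s"nrow=${c.nrow}, storedCells=${c.storedCells}")
  }
}
```

Whether the real rbind can get this cheap depends on how the DrmRdd represents empty
rows, which is exactly the part I would leave to the experts.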

> 
