On Sat, Jul 19, 2014 at 5:04 PM, Pat Ferrel <pat.ferrel@gmail.com> wrote:
> We need to back up a bit here. This involves two questions, one for core
> math and one for data prep:
>
> 1) The math question: does a CheckpointedDrm need to have a row for every
> sequential row key from 0 to nrow? Can there be missing row keys in the
> sequence and still get correct results for B %*% C, where C and/or B have
> rows that have no representation in the underlying rdd, not even n => {},
> but have the same _nrow passed in during creation?
>
> 2) The data prep issue depends on the answer to #1: potentially there are
> matrices A, B, C, … All come from data whose rows are IDed by external User
> IDs. The total of these IDs defines a row cardinality for all matrices. The
> total number of Mahout row keys will come from the collected number of
> unique User IDs.
>
> If the answer to #1 is “yes, you must have at least n => {} for every
> sequential row key 0 through nrow”, then A, B, C, and so on will need to
> have the Int row keys inserted at all points in the matrices where no data
> for the external ID was seen. This implies reading them in as a unit. Rbind
> cannot do this after each matrix has been read in, since the row key gaps may
> not all be at the end of a matrix.
>
> If the answer to #1 is that a nonexistent row key (a gap in the sequence)
> is exactly the same as having n => {} in the rdd, then changing only the row
> cardinality of all matrices to match the total number of IDs seen will
> create the correct result. If rbind with drmParallelizeEmpty can be used to
> only change the cardinality then it may work.
>
> I’ll keep poking at #1 but would love a definitive answer.
The answer _has_ to be "yes". There cannot be missing row keys for Int-keyed
DRMs. The proof of my claim is in my previous mail:
    val drmB = drmA + 1
will give an incorrect result (at least on the Spark backend) if there are such
missing rows.
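
To make the failure mode concrete, here is a minimal plain-Scala sketch. It does
not use Mahout or Spark at all: ToyDrm, plus, collect and fillGaps are
hypothetical names invented for illustration, and the sketch assumes that an
elementwise operator is applied only to rows that physically exist, while absent
row keys are read back as all-zero rows. Under those assumptions a gap in the
key sequence silently stays at 0 after "+ 1", where the correct value is 1.

```scala
// Toy stand-in for an Int-keyed DRM: only rows that physically exist are stored.
// This is NOT Mahout's API; every name here is hypothetical.
final case class ToyDrm(nrow: Int, ncol: Int, rows: Map[Int, Vector[Double]]) {

  // Elementwise scalar add, applied only to the rows that exist
  // (assumption: this mirrors a per-row map over the backing collection).
  def plus(s: Double): ToyDrm =
    copy(rows = rows.map { case (k, v) => k -> v.map(_ + s) })

  // Collect to a dense matrix; absent row keys are read as all-zero rows.
  def collect: Vector[Vector[Double]] =
    Vector.tabulate(nrow)(i => rows.getOrElse(i, Vector.fill(ncol)(0.0)))

  // Insert an explicit all-zero row for every missing key in 0 until nrow.
  def fillGaps: ToyDrm =
    copy(rows = (0 until nrow).map(i => i -> rows.getOrElse(i, Vector.fill(ncol)(0.0))).toMap)
}

object MissingRowsDemo {
  // 3 x 2 matrix whose row 1 has no physical representation.
  val gappy: ToyDrm = ToyDrm(3, 2, Map(0 -> Vector(1.0, 2.0), 2 -> Vector(3.0, 4.0)))

  // Row 1 is never visited by plus, so it is still read back as zeros.
  val wrong: Vector[Vector[Double]] = gappy.plus(1.0).collect

  // With the gap filled first, row 1 is visited and becomes all ones.
  val right: Vector[Vector[Double]] = gappy.fillGaps.plus(1.0).collect

  def main(args: Array[String]): Unit = {
    println(wrong)
    println(right)
  }
}
```

In this toy model the two results differ only in row 1, which is why every key
in the sequence needs at least an empty row before such an operator is applied.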
Thanks
