mahout-dev mailing list archives

From Pat Ferrel <pat.fer...@gmail.com>
Subject Re: Problem of dimensions
Date Sun, 20 Jul 2014 00:04:05 GMT
We need to back up a bit here. This involves two questions, one for core math and one for
data prep:

1) The math question: does a CheckpointedDrm need to have a row for every sequential row key
from 0 to nrow? Can there be missing row keys in the sequence and still get correct results
for B %*% C, where C and/or B have rows with no representation in the underlying rdd (not
even n => {}) but have the same _nrow passed in during creation?
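
To make the question concrete, here is a minimal sketch of the math alone, in plain Python rather than the Mahout DSL. It assumes a matrix is modeled as a dict of row key to sparse row plus a separately tracked nrow; the names multiply, C_gap, and C_empty are made up for illustration. It shows only that, mathematically, a missing row key and an empty row give the same product. Whether the Spark-backed operators behave this way is exactly what is being asked.

```python
def multiply(b_rows, b_nrow, c_rows):
    """Sparse B %*% C over row-keyed dicts; returns (rows, nrow)."""
    out = {}
    for i, b_row in b_rows.items():
        acc = {}
        for j, b_ij in b_row.items():
            # A missing row key j in C contributes nothing,
            # which is the same as an explicitly empty row.
            for k, c_jk in c_rows.get(j, {}).items():
                acc[k] = acc.get(k, 0.0) + b_ij * c_jk
        out[i] = acc
    # The product keeps B's declared row cardinality.
    return out, b_nrow


# Hypothetical data: row key 1 is absent from B; nrow is 3 for all.
B = {0: {0: 1.0, 1: 2.0}, 2: {2: 3.0}}
C_gap = {0: {0: 4.0}, 2: {1: 5.0}}             # row key 1 missing
C_empty = {0: {0: 4.0}, 1: {}, 2: {1: 5.0}}    # row key 1 present, empty

result_gap, nrow_gap = multiply(B, 3, C_gap)
result_empty, nrow_empty = multiply(B, 3, C_empty)
```

In this model the two results are identical, and the gap at row key 1 of B simply carries through as a gap in the product.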

2) The data prep issue depends on the answer to #1: potentially there are matrices A, B, C,
… All come from data whose rows are IDed by external User IDs. The full set of these IDs
defines a row cardinality for all matrices. The total number of Mahout row keys will be the
number of unique User IDs collected across all of them.
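
A rough sketch of that data prep, again in plain Python and not the actual Mahout reader code. The helper names build_id_dictionary and to_rows and the sample user IDs are hypothetical. One dictionary is built over all inputs, so every matrix shares the same Int row keys and the same nrow.

```python
def build_id_dictionary(*datasets):
    """Assign sequential Int keys to external IDs in first-seen order."""
    ids = {}
    for data in datasets:
        for user_id in data:
            # Only IDs not seen before get the next free key.
            ids.setdefault(user_id, len(ids))
    return ids


def to_rows(data, ids):
    """Re-key one dataset's rows by the shared Int row keys."""
    return {ids[user_id]: row for user_id, row in data.items()}


# Hypothetical inputs keyed by external User ID.
a_data = {"u1": {0: 1.0}, "u2": {1: 1.0}}
b_data = {"u3": {0: 1.0}, "u1": {2: 1.0}}

ids = build_id_dictionary(a_data, b_data)
nrow = len(ids)                 # shared row cardinality for A and B
a_rows = to_rows(a_data, ids)   # keys 0 and 1; no row for key 2
b_rows = to_rows(b_data, ids)   # keys 0 and 2; gap at key 1
```

Note that the gap in b_rows falls in the middle of the key sequence, not at the end, which is the case that matters for the rbind discussion.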

If the answer to #1 is “yes, you must have at least n => {} for every sequential row key
0 through nrow”, then A, B, C, and so on will need to have the Int row keys inserted at
all points in the matrices where no data for the external ID was seen. This implies reading
them in as a unit. Rbind cannot do this after each matrix has been read in, since the row key
gaps may not all be at the end of a matrix.

If the answer to #1 is that a nonexistent row key (a gap in the sequence) is exactly the
same as having n => {} in the rdd, then changing only the row cardinality of all matrices to
match the total number of IDs seen will produce the correct result. If rbind with
drmParallelizeEmpty can be used to change only the cardinality, then it may work.

I’ll keep poking at #1 but would love a definitive answer.