mahout-dev mailing list archives

From Pat Ferrel <pat.fer...@gmail.com>
Subject Re: Problem of dimensions
Date Mon, 21 Jul 2014 22:46:43 GMT
And the conversion to Matrix instantiates the new rows, so why not the conversion to Dense?

On Jul 21, 2014, at 3:41 PM, Anand Avati <avati@gluster.org> wrote:

On Mon, Jul 21, 2014 at 3:35 PM, Pat Ferrel <pat@occamsmachete.com> wrote:

> If you do drm.plus(1), this converts to a dense matrix, which is what the
> result must be anyway, and does add the scalar to all rows, even missing
> ones.
> 
> 
Pat, I mentioned this in my previous email already. drm.plus(1) completely
misses the point. It converts the DRM into an in-core matrix and applies the
plus() method on Matrix. The result is a Matrix, not a DRM.

drm.plus(1) is EXACTLY the same as:

val m: Matrix = drm.collect
m.plus(1)

The implicit def drm2InCore() syntactic sugar is probably turning out to be
dangerous in cases like this: it hints at the wrong meaning.
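
For context, a minimal sketch of what such an implicit conversion looks like,
approximating the drm2InCore sugar discussed above (an assumption for
illustration, not a verbatim copy of Mahout's definition):

import org.apache.mahout.math.Matrix
import org.apache.mahout.math.drm.DrmLike

// Sketch: an implicit def that silently collects the distributed matrix
// to the driver. With this in scope, drm.plus(1) type-checks by first
// converting the DRM to an in-core Matrix, then applying Matrix.plus(),
// so the whole computation runs in-core, not distributed.
implicit def drm2InCore[K](drm: DrmLike[K]): Matrix = drm.collect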

Thanks





> On Jul 21, 2014, at 3:23 PM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
> 
> perhaps just compare row count with max(key)? that's exactly what lazy
> nrow() currently does in this case.
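
A minimal sketch of that check, assuming the rows live in a Spark
RDD[(Int, Vector)]; the helper and its name are illustrative:

import org.apache.spark.rdd.RDD
import org.apache.mahout.math.Vector

// Hypothetical helper: an int-keyed dataset has missing rows when the
// physical row count is smaller than max(key) + 1.
def hasMissingRows(rows: RDD[(Int, Vector)]): Boolean = {
  val n = rows.count()
  val maxKey = rows.map(_._1).max()
  n < maxKey + 1
}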
> 
> 
> On Mon, Jul 21, 2014 at 3:21 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
> wrote:
> 
>> 
>> ok. so it should be easy to fix at least everything but elementwise
>> scalar, i guess.
>> 
>> Since the notion of "missing rows" is only defined for int-keyed
>> datasets, ew scalar technically should work for non-int-keyed datasets
>> already.
>> 
>> as for int-keyed datasets, i am not sure what the best strategy is.
>> Obviously, one can define a sort of normalization/validation routine for
>> int-keyed datasets, but it would be fairly expensive to run "just
>> because". Perhaps there's a cheap test (as cheap as a row count job) to
>> run for int key consistency when the matrix is first created.
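
A hedged sketch of what such a normalization routine might look like, again
assuming RDD[(Int, Vector)] rows; the helper is an illustration, not existing
Mahout API:

import org.apache.spark.rdd.RDD
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}

// Hypothetical normalization: materialize every key in 0 until max(key) + 1,
// filling absent rows with empty sparse vectors via a left outer join.
def fillMissingRows(rows: RDD[(Int, Vector)], ncol: Int): RDD[(Int, Vector)] = {
  val nrow = rows.map(_._1).max() + 1
  val allKeys = rows.sparkContext.parallelize(0 until nrow).map(k => (k, ()))
  allKeys.leftOuterJoin(rows).map { case (k, (_, maybeRow)) =>
    (k, maybeRow.getOrElse(new RandomAccessSparseVector(ncol)): Vector)
  }
}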
>> 
>> 
>> 
>> On Mon, Jul 21, 2014 at 3:12 PM, Anand Avati <avati@gluster.org> wrote:
>> 
>>> 
>>> 
>>> 
>>> On Mon, Jul 21, 2014 at 3:08 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>> wrote:
>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, Jul 21, 2014 at 3:06 PM, Anand Avati <avati@gluster.org>
> wrote:
>>>> 
>>>>> Dmitriy, comments inline -
>>>>> 
>>>>> On Jul 21, 2014, at 1:12 PM, Dmitriy Lyubimov <dlieu.7@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> And no, i suppose it is ok to have "missing" rows even in the case of
>>>>>> int-keyed matrices.
>>>>>> 
>>>>>> there's one thing that you probably should be aware of in this context
>>>>>> though: many algorithms don't survive empty (row-less) partitions, in
>>>>>> whatever way they may come to be. Other than that, I don't feel every
>>>>>> row must be present -- even if there's an implied order of the rows.
>>>>>> 
>>>>> 
>>>>> I'm not sure if that is necessarily true. There are three operators
>>>>> which break pretty badly with missing rows.
>>>>> 
>>>>> AewScalar - an operation like A + 1 is just not applied to the missing
>>>>> rows, so the final matrix will have 0's in place of 1's.
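
A minimal sketch of that failure mode, assuming RDD[(Int, Vector)] rows;
names are illustrative:

// The scalar op maps over only the rows that physically exist, so a key
// absent from `rows` stays absent: its implied zeros never become 1's.
val plusOne: RDD[(Int, Vector)] = rows.map { case (k, v) => (k, v.plus(1.0)) }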
>>>>> 
>>>> 
>>>> Indeed. i have no recourse at this point.
>>>> 
>>>> 
>>>>> 
>>>>> AewB, CbindAB - the function after cogroup() throws an exception if a
>>>>> row is present in only one matrix. So I guess it is OK to have missing
>>>>> rows as long as both A and B have the exact same missing row set.
>>>>> Somewhat quirky/nuanced requirement.
>>>>> 
>>>> 
>>>> Agree. i actually was not aware that's the cogroup() semantics in spark.
>>>> I thought it would have outer join semantics (as in Pig, i believe).
>>>> Alas, no recourse at this point either.
>>>> 
>>> 
>>> The exception is actually during reduceLeft after cogroup(). Cogroup()
>>> itself is probably an outer-join.
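
A minimal sketch of that behavior, assuming two RDD[(Int, Vector)] inputs;
the surrounding names are illustrative:

// cogroup() has outer-join semantics: a key present in only one input
// yields an empty Iterable on the other side, and reduceLeft over an
// empty Iterable throws UnsupportedOperationException.
val summed = rowsA.cogroup(rowsB).map { case (k, (as, bs)) =>
  val a = as.reduceLeft(_ plus _) // throws if key k is missing from rowsA
  val b = bs.reduceLeft(_ plus _) // throws if key k is missing from rowsB
  (k, a.plus(b))
}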
>>> 
>>> 
>>> 
>> 
> 
> 

