mahout-user mailing list archives

From Dmitriy Lyubimov <dlie...@gmail.com>
Subject Re: Samsara's learning curve
Date Wed, 29 Mar 2017 16:59:39 GMT
One more word on row labels.

It seems that the historical DRM interpretation of row keys (as indices vs.
labels) has been a bit unfortunate.

But in the end it turned out that it often has some strange synergy. E.g., if
you compute a big SVD,

val (U, V, s) = dssvd(A, ...)

then it doesn't matter whether the rows of A are labeled by strings or by
their ordinal Int indices. It is all transparent to the underlying pipeline.
All it means is that matrix U will have the same type of keys and the same
semantics as the keys of A (e.g., either document labels of String type or
matrix row indices of Int type). Moreover, not only is dssvd's user-facing
API oblivious of the key type of A, but it turns out its implementation is
oblivious of the true semantics of the row keys of A as well.

This mostly comes down to the simple notion that the self-squared product
A'A is logically oblivious of the row index type as well, and that any
matrix A inside the optimization plan can actually be formed as A' if
needed, as long as it doesn't hit an optimization barrier (i.e., it is
collected or saved).
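
The point about A'A can be illustrated with a minimal plain-Scala sketch.
This is not the Mahout DRM API: the `Row` type, the `gramian` helper, and
the sample keys are all hypothetical names made up for the illustration.
It only shows that summing the outer products of the rows never reads the
row key, so the result is the same for Int indices and String labels.

```scala
// Plain-Scala model of a row-keyed matrix. NOT the Mahout DRM API;
// all names here are hypothetical and exist only for this sketch.
object GramianSketch {
  // A row is a (key, values) pair; K is the row key type.
  type Row[K] = (K, Array[Double])

  // Computes A'A as the sum of outer products of the rows.
  // The key of each row is ignored, so the result does not depend on K.
  def gramian[K](rows: Seq[Row[K]]): Array[Array[Double]] = {
    val n = rows.headOption.map(_._2.length).getOrElse(0)
    val ata = Array.ofDim[Double](n, n)
    for ((_, v) <- rows; i <- 0 until n; j <- 0 until n)
      ata(i)(j) += v(i) * v(j)
    ata
  }

  // The same two rows, keyed once by ordinal Int index and once by String label.
  val byIndex: Seq[Row[Int]] =
    Seq(0 -> Array(1.0, 2.0), 1 -> Array(3.0, 4.0))
  val byLabel: Seq[Row[String]] =
    Seq("doc-a" -> Array(1.0, 2.0), "doc-b" -> Array(3.0, 4.0))

  def main(args: Array[String]): Unit = {
    val g1 = gramian(byIndex)
    val g2 = gramian(byLabel)
    // Print A'A, then whether both keyings produced the same matrix.
    println(g1.map(_.mkString(" ")).mkString("\n"))
    println(g1.map(_.toSeq).toSeq == g2.map(_.toSeq).toSeq)
  }
}
```

In this toy model the keys matter only for results that keep one row per
input row, which is analogous to U inheriting the keys of A.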


On Wed, Mar 29, 2017 at 9:37 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:

>
>
> On Wed, Mar 29, 2017 at 9:26 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>
>>
>> The other missing bit is dataframes. R and Spark have them in different
>> forms but Mahout largely ignores the issue of real world object ids.
>
>
> Mahout only supports matrices and vectors, not data frames.
>
> Data frames imply a mix of various types of data which has yet to be
> converted to numerical data to be consumed by an algebraic algorithm (in R,
> usually done via a formula). Unfortunately, Mahout has no extension for
> formulas. As for data frames, native data frames (e.g., Spark data frames
> specifically) usually work reasonably well for vectorization of
> non-numerical data.
>
> Distributed matrices indeed do not support column labels, and row
> labels are quasi-supported, meaning they share their label nature with the
> unordered row index for transposition purposes, i.e., one can either have
> row labels and limited transposition semantics, or one can have integer
> labels interpreted as a column index for transposition purposes, but not
> both.
>
> Another way is to use Mahout NamedVectors for the purposes of row
> labeling, but this is not supported consistently in any given elementary
> solver.
>
>
>>
>>
