One more word on row labels.

It seems like the historical DRM interpretation of row keys (as indexes vs.
labels) has been a bit unfortunate.
But in the end it turned out that it often has some strange synergy. E.g., if
you compute a big SVD,

    val (U, V, s) = dssvd(A, ...)

then it doesn't matter whether the rows of A are labeled by strings or by
their ordinal Int indices. It is all transparent to the underlying pipeline.
All it means is that matrix U will have the same key type and the same key
semantics as A (e.g., either document labels of String type, or a matrix row
index of Int type). Moreover, not only is dssvd's user-facing API oblivious
of the key type of A, but it turns out its implementation is oblivious of
the true semantics of the row keys of A as well.

This mostly comes down to the simple notion that the self-square A'A is
logically oblivious of the row index type as well, and that any matrix A
inside an optimization plan can actually be formed as A' if needed, as long
as it doesn't hit an optimization barrier (i.e., it is collected or saved).
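
To illustrate why A'A does not care about row keys: A'A is just the sum of the outer products of the rows of A, so neither the key type nor the row order enters the result. Below is a minimal plain-Scala sketch of that idea. It deliberately uses no Mahout classes; `GramKeyDemo`, `gram`, `byLabel` and `byIndex` are hypothetical names made up for this example, and this is not how Mahout computes A'A internally.

```scala
// Plain-Scala sketch (no Mahout dependencies); all names are hypothetical.
object GramKeyDemo {
  type Row = Array[Double]

  // A'A computed as the sum of outer products of the rows.
  // The row keys of type K are carried along but never inspected.
  def gram[K](rows: Seq[(K, Row)]): Array[Array[Double]] = {
    val n = rows.head._2.length
    val g = Array.ofDim[Double](n, n)
    for ((_, r) <- rows; i <- 0 until n; j <- 0 until n) g(i)(j) += r(i) * r(j)
    g
  }

  // Rows keyed by String labels (e.g., document ids).
  val byLabel: Seq[(String, Row)] = Seq(
    "doc-a" -> Array(1.0, 2.0),
    "doc-b" -> Array(3.0, 4.0),
    "doc-c" -> Array(5.0, 6.0))

  // The same rows keyed by ordinal Int index, listed in a different order.
  val byIndex: Seq[(Int, Row)] = Seq(
    2 -> Array(5.0, 6.0),
    0 -> Array(1.0, 2.0),
    1 -> Array(3.0, 4.0))

  def main(args: Array[String]): Unit = {
    val g1 = gram(byLabel).map(_.toSeq).toSeq
    val g2 = gram(byIndex).map(_.toSeq).toSeq
    // Both keyings yield the same 2 x 2 matrix.
    println(g1 == g2)
  }
}
```

Since the keys never take part in the arithmetic, anything built only on top of A'A can stay agnostic of what the row keys of A actually mean.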
On Wed, Mar 29, 2017 at 9:37 AM, Dmitriy Lyubimov <dlieu.7@gmail.com> wrote:
>
>
> On Wed, Mar 29, 2017 at 9:26 AM, Pat Ferrel <pat@occamsmachete.com> wrote:
>
>>
>> The other missing bit is dataframes. R and Spark have them in different
>> forms, but Mahout largely ignores the issue of real-world object ids.
>
>
> Mahout only supports matrices and vectors, not data frames.
>
> Data frames imply a mix of various types of data which is yet to be
> converted to numerical data to be consumed by an algebraic algorithm (in R,
> usually done via a formula). Unfortunately, Mahout has no extension for
> formulas. As for data frames, native data frames (e.g., Spark data frames
> specifically) usually work reasonably well for vectorization of
> non-numerical data.
>
> Distributed matrices indeed do not support column labels, and row
> labels are quasi-supported, meaning they share their label nature with the
> unordered row index for transposition purposes, i.e., one can either have
> row labels and limited transposition semantics, or one can have integer
> labels interpreted as a column index for transposition purposes, but not both.
>
> Another way is to use Mahout NamedVectors for the purposes of row
> labeling, but this is not supported consistently in any given elementary
> solver.
>
>
>>
>>
