commons-user mailing list archives

From "Ted Dunning" <ted.dunn...@gmail.com>
Subject Re: [math] Fwd: [jira] Commented: (MAHOUT-65) Add Element Labels to Vectors and Matrices
Date Wed, 22 Oct 2008 02:13:30 GMT
The primary requirement is the ability to persist/transfer matrices with
some matrix, row, column and cell level attributes.
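
A minimal sketch of what such a format could look like, assuming JSON as the concrete syntax and keeping labels separate from the numeric array, as Luc suggests further down the thread. The function names and the JSON keys (`attributes`, `rowLabels`, `columnLabels`, `data`) are hypothetical, not part of commons-math or Mahout. Cell-level attributes are left out, and missing labels default to 1-based indices in string form.

```python
import json


def default_labels(n):
    """Fallback labels: 1-based indices as strings, so labels are never optional."""
    return [str(i) for i in range(1, n + 1)]


def matrix_to_json(data, row_labels=None, col_labels=None, attributes=None):
    """Serialize a dense matrix (list of equal-length rows) with its labels."""
    n_rows = len(data)
    n_cols = len(data[0]) if data else 0
    if any(len(row) != n_cols for row in data):
        raise ValueError("all rows must have the same length")
    rows = list(row_labels) if row_labels is not None else default_labels(n_rows)
    cols = list(col_labels) if col_labels is not None else default_labels(n_cols)
    if len(rows) != n_rows or len(cols) != n_cols:
        raise ValueError("label counts must match the matrix shape")
    doc = {
        "attributes": dict(attributes or {}),  # matrix-level meta-data
        "rowLabels": rows,
        "columnLabels": cols,
        "data": [[float(x) for x in row] for row in data],  # plain array of numbers
    }
    return json.dumps(doc, sort_keys=True)


def matrix_from_json(text):
    """Inverse of matrix_to_json: returns (data, row_labels, col_labels, attributes)."""
    doc = json.loads(text)
    return doc["data"], doc["rowLabels"], doc["columnLabels"], doc["attributes"]
```

A reader that ignores labels can use only the `data` array, which keeps the attributes transparent to code that does not need them.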

This relates to the Mahout project, which involves parallelizing
various programs, some of which transfer matrices around.
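
The label-aware matrix product mentioned in the quoted message below could be sketched as follows. This is an illustration only: `labeled_product` is a hypothetical helper, labels are assumed unique, and a label present on only one side is assumed to contribute nothing to the sum.

```python
def labeled_product(a, a_rows, a_cols, b, b_rows, b_cols):
    """Multiply a by b, pairing a's columns with b's rows by label, not position.

    a and b are lists of rows; *_rows and *_cols are their label lists.
    Returns (product, row_labels, col_labels).
    """
    # Position of each label among b's rows.
    b_row_index = {label: i for i, label in enumerate(b_rows)}
    # Pairs (column position in a, row position in b) that share a label.
    shared = [
        (k, b_row_index[label])
        for k, label in enumerate(a_cols)
        if label in b_row_index
    ]
    product = [
        [sum(a[i][ka] * b[kb][j] for ka, kb in shared) for j in range(len(b_cols))]
        for i in range(len(a_rows))
    ]
    return product, list(a_rows), list(b_cols)
```

Because the pairing is done inside the product, the caller does not have to intersect the label tables or permute either matrix beforehand.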

On Tue, Oct 21, 2008 at 1:53 PM, Rory Winston <rory.winston@gmail.com> wrote:

> What are the requirements here? Obviously something like XML/JSON provide
> readability, numerous parser implementations across languages, and
> relatively easy extensibility, at the cost of increased verbosity and
> reduced performance. Features like row/column labels would be straightforward -
> there are also numerous potential ways to represent sparse matrices, and
> attaching metadata to a matrix is no problem. But this may have a tendency
> to incur a certain amount of bloat.
>
> Rory
>
>
> luc.maisonobe@free.fr wrote:
>
>> Well, in this simple situation, I agree it would be interesting to have
>> an external representation format with row/column labels.
>>
>> I have only one suggestion: if a system (for example commons-math) does not
>> provide the labels, there should be a default value generated in the text
>> representation. An obvious candidate is the index (starting from 1)
>> in string format. The idea is to keep things simple by avoiding optional
>> parts.
>>
>> Luc
>>
>> ----- Original Message -----
>> From: "Ted Dunning" <ted.dunning@gmail.com>
>> To: "Commons Users List" <user@commons.apache.org>
>> Sent: Tuesday, 21 October 2008 12:31:54 GMT +01:00 Amsterdam / Berlin /
>> Bern / Rome / Stockholm / Vienna
>> Subject: Re: [math] Fwd: [jira] Commented: (MAHOUT-65) Add Element Labels to
>> Vectors and Matrices
>>
>> I should provide more context.
>>
>> The meta-data (attributes) that are envisioned here are typically going to
>> be row and column labels.  This is extremely helpful for applications
>> such as recommendation engines.  It is entirely possible to hold this
>> meta-data outside the matrices, but it is very useful to keep it inside so
>> that, for example, label-aware matrix products can be implemented without
>> having to externally intersect label tables and permute the matrices in
>> order to make a normal product work correctly.
>>
>> Meta-data indicating things such as bandedness or sparsity are not part of
>> this use case, but that is viable matrix level meta-data as well.
>>
>> On Tue, Oct 21, 2008 at 1:54 AM, <luc.maisonobe@free.fr> wrote:
>>
>>
>>
>>> I am a little puzzled by this topic.
>>>
>>> On the one hand, I tried many times to do such things and always failed. I
>>> now really think persistence/serialization/transfer/interoperability is a
>>> complex task by itself that is completely out of scope for very low-level
>>> components. It already belongs to a middle-level layer (did I say
>>> middleware?).
>>>
>>> There are many different use cases for data storage/transfer, from an
>>> internal algorithm representation to something more external or more
>>> long-lived. In some cases, basic data will be enough and its meaning will
>>> already be known to both sides of the communication, so meta-data will be
>>> cumbersome. In other cases meta-data are a great improvement (think matrix
>>> shapes or non-null elements in sparse cases) but data can still be
>>> exchanged without them. In still other cases meta-data are mandatory. Also,
>>> nobody will agree on what meta-data should contain.
>>>
>>> For these reasons, I tend to promote a separated approach: low-level layers
>>> provide access to basic information (both data and things that could be
>>> considered from outside as meta-data) through their API (getEntry(i, j),
>>> getRowDimension(), isTriangular() ...) and a dedicated project in the
>>> middle-level layer uses it for externalization. This project would already
>>> be difficult enough.
>>>
>>> On the other hand, if the matrix/vector case can be handled simply and if
>>> an almost general representation can already be adopted for several use
>>> cases, then it could be interesting to use it even in low-level libraries.
>>> In this case, I think either XML or JSON would be nice. I personally prefer
>>> XML, but this really is not the point. Once again, in this case I would
>>> avoid binding data and meta-data too deeply. This would allow simple
>>> implementations and would be easier to extend if we want. For example, a
>>> dense matrix would have a structure that is a simple big array of numbers,
>>> with the column labels either above or below but not mixed within the array.
>>>
>>> I'm not sure this comment answers your question though.
>>>
>>> Luc
>>>
>>> ----- Original Message -----
>>> From: "Ted Dunning" <ted.dunning@gmail.com>
>>> To: "Commons Users List" <user@commons.apache.org>
>>> Sent: Tuesday, 21 October 2008 07:34:42 GMT +01:00 Amsterdam / Berlin /
>>> Bern / Rome / Stockholm / Vienna
>>> Subject: [math] Fwd: [jira] Commented: (MAHOUT-65) Add Element Labels to
>>> Vectors and Matrices
>>>
>>>
>>>
>>>
>>> Luc and other commons math folk:
>>>
>>> Do you guys have opinions about serialization formats for matrices (both
>>> dense and sparse, both with and without row, column and cell attributes)?
>>>
>>>
>>> ---------- Forwarded message ----------
>>> From: Jeff Eastman < jdog@windwardsolutions.com >
>>> Date: Mon, Oct 20, 2008 at 10:03 PM
>>> Subject: Re: [jira] Commented: (MAHOUT-65) Add Element Labels to Vectors
>>> and Matrices
>>> To: mahout-dev@lucene.apache.org
>>>
>>>
>>>
>>> Ted Dunning wrote:
>>>
>>>
>>> I see what you mean.
>>>
>>> To repeat in other words, the problems that need to be solved are:
>>>
>>> a) there are many uses already so adding attributes should be transparent to
>>> those who don't use them
>>>
>>> b) the encoding should not be ad hoc because this would be our second ad hoc
>>> encoding and only one should ever be allowed before using a standard
>>>
>>> +1
>>>
>>>
>>>
>>> So here is a (kind of) concrete proposal:
>>>
>>> a) use JSON or Thrift for concrete syntax
>>>
>>> Any preferences here? This might also impact other Mahout packages in the
>>> future, so everybody please weigh in. In general, it seems that having a
>>> common, public encoding for matrix and vector data would help users mix and
>>> match the Mahout services. What are the requirements of these other
>>> services? From inspection, it looks like only the clustering packages use
>>> them currently.
>>>
>>> Jeff
>>>
>>>
>>>
>>> --
>>> ted
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
>>> For additional commands, e-mail: user-help@commons.apache.org
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
>
>


-- 
ted