From: Silvère Lestang
To: user@cassandra.apache.org
Cc: "osishkin@gmail.com"
Date: Mon, 4 Jul 2011 10:48:57 +0200
Subject: Re: Multi-type column values in single CF
In-Reply-To: <7506C99D83A0A54F8127A4931F6CF0B004DBD7BD@IE2RD2XVS531.red002.local>
References: <7506C99D83A0A54F8127A4931F6CF0B004DBD7BD@IE2RD2XVS531.red002.local>
Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=20cf3071ca241e1e0504a73a7021

--20cf3071ca241e1e0504a73a7021
Content-Type: text/plain; charset=ISO-8859-1

We do pretty much the same thing here: dynamic columns with a timestamp as the column name and a different value type for each row.
We use the serialization/deserialization classes provided with Hector and store the type of the value in the row key. Example row keys:
"b6c8a1e7281761e62230ea76daa3d841#INT" => every value is an Integer
"7f30a6a2bbb1b921afc8216d8c5d9257#DOUBLE" => every value is a Double
....
If I had to do it again, I would try to use (Dynamic)CompositeType for the value, or an equivalent mechanism as suggested by Roland.
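
A minimal sketch of the row-key scheme described above, written in Python rather than with Hector's Java serializers. The `CODECS` table and the helpers `make_row_key` and `decode_row` are hypothetical names for illustration, and `struct` stands in for whatever serializer is actually used:

```python
import struct

# Hypothetical codecs (assumption): type tag -> (encode, decode) for one value.
CODECS = {
    "INT": (lambda v: struct.pack(">i", v), lambda b: struct.unpack(">i", b)[0]),
    "DOUBLE": (lambda v: struct.pack(">d", v), lambda b: struct.unpack(">d", b)[0]),
}


def make_row_key(row_id, type_tag):
    """Append the value type to the row id, e.g. 'b6c8...#INT'."""
    if type_tag not in CODECS:
        raise ValueError(f"unsupported type tag: {type_tag}")
    return f"{row_id}#{type_tag}"


def decode_row(row_key, columns):
    """Decode every column value of a row using the type named in its key."""
    _, type_tag = row_key.rsplit("#", 1)
    _, decode = CODECS[type_tag]
    return {name: decode(raw) for name, raw in columns.items()}
```

Because the type lives in the row key, every value in a row must share one type; a row cannot mix integers and doubles under this scheme.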

On 3 July 2011 15:07, Roland Gude <roland.gude@yoochoose.com> wrote:
You could do the serialization for all your supported datatypes yourself (many serialization libraries are available, and a pretty thorough benchmark of them can be found here: https://github.com/eishay/jvm-serializers/wiki) and prepend the serialized bytes with an identifier for your datatype.
This would not avoid casting, but it would still perform better than serializing to strings as is done in your example.
Prepending the values with the id seems better to me, because you can be sure that a new insertion to some field overwrites the correct column even if its type has changed.
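
A minimal sketch of prepending a type identifier to the serialized bytes, as suggested above. The one-byte ids, the fixed-width `struct` encodings, and the names `encode_value` and `decode_value` are assumptions chosen for illustration; any serialization library could take the place of `struct`:

```python
import struct

# Hypothetical one-byte type ids (assumption); any stable mapping would do.
TYPE_INT, TYPE_DOUBLE, TYPE_STR = 0x01, 0x02, 0x03


def encode_value(value):
    """Serialize a value and prepend a one-byte type identifier."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; reject it to keep the sketch simple.
        raise TypeError("bool is not supported in this sketch")
    if isinstance(value, int):
        return bytes([TYPE_INT]) + struct.pack(">q", value)
    if isinstance(value, float):
        return bytes([TYPE_DOUBLE]) + struct.pack(">d", value)
    if isinstance(value, str):
        return bytes([TYPE_STR]) + value.encode("utf-8")
    raise TypeError(f"unsupported type: {type(value).__name__}")


def decode_value(raw):
    """Read the type identifier and deserialize the remaining bytes."""
    type_id, payload = raw[0], raw[1:]
    if type_id == TYPE_INT:
        return struct.unpack(">q", payload)[0]
    if type_id == TYPE_DOUBLE:
        return struct.unpack(">d", payload)[0]
    if type_id == TYPE_STR:
        return payload.decode("utf-8")
    raise ValueError(f"unknown type id: {type_id}")
```

Since the type travels with the value, the column name stays the same when a field changes type, so a later write replaces the earlier one instead of creating a second column.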

-----Original Message-----
From: osishkin osishkin [mailto:osishkin@gmail.com]
Sent: Sunday, 3 July 2011 13:52
To: user@cassandra.apache.org
Subject: Multi-type column values in single CF

Hi all,

I need to store column values that are of various data types in a
single column family, i.e. I have column values that are integers,
others that are strings, and maybe more later. All column names are
strings (no comparator problem for me).
The thing is I need to store unstructured data - I do not have fixed
and known-in-advance column names, so I cannot use a fixed static map
for casting the values back to their original type on retrieval from
Cassandra.

My immediate naive thought is to simply prefix every column name with
the type the value needs to be cast back to.
For example, I'll do the following conversion to the columns of some key:
{'attr1': 'val1', 'attr2': 100} ~> {'str_attr1': 'val1', 'int_attr2': '100'}
and only then send it to Cassandra. This way I know what type I should
cast it back to.
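
The conversion above can be sketched as follows. The `PREFIXES` and `CASTS` tables and the function names `to_columns` and `from_columns` are hypothetical, and only the two types from the example are covered:

```python
# Hypothetical prefix tables (assumption) for the two types in the example.
PREFIXES = {int: "int", str: "str"}
CASTS = {"int": int, "str": str}


def to_columns(attrs):
    """Prefix each column name with its value type and stringify the value."""
    return {f"{PREFIXES[type(v)]}_{k}": str(v) for k, v in attrs.items()}


def from_columns(columns):
    """Strip the type prefix and cast each value back to its original type."""
    result = {}
    for name, raw in columns.items():
        # Split at the first underscore only, so attribute names may contain "_".
        prefix, _, attr = name.partition("_")
        result[attr] = CASTS[prefix](raw)
    return result
```

One consequence of this layout is that a field whose type changes ends up under a different column name (for instance `int_attr2` and then `str_attr2`), so the old column is not overwritten.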

But all this casting back and forth on the client side seems to me to
be very bad for performance.
Another option is to split the columns across dedicated column families
with matching validation types - a column family for integer values,
one for strings, one for timestamps, etc.
But that does not seem very efficient either (and worse for any
rollback mechanism), since now I have to perform several get calls on
multiple CFs where once I had only one.

I thought perhaps someone has encountered a similar situation in the
past, and can offer some advice on the best course of action.

Thank you,
Osi



--20cf3071ca241e1e0504a73a7021--