Mailing-List: contact dev-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Received-SPF: pass (athena.apache.org: domain of jbellis@gmail.com designates
 209.85.214.44 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAH3cagNawxcybbAbop9yj2z6zGeOmnrf8T0bmycHGeaTPLgHCA@mail.gmail.com>
References: 
 <CAH3cagN_2Sf-weqgsHEE1Kb6NrET7pPUQmzFZF5hQncKPKGQrQ@mail.gmail.com>
 <CB9A2C94.286AB%bone@alumni.brown.edu>
 <CAH3cagNawxcybbAbop9yj2z6zGeOmnrf8T0bmycHGeaTPLgHCA@mail.gmail.com>
From: Jonathan Ellis <jbellis@gmail.com>
Date: Thu, 29 Mar 2012 14:35:52 -0500
Message-ID: 
 <CALdd-zjfE7eEBBXkC2gJJ_dr5JXX1DiTnWewS1v7-hpLVUVtVw@mail.gmail.com>
Subject: Re: Document storage
To: dev@cassandra.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

I kind of hijacked
https://issues.apache.org/jira/browse/CASSANDRA-3647 for this ("Sylvain
suggests we start with (non-nested) lists, maps, and sets. I agree
that this is a great 80/20 approach to the problem"), but we could
split it out into a separate ticket.
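
To make the idea concrete, here is a minimal sketch of how a document with
non-nested lists, maps, and sets could be flattened into one column per
element, so that a single element can be changed without rewriting the rest.
The flatten() helper and the "field:key" column naming are hypothetical
illustrations, not the design in CASSANDRA-3647 or any Cassandra API, and a
plain dict stands in for a row.

```python
def flatten(doc):
    """Flatten a document with non-nested lists, maps, and sets into columns.

    Hypothetical naming scheme: scalars keep their field name, while each
    collection element becomes its own column named "field:key".
    """
    columns = {}
    for field, value in doc.items():
        if isinstance(value, dict):
            for key, item in value.items():
                columns[f"{field}:{key}"] = item
        elif isinstance(value, list):
            for index, item in enumerate(value):
                # Zero-padded index keeps list elements in order when sorted.
                columns[f"{field}:{index:08d}"] = item
        elif isinstance(value, (set, frozenset)):
            for item in value:
                # A set element lives in the column name; the value is empty.
                columns[f"{field}:{item}"] = ""
        else:
            columns[field] = value
    return columns


row = flatten({
    "name": "Ben",
    "emails": ["ben@example.com", "ben@example.org"],
    "phones": {"home": "555-0100", "work": "555-0101"},
    "tags": {"admin"},
})

# Updating one map entry touches exactly one column, not the whole document.
row["phones:work"] = "555-0199"
```

With this layout, reading or updating part of an object is a read or write
of a few columns rather than a round trip through an opaque blob.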

On Thu, Mar 29, 2012 at 2:24 PM, Ben McCann <ben@benmccann.com> wrote:
> Thanks Jonathan. The only reason I suggested JSON was because it already
> has support for lists. Native support for lists in Cassandra would more
> than satisfy me. Are there any existing proposals or a bug I can follow?
> I'm not familiar with the Cassandra codebase, so I'm not entirely sure how
> helpful I can be, but I'd certainly be interested in taking a look to see
> what's required.
>
> -Ben
>
>
> On Thu, Mar 29, 2012 at 12:19 PM, Brian O'Neill <bone@alumni.brown.edu> wrote:
>
>> Jonathan,
>>
>> I was actually going to take this up with Nate McCall a few weeks back. I
>> think it might make sense to get the client development community together
>> (Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.)
>>
>> I agree whole-heartedly that it shouldn't go into the database for all the
>> reasons you point out.
>>
>> If we can all decide on some standards for data storage (e.g. composite
>> types), indexing strategies, etc., we can provide higher-level functions
>> through the client libraries and also provide interoperability between
>> them (without bloating Cassandra).
>>
>> CCing Nate. Nate, thoughts?
>> I wouldn't mind coordinating/facilitating the conversation, if we know
>> who should be involved.
>>
>> -brian
>>
>> ----
>> Brian O'Neill
>> Lead Architect, Software Development
>> Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
>> p: 215.588.6024 blog: http://weblogs.java.net/blog/boneill42/
>> blog: http://brianoneill.blogspot.com/
>>
>>
>>
>>
>>
>>
>>
>> On 3/29/12 3:06 PM, "Ben McCann" <ben@benmccann.com> wrote:
>>
>> >Jonathan, I asked Brian about his REST API
>> ><https://groups.google.com/forum/?fromgroups#!topic/virgil-users/oncBas9C8Us>
>> >and he said he does not take the json objects and split them because the
>> >client libraries do not agree on implementations. This was exactly my
>> >concern as well with this solution. I would be perfectly happy to do it
>> >this way instead of using JSON if it were standardized. The reason I
>> >suggested JSON is that it is standardized. As far as I can tell, Cassandra
>> >doesn't support maps and lists in a standardized way today, which is the
>> >root of my problem.
>> >
>> >-Ben
>> >
>> >
>> >On Thu, Mar 29, 2012 at 11:30 AM, Drew Kutcharian <drew@venarc.com> wrote:
>> >
>> >> Yes, I meant the "row header index". What I have done is that I'm storing
>> >> an object (i.e. UserProfile) where you read or write it as a whole (a user
>> >> updates their user details in a single page in the UI). So I serialize that
>> >> object into a binary JSON using SMILE format. I then compress it using
>> >> Snappy on the client side. So as far as Cassandra cares it's storing a
>> >> byte[].
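
A minimal sketch of the client-side round trip described above. It assumes
Python's standard json and zlib modules as stand-ins for SMILE and Snappy
(neither is in the standard library), and the UserProfile fields are made up
for illustration. The point is only that the database ends up storing a
single opaque byte string.

```python
import json
import zlib


def to_blob(profile):
    """Serialize a profile dict and compress it into one opaque byte string."""
    # Stand-ins: JSON text instead of SMILE, zlib instead of Snappy.
    encoded = json.dumps(profile, sort_keys=True).encode("utf-8")
    return zlib.compress(encoded)


def from_blob(blob):
    """Reverse to_blob: decompress, then parse back into a dict."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical UserProfile contents, read and written as a whole.
profile = {
    "user": "drew",
    "display_name": "Drew",
    "interests": ["cassandra", "json", "compression"] * 10,
}

blob = to_blob(profile)      # what would be stored in the column value
restored = from_blob(blob)   # the client turns it back into an object
```

Changing any one field means reading and rewriting the whole blob, which is
the trade-off being discussed in this thread.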
>> >>
>> >> Now on the client side, I'm using cassandra-cli with a custom type that
>> >> knows how to turn a byte[] into a JSON text and back. The only issue was
>> >> CASSANDRA-4081 where "assume" doesn't work with custom types. If
>> >> CASSANDRA-4081 gets fixed, I'll get the best of both worlds.
>> >>
>> >> Also, the advantages of this vs. the Thrift-based Super Column families are:
>> >>
>> >> 1. Saving extra CPU usage on the Cassandra nodes, since
>> >> serialize/deserialize and compression/decompression happen on the client
>> >> nodes, where there is plenty of idle CPU time
>> >>
>> >> 2. Saving network bandwidth, since I'm sending over a compressed byte[]
>> >>
>> >>
>> >> -- Drew
>> >>
>> >>
>> >>
>> >> On Mar 29, 2012, at 11:16 AM, Jonathan Ellis wrote:
>> >>
>> >> > On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian <drew@venarc.com> wrote:
>> >> >>> I think this is a much better approach because that gives you the
>> >> >>> ability to update or retrieve just parts of objects efficiently,
>> >> >>> rather than making column values just blobs with a bunch of special
>> >> >>> case logic to introspect them. Which feels like a big step backwards
>> >> >>> to me.
>> >> >>
>> >> >> Unless your access pattern involves reading/writing the whole document
>> >> >> each time. In that case you're better off serializing the whole document
>> >> >> and storing it in a column as a byte[] without incurring the overhead of
>> >> >> column indexes. Right?
>> >> >
>> >> > Hmm, not sure what you're thinking of there.
>> >> >
>> >> > If you mean the "index" that's part of the row header for random
>> >> > access within a row, then no, serializing to byte[] doesn't save you
>> >> > anything.
>> >> >
>> >> > If you mean secondary indexes, don't declare any if you don't want any. :)
>> >> >
>> >> > Just telling C* to store a byte[] *will* be slightly lighter-weight
>> >> > than giving it named columns, but we're talking negligible compared to
>> >> > the overhead of actually moving the data on or off disk in the first
>> >> > place. Not even close to being worth giving up being able to deal
>> >> > with your data from standard tools like cqlsh, IMO.
>> >> >
>> >> > --
>> >> > Jonathan Ellis
>> >> > Project Chair, Apache Cassandra
>> >> > co-founder of DataStax, the source for professional Cassandra support
>> >> > http://www.datastax.com
>> >>
>> >>
>>
>>
>>


-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com