incubator-cassandra-dev mailing list archives

From Brian O'Neill <b...@alumni.brown.edu>
Subject Re: Document storage
Date Thu, 29 Mar 2012 19:19:41 GMT
Jonathan, 

I was actually going to take this up with Nate McCall a few weeks back.  I
think it might make sense to get the client development community together
(Netflix w/ Astyanax, Hector, Pycassa, Virgil, etc.)

I agree whole-heartedly that it shouldn't go into the database for all the
reasons you point out.

If we can all decide on some standards for data storage (e.g. composite
types), indexing strategies, etc., we can provide higher-level functions
through the client libraries and also provide interoperability between
them (without bloating Cassandra).

CCing Nate.  Nate, thoughts?
I wouldn't mind coordinating/facilitating the conversation, if we know
who should be involved.

-brian

---- 
Brian O'Neill
Lead Architect, Software Development
Health Market Science | 2700 Horizon Drive | King of Prussia, PA 19406
p: 215.588.6024
blog: http://weblogs.java.net/blog/boneill42/
blog: http://brianoneill.blogspot.com/

On 3/29/12 3:06 PM, "Ben McCann" <ben@benmccann.com> wrote:

>Jonathan, I asked Brian about his REST API
><https://groups.google.com/forum/?fromgroups#!topic/virgil-users/oncBas9C8Us>
>and he said he does not take the JSON objects and split them because the
>client libraries do not agree on implementations.  This was exactly my
>concern as well with this solution.  I would be perfectly happy to do it
>this way instead of using JSON if it were standardized.  The reason I
>suggested JSON is that it is standardized.  As far as I can tell, Cassandra
>doesn't support maps and lists in a standardized way today, which is the
>root of my problem.
>
>-Ben
>
>
>On Thu, Mar 29, 2012 at 11:30 AM, Drew Kutcharian <drew@venarc.com> wrote:
>
>> Yes, I meant the "row header index". What I have done is that I'm storing
>> an object (i.e. UserProfile) where you read or write it as a whole (a user
>> updates their user details in a single page in the UI). So I serialize that
>> object into a binary JSON using SMILE format. I then compress it using
>> Snappy on the client side. So as far as Cassandra cares it's storing a
>> byte[].
>>
>> Now on the client side, I'm using cassandra-cli with a custom type that
>> knows how to turn a byte[] into a JSON text and back. The only issue was
>> CASSANDRA-4081 where "assume" doesn't work with custom types. If
>> CASSANDRA-4081 gets fixed, I'll get the best of both worlds.
>>
>> Also, the advantages of this vs. the Thrift-based Super Column families are:
>>
>> 1. Saving extra CPU usage on the Cassandra nodes, since
>> serialization/deserialization and compression/decompression happen on the
>> client nodes, where there is plenty of idle CPU time
>>
>> 2. Saving network bandwidth, since I'm sending over a compressed byte[]
>>
>>
>> -- Drew
>>
>>
>>
>> On Mar 29, 2012, at 11:16 AM, Jonathan Ellis wrote:
>>
>> > On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian <drew@venarc.com> wrote:
>> >>> I think this is a much better approach because that gives you the
>> >>> ability to update or retrieve just parts of objects efficiently,
>> >>> rather than making column values just blobs with a bunch of special
>> >>> case logic to introspect them.  Which feels like a big step backwards
>> >>> to me.
>> >>
>> >> Unless your access pattern involves reading/writing the whole document
>> >> each time. In that case you're better off serializing the whole document
>> >> and storing it in a column as a byte[] without incurring the overhead of
>> >> column indexes. Right?
>> >
>> > Hmm, not sure what you're thinking of there.
>> >
>> > If you mean the "index" that's part of the row header for random
>> > access within a row, then no, serializing to byte[] doesn't save you
>> > anything.
>> >
>> > If you mean secondary indexes, don't declare any if you don't want any.
>> > :)
>> >
>> > Just telling C* to store a byte[] *will* be slightly lighter-weight
>> > than giving it named columns, but we're talking negligible compared to
>> > the overhead of actually moving the data on or off disk in the first
>> > place.  Not even close to being worth giving up being able to deal
>> > with your data from standard tools like cqlsh, IMO.
>> >
>> > --
>> > Jonathan Ellis
>> > Project Chair, Apache Cassandra
>> > co-founder of DataStax, the source for professional Cassandra support
>> > http://www.datastax.com
>>
>>
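
The trade-off debated in the quoted thread (one opaque, client-compressed
byte[] per document vs. one named column per field) can be sketched roughly
as follows. This is only an illustration under stated assumptions: the
standard-library json and zlib modules stand in for SMILE and Snappy, plain
dicts stand in for Cassandra rows, and the helper names (to_blob, from_blob,
to_columns) and the sample profile are hypothetical. No Cassandra client API
is used, and the tuple-shaped column names only loosely imitate composite
column names.

```python
import json
import zlib


def to_blob(doc):
    """Serialize the whole document and compress it on the client side.

    json + zlib are stand-ins here for SMILE + Snappy.
    """
    return zlib.compress(json.dumps(doc, sort_keys=True).encode("utf-8"))


def from_blob(blob):
    """Reverse of to_blob: decompress, then parse back into a dict."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


def to_columns(doc, prefix=()):
    """Flatten nested dicts into {path_tuple: leaf_value}.

    Each path tuple plays the role of a composite column name. Lists are
    treated as opaque leaf values in this sketch.
    """
    columns = {}
    for key, value in doc.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            columns.update(to_columns(value, path))
        else:
            columns[path] = value
    return columns


profile = {"name": "Ada", "address": {"city": "London", "zip": "N1"}}

# Blob layout: the store sees a single opaque value.
row_blob = {"profile": to_blob(profile)}

# Changing one field means read, decode, modify, re-encode, write it all back.
doc = from_blob(row_blob["profile"])
doc["address"]["city"] = "Paris"
row_blob["profile"] = to_blob(doc)

# Column layout: one entry per leaf, so a single field can be written alone.
row_cols = to_columns(profile)
row_cols[("address", "city")] = "Paris"
```

The blob layout keeps serialization and compression work on the client, as
Drew describes, while the column layout is what allows updating or reading
part of an object, as Jonathan argues. How lists and maps should map onto
column names is exactly the part Ben notes is not standardized across
clients, so the flattening rule above is just one arbitrary choice.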


