incubator-cassandra-dev mailing list archives

From Drew Kutcharian <d...@venarc.com>
Subject Re: Document storage
Date Thu, 29 Mar 2012 18:30:58 GMT
Yes, I meant the "row header index". What I'm doing is storing an object (e.g. UserProfile)
that is always read or written as a whole (a user updates their details on a single page
in the UI). I serialize that object to binary JSON using the SMILE format, then compress
it with Snappy on the client side. So as far as Cassandra is concerned, it's storing a byte[].
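
A minimal sketch of the serialize-then-compress round trip described above, not the actual client code. SMILE and Snappy are not in the Python standard library, so this sketch substitutes compact JSON for SMILE and zlib for Snappy; the function names pack and unpack and the sample profile are hypothetical.

```python
import json
import zlib


def pack(profile: dict) -> bytes:
    """Serialize the whole object, then compress it into one opaque blob."""
    # Stand-in for SMILE: compact JSON text encoded as UTF-8 (assumption).
    raw = json.dumps(profile, separators=(",", ":"), sort_keys=True).encode("utf-8")
    # Stand-in for Snappy: zlib at its fastest compression level (assumption).
    return zlib.compress(raw, 1)


def unpack(blob: bytes) -> dict:
    """Reverse of pack: decompress, then parse back into an object."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical object that is always read and written as a whole.
profile = {
    "id": 42,
    "name": "Example User",
    "email": "user@example.com",
    "interests": ["cassandra"] * 20,
}

blob = pack(profile)      # this byte string is all the database would see
restored = unpack(blob)   # the client turns it back into the object
```

The value stored in the column is just the result of pack, so the server does no serialization or compression work of its own; the trade-off is that the column is opaque to server-side tooling.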

Now on the client side, I'm using cassandra-cli with a custom type that knows how to turn
a byte[] into JSON text and back. The only issue is CASSANDRA-4081, where "assume" doesn't
work with custom types. If CASSANDRA-4081 gets fixed, I'll get the best of both worlds.

Also, the advantages of this vs. the Thrift-based super column families are:

1. Saving CPU on the Cassandra nodes, since serialization/deserialization and compression/decompression
happen on the client nodes, where there is plenty of idle CPU time.

2. Saving network bandwidth, since I'm sending a compressed byte[] over the wire.


-- Drew



On Mar 29, 2012, at 11:16 AM, Jonathan Ellis wrote:

> On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian <drew@venarc.com> wrote:
>>> I think this is a much better approach because that gives you the
>>> ability to update or retrieve just parts of objects efficiently,
>>> rather than making column values just blobs with a bunch of special
>>> case logic to introspect them.  Which feels like a big step backwards
>>> to me.
>> 
>> Unless your access pattern involves reading/writing the whole document each time.
>> In that case you're better off serializing the whole document and storing it in a column as
>> a byte[] without incurring the overhead of column indexes. Right?
> 
> Hmm, not sure what you're thinking of there.
> 
> If you mean the "index" that's part of the row header for random
> access within a row, then no, serializing to byte[] doesn't save you
> anything.
> 
> If you mean secondary indexes, don't declare any if you don't want any. :)
> 
> Just telling C* to store a byte[] *will* be slightly lighter-weight
> than giving it named columns, but we're talking negligible compared to
> the overhead of actually moving the data on or off disk in the first
> place.  Not even close to being worth giving up being able to deal
> with your data from standard tools like cqlsh, IMO.
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com

