Mailing-List: contact dev-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
From: Daniel Doubleday <daniel.doubleday@gmx.net>
Content-Type: text/plain; charset=us-ascii
Subject: Re: Document storage
Date: Fri, 30 Mar 2012 18:01:51 +0200
Message-Id: <89A95CBC-F23B-45BD-A512-D7F4DB52E735@gmx.net>
To: dev@cassandra.apache.org
Mime-Version: 1.0 (Apple Message framework v1084)

> Just telling C* to store a byte[] *will* be slightly lighter-weight
> than giving it named columns, but we're talking negligible compared to
> the overhead of actually moving the data on or off disk in the first
> place.
Hm - but isn't this exactly the point? You don't want to move data off
disk.
But decomposing into columns will lead to more of that:

- The total amount of serialized data is (in most cases a lot) larger
  than the protobuffed / compressed version.
- If you do selective updates, the document will be scattered over
  multiple SSTables. And if you do sliced reads, you can't optimize
  them, whereas the single-column version, when updated, automatically
  supersedes older versions, so most reads will hit only one SSTable.
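
A toy sketch of the second point, not Cassandra code: each "SSTable" is
modeled as a plain dict, and every update is assumed to be flushed to
its own SSTable. Compaction, memtables and bloom filters are ignored,
and all names here (flush_updates, sstables_touched,
newest_first_blob_read) are made up for illustration.

```python
# Toy model only: an "SSTable" is a dict of column name -> (timestamp, value).
# Compaction, memtables and bloom filters are deliberately not modeled.

def flush_updates(updates):
    """Assume every update is flushed into its own SSTable."""
    return [dict(update) for update in updates]

def sstables_touched(sstables, wanted_columns):
    """Count the SSTables holding any version of a wanted column.

    For a slice over decomposed columns the reader has to merge all of
    them, because any SSTable may hold the newest version of some field.
    """
    return sum(1 for sst in sstables if any(c in sst for c in wanted_columns))

def newest_first_blob_read(sstables, column):
    """Scan newest first and stop at the first hit.

    Stopping early is valid because each write of the single column
    fully supersedes every older version.
    """
    touched = 0
    for sst in reversed(sstables):
        touched += 1
        if column in sst:
            return touched, sst[column]
    return touched, None

# Layout A: document decomposed into columns, fields updated selectively.
decomposed = flush_updates([
    {"name": (1, "Ann"), "email": (1, "ann@example.org"), "city": (1, "Berlin")},
    {"email": (2, "ann@example.com")},
    {"city": (3, "Hamburg")},
])

# Layout B: whole document in one column, rewritten on every update.
blob = flush_updates([
    {"doc": (1, b"v1")},
    {"doc": (2, b"v2")},
    {"doc": (3, b"v3")},
])

decomposed_reads = sstables_touched(decomposed, ["name", "email", "city"])
blob_reads, latest = newest_first_blob_read(blob, "doc")
print(decomposed_reads, blob_reads)
```

In this model the decomposed read merges all three SSTables, while the
blob read stops after the newest one.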

All these reads make up the hot dataset. If it fits in the page cache,
you're fine. If it doesn't, you need to buy more iron.
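
A rough back-of-the-envelope sketch of the size argument. Everything in
it is an assumption for illustration: the fixed 15-byte per-column
overhead, the use of zlib-compressed JSON as a stand-in for the
protobuffed / compressed version, and the synthetic, deliberately
repetitive sample document. Real ratios depend on your data.

```python
import json
import zlib

# Assumed fixed per-column overhead in bytes (name length, flags,
# timestamp, value length). Illustrative constant, not from the thread.
PER_COLUMN_OVERHEAD = 15

def decomposed_size(doc):
    """Serialized size when every field is stored as its own column."""
    return sum(
        PER_COLUMN_OVERHEAD + len(name.encode("utf-8")) + len(str(value).encode("utf-8"))
        for name, value in doc.items()
    )

def blob_size(doc, column_name="doc"):
    """Serialized size when the whole document is one compressed column."""
    payload = zlib.compress(json.dumps(doc, sort_keys=True).encode("utf-8"))
    return PER_COLUMN_OVERHEAD + len(column_name.encode("utf-8")) + len(payload)

def fits_page_cache(doc_count, bytes_per_doc, page_cache_bytes):
    """True if the hot dataset (doc_count * bytes_per_doc) fits in the cache."""
    return doc_count * bytes_per_doc <= page_cache_bytes

# Synthetic document: 50 fields with highly repetitive values.
sample_doc = {"field_%02d" % i: "x" * 100 for i in range(50)}

per_column = decomposed_size(sample_doc)
as_blob = blob_size(sample_doc)
print(per_column, as_blob)
```

The smaller the per-document footprint, the more documents fit in the
same page cache before reads start going to disk.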

Really could not resist, because your statement seems to be contrary to
all our tests / learnings.

Cheers,
Daniel

From dev list:

Re: Document storage
On Thu, Mar 29, 2012 at 1:11 PM, Drew Kutcharian <drew@venarc.com> wrote:
>> I think this is a much better approach because that gives you the
>> ability to update or retrieve just parts of objects efficiently,
>> rather than making column values just blobs with a bunch of special
>> case logic to introspect them.  Which feels like a big step backwards
>> to me.
>
> Unless your access pattern involves reading/writing the whole document
> each time. In that case you're better off serializing the whole
> document and storing it in a column as a byte[] without incurring the
> overhead of column indexes. Right?

Hmm, not sure what you're thinking of there.

If you mean the "index" that's part of the row header for random
access within a row, then no, serializing to byte[] doesn't save you
anything.

If you mean secondary indexes, don't declare any if you don't want any. :)

Just telling C* to store a byte[] *will* be slightly lighter-weight
than giving it named columns, but we're talking negligible compared to
the overhead of actually moving the data on or off disk in the first
place.  Not even close to being worth giving up being able to deal
with your data from standard tools like cqlsh, IMO.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com