incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stu Hood" <stuart.h...@rackspace.com>
Subject Re: Symbolizing column names for storage and cache efficiency
Date Sun, 26 Jul 2009 22:03:46 GMT
> blobs through LZW or similar. Aggregation operations benefit in
> particular because you can often never even bother to decompress the
> rows.
This is an interesting consideration. Hopefully a suitably flexible implementation of pluggable
codecs would be able to allow for 'keys only' compression transparently. Your aggregation
example could be very optimized in this case.

PS: When I envision SSTable compression, I'm definitely thinking of block level compression,
so each codec would need to implement seek() at a block level. The SSTable index would then
point at the first compressed block containing a key.


-----Original Message-----
From: "Evan Weaver" <eweaver@gmail.com>
Sent: Sunday, July 26, 2009 5:23pm
To: cassandra-dev@incubator.apache.org
Subject: Re: Symbolizing column names for storage and cache efficiency

Re. Jonathan, I haven't run across a row-oriented use case where
symbolizing merely the first 1000 column names seen would not work.

Re. Stu, If generalized compression can cover this case that should be
fine....burn some CPU for a more straightforward implementation.

However, it's often very useful in databases to have transparent
compression (that is, operations can be performed on the data even in
its compressed state). So I would advocate not merely passing the row
blobs through LZW or similar. Aggregation operations benefit in
particular because you can often never even bother to decompress the
rows.

This isn't relevant with current Cassandra, but could be a boon to
in-database stored procedures and the like.

Evan
On Sun, Jul 26, 2009 at 2:11 PM, Stu Hood<stuart.hood@rackspace.com> wrote:
> Also, long term, I think it is safe to assume that we will be adding compression for
ColumnFamilies, which should have similar positive effects on cache-ability without too much
application specific optimization.
>
>
> -----Original Message-----
> From: "Jonathan Ellis" <jbellis@gmail.com>
> Sent: Sunday, July 26, 2009 4:46pm
> To: cassandra-dev@incubator.apache.org
> Subject: Re: Symbolizing column names for storage and cache efficiency
>
> On Sun, Jul 26, 2009 at 2:28 AM, Evan Weaver<eweaver@gmail.com> wrote:
>> Would it be possible to add symbolized column names in a
>> forward-compatible way? Maybe scoped per sstable, with the registries
>> always kept in memory.
>
> Maybe.  But it's not obvious to me how to do this in general.
>
> The problem is the sparse nature of the column set.  We can't encode
> _all_ the columns this way, or in the degenerate case we OOM just
> trying to keep the mapping in memory.  Similarly, we can't encode just
> the top N column names, since figuring out the top N requires keeping
> each name in memory during the counting process.  (Besides slowing
> down compaction -- instead of just deserializing columns where there
> are keys in common in the merged fragments, we have to deserialize
> all.)
>
> ISTM that all we can do is encode the _first_ N column names we see,
> which may be useful if the column name set is small for a given CF.
>
> -Jonathan
>
>
>



-- 
Evan Weaver



Mime
View raw message