incubator-cassandra-dev mailing list archives

From Evan Weaver <ewea...@gmail.com>
Subject Re: Symbolizing column names for storage and cache efficiency
Date Sun, 26 Jul 2009 22:37:40 GMT
As a related example, a columnar database I know uses RLE compression
on value streams because it's transparent. In many use cases this makes
calculating statistical features of a compressed column faster than if
it were uncompressed.

Since we have unpredictable column offsets, column-specific lookups
would benefit from similar transparency considerations.
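[Editor's note: a minimal sketch of the transparency point above, in hypothetical Python rather than anything from a real database. `rle_encode` and `rle_sum_count` are illustrative names; the idea is that sum and count can be computed on the (value, run_length) pairs directly, touching one entry per run instead of one per value.]

```python
def rle_encode(values):
    """Compress a value stream into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def rle_sum_count(runs):
    """Sum and count computed on the compressed form; no decompression."""
    total = sum(v * n for v, n in runs)
    count = sum(n for _, n in runs)
    return total, count

runs = rle_encode([5, 5, 5, 2, 2, 9])   # six values collapse to three runs
total, count = rle_sum_count(runs)
```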

Evan

On Sun, Jul 26, 2009 at 3:03 PM, Stu Hood <stuart.hood@rackspace.com> wrote:
>> blobs through LZW or similar. Aggregation operations benefit in
>> particular because you can often never even bother to decompress the
>> rows.
> This is an interesting consideration. Hopefully a suitably flexible implementation of
> pluggable codecs would be able to allow for 'keys only' compression transparently. Your
> aggregation example could be very optimized in this case.
>
> PS: When I envision SSTable compression, I'm definitely thinking of block-level
> compression, so each codec would need to implement seek() at a block level. The SSTable
> index would then point at the first compressed block containing a key.
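[Editor's note: a hypothetical Python sketch of the block-level scheme Stu describes, not Cassandra's actual on-disk format. Rows are grouped into fixed-size blocks, each block compressed independently, and a sparse index records the first key of every block; a lookup then decompresses only the one block that can contain the key. `build`, `lookup`, and `BLOCK_SIZE` are invented names for illustration.]

```python
import bisect
import json
import zlib

BLOCK_SIZE = 2  # rows per compressed block; tiny for illustration

def build(rows):
    """rows: a key-sorted list of (key, value) pairs."""
    index, blocks = [], []
    for i in range(0, len(rows), BLOCK_SIZE):
        chunk = rows[i:i + BLOCK_SIZE]
        index.append(chunk[0][0])  # first key in this block
        blocks.append(zlib.compress(json.dumps(chunk).encode()))
    return index, blocks

def lookup(index, blocks, key):
    """Decompress only the block whose key range can contain `key`."""
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return None
    chunk = json.loads(zlib.decompress(blocks[i]))
    return dict(map(tuple, chunk)).get(key)

index, blocks = build([("a", 1), ("c", 2), ("e", 3), ("g", 4)])
```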
>
>
> -----Original Message-----
> From: "Evan Weaver" <eweaver@gmail.com>
> Sent: Sunday, July 26, 2009 5:23pm
> To: cassandra-dev@incubator.apache.org
> Subject: Re: Symbolizing column names for storage and cache efficiency
>
> Re. Jonathan, I haven't run across a row-oriented use case where
> symbolizing merely the first 1000 column names seen would not work.
>
> Re. Stu, if generalized compression can cover this case, that should be
> fine; burn some CPU for a more straightforward implementation.
>
> However, it's often very useful in databases to have transparent
> compression (that is, operations can be performed on the data even in
> its compressed state). So I would advocate not merely passing the row
> blobs through LZW or similar. Aggregation operations benefit in
> particular because you can often never even bother to decompress the
> rows.
>
> This isn't relevant with current Cassandra, but could be a boon to
> in-database stored procedures and the like.
>
> Evan
> On Sun, Jul 26, 2009 at 2:11 PM, Stu Hood <stuart.hood@rackspace.com> wrote:
>> Also, long term, I think it is safe to assume that we will be adding compression
>> for ColumnFamilies, which should have similar positive effects on cache-ability
>> without too much application-specific optimization.
>>
>>
>> -----Original Message-----
>> From: "Jonathan Ellis" <jbellis@gmail.com>
>> Sent: Sunday, July 26, 2009 4:46pm
>> To: cassandra-dev@incubator.apache.org
>> Subject: Re: Symbolizing column names for storage and cache efficiency
>>
>> On Sun, Jul 26, 2009 at 2:28 AM, Evan Weaver <eweaver@gmail.com> wrote:
>>> Would it be possible to add symbolized column names in a
>>> forward-compatible way? Maybe scoped per sstable, with the registries
>>> always kept in memory.
>>
>> Maybe.  But it's not obvious to me how to do this in general.
>>
>> The problem is the sparse nature of the column set.  We can't encode
>> _all_ the columns this way, or in the degenerate case we OOM just
>> trying to keep the mapping in memory.  Similarly, we can't encode just
>> the top N column names, since figuring out the top N requires keeping
>> each name in memory during the counting process.  (Besides slowing
>> down compaction -- instead of just deserializing columns where there
>> are keys in common in the merged fragments, we have to deserialize
>> all.)
>>
>> ISTM that all we can do is encode the _first_ N column names we see,
>> which may be useful if the column name set is small for a given CF.
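[Editor's note: a hypothetical Python sketch of the "first N column names" idea above, with invented names (`SymbolTable`, `encode`, `decode`); it is not Cassandra code. The first N distinct names seen get small integer symbols, and anything after that falls back to being stored verbatim, so the in-memory mapping stays bounded.]

```python
class SymbolTable:
    def __init__(self, limit=1000):
        self.limit = limit
        self.to_id = {}    # name -> int symbol
        self.to_name = []  # int symbol -> name

    def encode(self, name):
        """Return an int symbol, or the raw name once the table is full."""
        if name in self.to_id:
            return self.to_id[name]
        if len(self.to_name) < self.limit:
            self.to_id[name] = len(self.to_name)
            self.to_name.append(name)
            return self.to_id[name]
        return name  # table full: store the name uncompressed

    def decode(self, symbol):
        return self.to_name[symbol] if isinstance(symbol, int) else symbol

t = SymbolTable(limit=2)
```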
>>
>> -Jonathan
>>
> --
> Evan Weaver
-- 
Evan Weaver
