incubator-cassandra-dev mailing list archives

From Evan Weaver <ewea...@gmail.com>
Subject Re: Symbolizing column names for storage and cache efficiency
Date Sun, 26 Jul 2009 21:23:37 GMT
Re. Jonathan, I haven't run across a row-oriented use case where
symbolizing merely the first 1000 column names seen would not work.
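
A minimal sketch of the "first N names seen" scheme, for illustration
only. The class name, the per-sstable scoping, and the use of small
integers as symbols are assumptions, not anything that exists in
Cassandra. Once the registry is full, later names are stored verbatim.

```python
class FirstNSymbolTable:
    """Hypothetical registry that symbolizes only the first `limit` distinct column names."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.ids = {}    # column name (bytes) -> small integer symbol
        self.names = []  # symbol -> column name (bytes)

    def encode(self, name):
        """Return the integer symbol for `name`, or `name` itself if the registry is full."""
        sym = self.ids.get(name)
        if sym is not None:
            return sym
        if len(self.names) < self.limit:
            sym = len(self.names)
            self.ids[name] = sym
            self.names.append(name)
            return sym
        return name  # registry full: keep the raw name

    def decode(self, token):
        """Map an integer symbol back to its name; raw names pass through unchanged."""
        return self.names[token] if isinstance(token, int) else token
```

Memory stays bounded by `limit` entries no matter how sparse the
column set is, which is the point of taking the first N rather than
the top N.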

Re. Stu, if generalized compression can cover this case, that should be
fine: burn some CPU for a more straightforward implementation.

However, it's often very useful in databases to have transparent
compression (that is, operations can be performed on the data even in
its compressed state). So I would advocate not merely passing the row
blobs through LZW or similar. Aggregation operations benefit in
particular because you often never need to decompress the rows
at all.
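
A rough illustration of aggregating over symbolized rows, assuming a
made-up layout in which each row is a list of (symbol, value) pairs
and `registry` maps symbols back to names. Neither is a real
Cassandra structure. The counting loop compares integers only, and
names are resolved once at the end.

```python
from collections import Counter

def count_columns(encoded_rows):
    """Count how often each column symbol occurs, without decoding any names."""
    counts = Counter()
    for row in encoded_rows:
        for sym, _value in row:
            counts[sym] += 1
    return counts

# Hypothetical registry: symbol -> column name
registry = [b"user_id", b"clicks", b"referrer"]

# Hypothetical encoded rows: (symbol, value) pairs
rows = [
    [(0, b"42"), (1, b"7")],
    [(0, b"43"), (1, b"3"), (2, b"example.com")],
]

by_symbol = count_columns(rows)
# Resolve names once per distinct symbol, after the aggregation is done.
by_name = {registry[sym]: n for sym, n in by_symbol.items()}
```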

This isn't relevant to current Cassandra, but it could be a boon to
in-database stored procedures and the like.

Evan
On Sun, Jul 26, 2009 at 2:11 PM, Stu Hood<stuart.hood@rackspace.com> wrote:
> Also, long term, I think it is safe to assume that we will be adding compression for
> ColumnFamilies, which should have similar positive effects on cache-ability without too much
> application specific optimization.
>
>
> -----Original Message-----
> From: "Jonathan Ellis" <jbellis@gmail.com>
> Sent: Sunday, July 26, 2009 4:46pm
> To: cassandra-dev@incubator.apache.org
> Subject: Re: Symbolizing column names for storage and cache efficiency
>
> On Sun, Jul 26, 2009 at 2:28 AM, Evan Weaver<eweaver@gmail.com> wrote:
>> Would it be possible to add symbolized column names in a
>> forward-compatible way? Maybe scoped per sstable, with the registries
>> always kept in memory.
>
> Maybe.  But it's not obvious to me how to do this in general.
>
> The problem is the sparse nature of the column set.  We can't encode
> _all_ the columns this way, or in the degenerate case we OOM just
> trying to keep the mapping in memory.  Similarly, we can't encode just
> the top N column names, since figuring out the top N requires keeping
> each name in memory during the counting process.  (Besides slowing
> down compaction -- instead of just deserializing columns where there
> are keys in common in the merged fragments, we have to deserialize
> all.)
>
> ISTM that all we can do is encode the _first_ N column names we see,
> which may be useful if the column name set is small for a given CF.
>
> -Jonathan
>
>
>



-- 
Evan Weaver
