cassandra-commits mailing list archives

From "Terje Marthinussen (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-47) SSTable compression
Date Sat, 14 May 2011 05:41:48 GMT


Terje Marthinussen commented on CASSANDRA-47:

Just curious if any active work is being done, or planned for the near future, on compressing larger data
blocks, or is it all suspended waiting for a new sstable design?

Having played with compression of just supercolumns for a while, I am a bit tempted to test
out compression of larger blocks of data. At least row level compression seems reasonably
easy to do.
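
As a rough illustration of what I mean by row-level compression, a minimal sketch (class and method names here are made up, not actual Cassandra internals; java.util.zip is just a stand-in for whatever codec would actually be used):

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper: compress/decompress one serialized row as a single block.
public final class RowBlockCodec
{
    public static byte[] compress(byte[] serializedRow)
    {
        Deflater deflater = new Deflater(Deflater.BEST_SPEED);
        deflater.setInput(serializedRow);
        deflater.finish();

        ByteArrayOutputStream out = new ByteArrayOutputStream(serializedRow.length);
        byte[] buffer = new byte[4096];
        while (!deflater.finished())
            out.write(buffer, 0, deflater.deflate(buffer));
        deflater.end();
        return out.toByteArray();
    }

    public static byte[] decompress(byte[] block, int uncompressedLength) throws Exception
    {
        Inflater inflater = new Inflater();
        inflater.setInput(block);
        byte[] row = new byte[uncompressedLength];
        int offset = 0;
        while (offset < uncompressedLength && !inflater.finished())
            offset += inflater.inflate(row, offset, uncompressedLength - offset);
        inflater.end();
        return row;
    }
}

The uncompressed length would of course have to be stored next to the block (e.g. in the row header/index) so the reader knows how much to allocate.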

Some experiences so far which may be useful:
- Compression on sstables may actually be helpful on memory pressure, but with my current
implementation, non-batched update throughput may drop 50%. I am not 100% sure why, actually.

- Flushing of (compressed) memtables and compactions are clear potential bottlenecks.
The obvious trouble maker here is the fact that you keep recompressing the same data.

For really high pressure work, I think it would be useful to only compress sstables once they
pass a certain size, to reduce the amount of recompression occurring on memtable flushes and
when compacting small sstables (which are generally not a big disk problem anyway).

This is a bit awkward when doing things the way I do in the super columns, as I believe the supercolumn
does not know anything about the data it is part of (except that recently, the deserializer
has that info through "inner").

It would anyway probably be cleaner to let the data structures/methods using the SC decide
when to compress and not the SC itself; see the sketch after this paragraph.
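
A size threshold is probably the simplest possible policy for that decision. A minimal sketch, with made-up names and numbers (this is not existing Cassandra configuration):

// Hypothetical policy: only compress once the data to be written is large enough,
// so memtable flushes and small compactions skip the (re)compression cost.
public final class CompressionPolicy
{
    private final long minCompressedBytes;

    public CompressionPolicy(long minCompressedBytes)
    {
        this.minCompressedBytes = minCompressedBytes;
    }

    public boolean shouldCompress(long estimatedBytes)
    {
        return estimatedBytes >= minCompressedBytes;
    }
}

// e.g. new CompressionPolicy(64L * 1024 * 1024).shouldCompress(estimatedFlushSize)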

- Working on a SC level, there seems to be some 10-15% extra compression on this specific
data if column names that are highly repetitive in SC's can be extracted into some meta data
structure so you only store references to these in the column names. That is, the final data
goes from about 40% compression to 50% compression.

I don't think the effect of this will be equally big with larger blocks, but I suspect there
should be some effect.
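
To make the idea concrete, a toy sketch of such a name dictionary (standalone illustration only, not the actual SC serializer):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: store each distinct column name once and replace
// occurrences with a small integer index into the dictionary.
public final class ColumnNameDictionary
{
    private final Map<String, Integer> indexByName = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    // Called while serializing: returns the index to write instead of the name.
    public int internName(String columnName)
    {
        Integer existing = indexByName.get(columnName);
        if (existing != null)
            return existing;
        int index = names.size();
        indexByName.put(columnName, index);
        names.add(columnName);
        return index;
    }

    // Called while deserializing: maps the stored index back to the name.
    public String nameAt(int index)
    {
        return names.get(index);
    }
}

The dictionary itself would be written once per SC/block as meta data, and the per-column name field shrinks to a small index.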

- Total size reduction of the sstables I have in this project is currently in the 60-65% range.
It is mainly beneficial for those that have supercolumns with at least a handful of columns
(400-600 bytes of serialized column data per SC at least).

- Reducing the meta data on columns by building a dictionary of timestamps as well as variable
length name/value length fields (instead of fixed short/int) cuts down another 10% in my test
(I have just simulated this with a very quick "10 minute" hack).
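
For the variable length fields, one possible encoding is a plain unsigned varint, so that small lengths take one byte instead of a fixed 2 or 4. A minimal sketch (just an illustration, not what the hack above actually does):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustration of variable-length length fields: an unsigned varint uses
// one byte for values < 128 instead of a fixed 2- or 4-byte field.
public final class VarInt
{
    public static void writeUnsignedVarInt(int value, DataOutputStream out) throws IOException
    {
        while ((value & ~0x7F) != 0)
        {
            out.writeByte((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.writeByte(value); // final byte, high bit clear
    }

    public static void main(String[] args) throws IOException
    {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeUnsignedVarInt(13, new DataOutputStream(bytes));   // 1 byte vs. 2 for a short
        writeUnsignedVarInt(300, new DataOutputStream(bytes));  // 2 bytes vs. 4 for an int
        System.out.println("encoded length: " + bytes.size());  // prints "encoded length: 3"
    }
}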

- We may want to look at how we can reuse whole compressed rows on compactions, if for instance
the other sstables you compact with do not have the same data.

- We may want a new cache on the uncompressed disk chunks. In my case, I preserve the compressed
part of the supercolumn and reuse it as long as it has not been modified.

In my supercolumn compression case, I have a cache for the compressed data so I can write
that back without recompression if it has not been modified. This also makes calls to get the serialized
size cheaper (no need to compress twice: once to find the serialized size and once to actually serialize).
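
The shape of that cache is roughly the following (a sketch with invented names; it reuses the RowBlockCodec stand-in from the earlier sketch):

// Sketch of the "keep the compressed form around" idea: the compressed bytes
// are cached next to the object and only rebuilt when the object was modified.
public final class CompressedCache
{
    private byte[] compressed;   // last compressed serialization, or null
    private boolean dirty = true;

    public void markModified()
    {
        dirty = true;
        compressed = null;
    }

    // Returns the cached compressed form, recompressing only when needed.
    public byte[] serialized(byte[] currentUncompressedForm)
    {
        if (dirty || compressed == null)
        {
            compressed = RowBlockCodec.compress(currentUncompressedForm);
            dirty = false;
        }
        return compressed;
    }

    // serializedSize() comes for free once the compressed form is cached.
    public int serializedSize(byte[] currentUncompressedForm)
    {
        return serialized(currentUncompressedForm).length;
    }
}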

If people are interested in adding any of the above to current Cassandra, I will try to get
time to bring some of this up to a quality where it could be used by the general public.

If not, I will wait for the new sstables to get a bit more ready and see if I can contribute there.

> SSTable compression
> -------------------
>                 Key: CASSANDRA-47
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Priority: Minor
>              Labels: compression
>             Fix For: 1.0
> We should be able to do SSTable compression which would trade CPU for I/O (almost always
> a good trade).

This message is automatically generated by JIRA.