incubator-cassandra-user mailing list archives

From Daniel Doubleday <daniel.double...@gmx.net>
Subject Problematic usage pattern
Date Wed, 22 Dec 2010 14:50:14 GMT
Hi all

I wanted to share a Cassandra usage pattern you might want to avoid (if you can).

The combinations of 

- heavy rows,
- large volume and
- many updates (overwriting columns)

will lead to a higher count of live SSTables (at least if you aren't running major compactions
frequently), with many of those SSTables containing the same hot rows.
This leads to lots of reads that touch multiple SSTables, which increases latency and IO pressure
by itself, and makes the page cache less effective because it will contain loads of stale ('invalid') data.
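A rough sketch of the mechanism (plain Python, not Cassandra code; the memtable flush trigger here is a toy assumption): repeated overwrites of a hot row leave a fragment of that row in every flushed SSTable, so each read has to merge all of them until a major compaction runs.

```python
# Toy model: each memtable flush produces one immutable "SSTable";
# a row that is overwritten between flushes ends up in many of them.

class FakeColumnFamily:
    def __init__(self, memtable_limit=2):
        self.memtable = {}            # row_key -> {col: value}
        self.sstables = []            # each flush appends one immutable dict
        self.memtable_limit = memtable_limit

    def write(self, row_key, col, value):
        self.memtable.setdefault(row_key, {})[col] = value
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(self.memtable)   # flush memtable to disk
            self.memtable = {}

    def sstables_touched(self, row_key):
        # a read must merge the row fragments from every SSTable holding the key
        return sum(1 for sst in self.sstables if row_key in sst)

cf = FakeColumnFamily()
for i in range(6):
    cf.write("hot-row", "col", i)       # repeated overwrites of one hot row
    cf.write("row-%d" % i, "col", i)    # plus unrelated traffic forcing flushes

# the hot row now lives in all six SSTables; a cold row lives in just one
```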

In our case we could reduce reads by ~40%. Our rows contained one large column (1-4 KB) and
some 50-100 small columns.
We split them into two CFs and stored the large column in the second CF (CF2) under a UUID row key,
which we store as a pointer in CF1. We keep the now lightweight CF1 rows in the row cache, which
eliminates the update problem, and instead of updating the large column in place we create a new
CF2 row and delete the old one. That way the bloom filter prevents unnecessary reads.
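The scheme above can be sketched like this (plain dicts standing in for the two column families; names like `blob_ref` are my own, not from the actual schema):

```python
# Hypothetical sketch of the two-CF split: CF1 keeps the small, row-cached
# columns plus a UUID pointer to the large column, which lives in CF2.
# Updating the large column writes a *new* CF2 row and deletes the old one,
# so stale versions are never read back.

import uuid

cf1 = {}  # user row key -> small columns, incl. 'blob_ref' pointer
cf2 = {}  # uuid string  -> large column value

def write(row_key, small_cols, large_blob):
    ref = str(uuid.uuid4())
    cf2[ref] = large_blob
    cf1[row_key] = dict(small_cols, blob_ref=ref)

def update_large(row_key, new_blob):
    old_ref = cf1[row_key]['blob_ref']
    new_ref = str(uuid.uuid4())
    cf2[new_ref] = new_blob          # create a fresh row under a new key ...
    del cf2[old_ref]                 # ... and delete (tombstone) the old one
    cf1[row_key]['blob_ref'] = new_ref

def read_large(row_key):
    # two lookups, but the CF1 hit is served from the row cache in practice
    return cf2[cf1[row_key]['blob_ref']]

write("user:42", {"name": "daniel"}, b"big payload v1")
update_large("user:42", b"big payload v2")
```

Because each large-column version gets a brand-new row key, reads for old keys are rejected cheaply by the bloom filters instead of scanning SSTables.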

The downside is that to read the large column from CF2 we have to read CF1 first, but since
that row is in the row cache this is still way better.

To monitor this we wrote a very small patch that records the number of file scans per read for a CF
in a histogram, similar to the latency stats.
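The idea is roughly this (an assumed shape, not the actual patch -- see the gist below for the real Java code):

```python
# Sketch: bucket, per read, how many SSTable files were scanned,
# the same way the per-CF latency stats are bucketed.

from collections import Counter

class FileScanHistogram:
    def __init__(self):
        self.buckets = Counter()   # files scanned per read -> occurrences

    def record(self, files_scanned):
        self.buckets[files_scanned] += 1

    def snapshot(self):
        return dict(self.buckets)

hist = FileScanHistogram()
for n in (1, 1, 3, 5, 1):          # e.g. five reads touching 1..5 SSTables
    hist.record(n)
```

A fat tail in this histogram is exactly the symptom of the hot-row/overwrite pattern described above.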

If someone's interested - here is the patch against 0.6.8:

https://gist.github.com/751601

Cheers,
Daniel
smeet.com, Berlin



