cassandra-dev mailing list archives

From Carl Bruecken
Subject Re: improving read performance
Date Mon, 20 Sep 2010 17:31:48 GMT

  On 9/20/10 12:47 PM, Peter Schuller wrote:
>> This drawback is unfortunate for systems that use time-based row keys. In
>> such systems, row data will generally not be fragmented very much, if at
>> all, but reads suffer because the assumption is that all data is fragmented.
>> Even further, in a real-time system where reads occur quickly after
>> writes, if the data is in memory, the sstables are still checked.
> Perhaps I am misunderstanding you, but why is this a problem (in the
> particular case of time-based row keys) given the existence of the
> bloom filters which should eliminate the need to go down to the
> sstables to any extent more than that they actually contain data for
> the row (in almost all cases, subject to bloom filter false
> positives)?
> Also, for the case of the edges where memtables are flushed, a
> write-through row cache should help alleviate that. I forget off hand
> whether the row cache is in fact write-through or not though.

Actually, the points you make are things I had overlooked, and they make
me feel more comfortable about how Cassandra will perform for my use
cases. In my case, I'm interested in finding out what the bloom filter
false-positive rate is. Hopefully a stat is kept on this. As long as ALL
of the bloom filters are in memory, the cost of a false positive should
be minimal, since the subsequent index read should reveal that the row
is not in the corresponding SSTable.
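
To make the false-positive question concrete, here is a minimal sketch of
a toy Bloom filter with the standard estimate (1 - e^(-kn/m))^k for the
false-positive rate, plus a way to measure the rate empirically. This is
not Cassandra's implementation: the class name, the SHA-256 double
hashing, and the parameters are all assumptions made for illustration.

```python
import hashlib
import math


class ToyBloomFilter:
    """Illustrative Bloom filter; NOT Cassandra's implementation."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Double hashing (an assumption): derive k positions from two
        # 64-bit halves of a single SHA-256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def expected_false_positive_rate(num_bits, num_hashes, num_keys):
    """Standard approximation: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes


def measured_false_positive_rate(bloom, absent_keys):
    """Fraction of keys known to be absent that the filter still reports."""
    hits = sum(1 for key in absent_keys if bloom.might_contain(key))
    return hits / len(absent_keys)
```

With 10 bits per key and 5 hash functions, the estimate works out to a
bit under 1%, so only about one read in a hundred for an absent row
would fall through to the index lookup in this toy model.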

Good point on the row cache. I had actually misread the comments in the
yaml, mistaking "do not use on ColumnFamilies with LARGE ROWS" for "do
not use on ColumnFamilies with a LARGE NUMBER OF ROWS". I don't know
whether this will improve performance much, since I don't yet understand
whether it eliminates the need to check for the data in the SSTables. If
it doesn't, then what is the point of the row cache, since the data is
also in an in-memory memtable?
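
One way to think about the difference is the toy model below. It assumes
(this is not confirmed anywhere in this thread) that the row cache holds
the fully merged row and is updated write-through, whereas the memtable
holds only the writes made since the last flush. Under those assumptions
a cache hit can skip the SSTables, while a memtable hit cannot. All names
here are hypothetical and none of this is Cassandra code.

```python
class ToyStore:
    """Toy read path; every name and behaviour here is an assumption."""

    def __init__(self):
        self.row_cache = {}      # key -> fully merged row
        self.memtable = {}       # key -> columns written since last flush
        self.sstables = []       # flushed memtables, oldest first
        self.sstable_checks = 0  # how many SSTables reads had to consult

    def write(self, key, columns):
        self.memtable.setdefault(key, {}).update(columns)
        # Write-through (assumed): keep an already cached row current.
        if key in self.row_cache:
            self.row_cache[key].update(columns)

    def flush(self):
        # Turn the current memtable into an immutable "SSTable".
        if self.memtable:
            self.sstables.append(self.memtable)
            self.memtable = {}

    def read(self, key, use_row_cache=True):
        if use_row_cache and key in self.row_cache:
            # The cached row is already merged, so no SSTable is touched.
            return dict(self.row_cache[key])
        merged = {}
        # A memtable hit alone is not enough: older columns for the same
        # row may live in any SSTable, so each one is consulted.
        for sstable in self.sstables:
            self.sstable_checks += 1
            merged.update(sstable.get(key, {}))
        merged.update(self.memtable.get(key, {}))
        if use_row_cache:
            self.row_cache[key] = dict(merged)
        return merged
```

In this model the first read of a row pays for the SSTable checks and
later reads do not, even after further writes to the same row.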

That aside, splitting the memtable in two could make checking the bloom
filters unnecessary in most cases for me, but I'm not sure it's worth
the effort.
