More insight for this sstable: the ArrayList backing IndexSummary has 644195 entries, so the total number of keys in this sstable is 644195*128 = ~82mil. The problem is that the BloomFilter (long[19400551] inside BitSet) uses 19400551*64 = 1241635264 bits in total, which means each key takes ~15 bits. This seems to be in line with the number of buckets set in the sstable writer. I'm changing the code to make this bucket count configurable, so as to have more control over memory usage.
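
For anyone who wants to re-check the arithmetic above, here is a minimal sketch in Python. The numbers are the ones from the heap dump; the function name bloom_stats is made up for illustration and is not anything in the Cassandra code base.

```python
# Figures quoted from the heap dump described in this message.
INDEX_SUMMARY_ENTRIES = 644_195   # size of the ArrayList in IndexSummary
INDEX_INTERVAL = 128              # one summary entry per 128 keys
BLOOM_LONGS = 19_400_551          # length of the long[] inside the BitSet


def bloom_stats(summary_entries, index_interval, bloom_longs):
    """Return (estimated keys, filter bits, bits per key, filter size in MB).

    Hypothetical helper: it only reproduces the arithmetic from the email.
    """
    keys = summary_entries * index_interval
    bits = bloom_longs * 64            # each Java long holds 64 bits
    megabytes = bits / 8 / 1e6         # decimal megabytes
    return keys, bits, bits / keys, megabytes


keys, bits, bits_per_key, megabytes = bloom_stats(
    INDEX_SUMMARY_ENTRIES, INDEX_INTERVAL, BLOOM_LONGS
)
print(f"keys ~ {keys}, bits = {bits}")
print(f"bits per key ~ {bits_per_key:.2f}, filter size ~ {megabytes:.0f} MB")
```

The resulting size agrees with the ~155MB reported for the BloomFilter in the original message.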

-Weijun

On Tue, May 4, 2010 at 1:50 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
BloomFilter is not redundant, because it stores information about
_all_ keys, while the index summary only stores every 128th key.
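
On the question of a formula for the memory needed: a rough sketch using the
standard textbook Bloom filter approximations is below. This is not taken from
the Cassandra source; the function names are hypothetical, and it assumes
ideal, independent hash functions.

```python
import math


def bits_per_key_for(target_fp_rate):
    """Bits per key needed for a target false-positive rate, assuming the
    optimal number of hash functions: m/n = -ln(p) / (ln 2)^2."""
    return -math.log(target_fp_rate) / (math.log(2) ** 2)


def false_positive_rate(bits_per_key, num_hashes):
    """Approximate false-positive rate: p = (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes / bits_per_key)) ** num_hashes


def filter_bytes(num_keys, bits_per_key):
    """Estimated filter size in bytes for num_keys keys."""
    return math.ceil(num_keys * bits_per_key / 8)


# Example: about 82 million keys at 15 bits per key.
print(filter_bytes(82_456_960, 15))          # bytes needed for the bit array
print(false_positive_rate(15, 10))           # k = 10 is near-optimal for 15 bits/key
print(bits_per_key_for(0.01))                # bits per key for a 1% target rate
```

Under these assumptions, memory grows linearly with the number of keys, and
fewer bits per key trade memory for a higher false-positive rate.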

On Tue, May 4, 2010 at 3:47 PM, Weijun Li <weijunli@gmail.com> wrote:
> Hello,
>
> We stored about 47mil keys in one Cassandra node, and here is what a memory
> dump shows for one of the SSTableReaders:
>
>     SSTableReader: 386MB. Among this 386MB, IndexSummary takes about 231MB
> but BloomFilter takes 155MB with an embedded huge array long[19.4mil].
>
> It seems that BloomFilter is taking too much memory. If this is the case,
> BloomFilter seems to be redundant compared to the size of the index.
>
> So is this desired behavior? Is there a formula to estimate the size of
> needed memory for BloomFilter?
>
> Thanks,
>
> -Weijun
>
>



--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com