hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jonathan Gray (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-1246) BloomFilter's use of BitSet is too inefficient
Date Fri, 06 Mar 2009 20:49:58 GMT

     [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Jonathan Gray updated HBASE-1246:

    Attachment: ByteBloomFilter.java

Contains bitvals array for computation efficiency.  If I have a static final in a class, is
it the case that we don't pay the memory overhead for each object instantiation?

This also now implements HeapSize.  Currently does not include size of lookup array.

Bloom bit vector contains 9585 bits
Serialized as 1230 bytes
HeapSize is 1255

> BloomFilter's use of BitSet is too inefficient
> ----------------------------------------------
>                 Key: HBASE-1246
>                 URL: https://issues.apache.org/jira/browse/HBASE-1246
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>         Environment: Java 1.6, OSX 64 bit
>            Reporter: ryan rawson
>            Assignee: ryan rawson
>             Fix For: 0.20.0
>         Attachments: ByteBloomFilter.java
> From the logfile run of TestBloomFilter with special SizeOf agent jar:
> Writing bloom filter for: hdfs://localhost:64003/user/ryan/testComputedParameters/1278366260/contents/6159869037185296839
for size: 100
> 2009-03-06 01:54:25,491 DEBUG [RegionServer:0.cacheFlusher] regionserver.StoreFile$StoreFileWriter(319):
New bloom filter: vectorSize: 1175 hash_count: 5 numKeys: 100
> Serialized bloomfilter size: 160
> In memory bf size: 1248
> As we can see, the bit vector is 1175 bits, and the serialized size is fairly compact
- 160 bytes.
> But the in-memory size is nearly 10x bigger than it has to be.  Looking in BloomFilter
we see:
>   BitSet bits;
> is the only field.
> Clearly it seems the BitSet is using 1 byte = 1 bit.  That is an 8 time expansion of
where we should be.
> Considering every HFile could potentially have a bloom filter, and bloom filters are
more likely to have bit vector sizes of 10,000-100,000, we should do something about this.
 Aka: write our own bit-set that uses byte[] and bit ops.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message