hbase-dev mailing list archives

From "Jonathan Gray (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HBASE-1246) BloomFilter's use of BitSet is too inefficient
Date Fri, 06 Mar 2009 20:21:56 GMT

     [ https://issues.apache.org/jira/browse/HBASE-1246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1246:
---------------------------------

    Attachment: ByteBloomFilter.java

This is a custom Bloom filter implementation we have used internally.  I have modified it to use a byte[] to back the bit vector rather than a BitSet.

It also has an optimized/efficient Writable implementation (and static serialize/deserialize
that we use).

The way I'm actually doing the hashing is probably non-optimal.  Our use of it always took a long as the unit, but here we need (byte[], offset, len) as the unit.  This implementation generates an MD5 hash of the key, folds that into a long, instantiates a Random with the MD5-derived long as the seed, and then uses Random.nextInt() for each hash.  We should eventually pair this with the Jenkins hash, or whatever is being used up in Hadoop.
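For reference, the hashing scheme described above boils down to something like the following sketch.  Class and method names here are illustrative, not taken from the attached ByteBloomFilter.java:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Random;

// Illustrative sketch of the MD5-seeded hashing described above; names are
// hypothetical, not the actual API of the attached ByteBloomFilter.java.
public class Md5SeededHash {
  // Derive hashCount bit positions for key[offset..offset+len) in a
  // bit vector of bitSize bits.
  public static int[] hashPositions(byte[] key, int offset, int len,
                                    int hashCount, int bitSize) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      md5.update(key, offset, len);
      byte[] digest = md5.digest();
      // Fold the first 8 bytes of the 16-byte digest into a long seed.
      long seed = 0L;
      for (int i = 0; i < 8; i++) {
        seed = (seed << 8) | (digest[i] & 0xFF);
      }
      // Seed a Random and draw one index per hash function.
      Random rng = new Random(seed);
      int[] positions = new int[hashCount];
      for (int i = 0; i < hashCount; i++) {
        positions[i] = rng.nextInt(bitSize);
      }
      return positions;
    } catch (NoSuchAlgorithmException e) {
      throw new RuntimeException("MD5 not available", e);
    }
  }
}
```

Since the Random is seeded deterministically from the key's digest, the same key always maps to the same set of positions, which is all a Bloom filter needs.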

main() contains some basic unit testing.  Everything does appear to work and is WAY more efficient.

{code}
Bloom filter initialized, 9585 bits, 1199 bytes, 7 hashes, 0.01 error rate, up to 1000 elements
Bloom bit vector contains 9585 bits
Serialized as 1230 bytes
{code}
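The numbers in that output follow from the standard Bloom filter sizing formulas.  A minimal sketch (the rounding conventions here are an assumption; the attached code may round differently):

```java
// Standard Bloom filter sizing formulas; with truncating casts these
// reproduce the figures in the test output above (9585 bits, 1199 bytes,
// 7 hashes for 1000 elements at a 0.01 error rate).
public class BloomSizing {
  public static int bitSize(int maxKeys, double errorRate) {
    // m = -n * ln(p) / (ln 2)^2; truncation is an assumed convention
    return (int) (-maxKeys * Math.log(errorRate)
        / (Math.log(2) * Math.log(2)));
  }

  public static int byteSize(int bits) {
    return (bits + 7) / 8;  // round up to whole bytes
  }

  public static int hashCount(int bits, int maxKeys) {
    // k = (m / n) * ln 2, rounded to the nearest integer
    return (int) Math.round((double) bits / maxKeys * Math.log(2));
  }
}
```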

In-memory size will be similarly compact with the byte[] backing, instead of the BitSet (or boolean[]) used before.
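The byte[]-backed bit vector amounts to the usual bit ops: divide the position by 8 to find the byte, take the remainder to find the bit.  A minimal sketch, not the code in the attached ByteBloomFilter.java:

```java
// Minimal byte[]-backed bit vector with bit ops, as described above.
// Illustrative only; the attached ByteBloomFilter.java may differ.
public class ByteBitVector {
  private final byte[] bits;

  public ByteBitVector(int bitCount) {
    bits = new byte[(bitCount + 7) / 8];  // one byte holds eight bits
  }

  public void set(int pos) {
    bits[pos >> 3] |= (1 << (pos & 7));   // pos/8 picks the byte, pos%8 the bit
  }

  public boolean get(int pos) {
    return (bits[pos >> 3] & (1 << (pos & 7))) != 0;
  }

  public int byteSize() {
    return bits.length;
  }
}
```

At one bit per element this is the 8x win over a one-boolean-per-bit layout.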

> BloomFilter's use of BitSet is too inefficient
> ----------------------------------------------
>
>                 Key: HBASE-1246
>                 URL: https://issues.apache.org/jira/browse/HBASE-1246
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.20.0
>         Environment: Java 1.6, OSX 64 bit
>            Reporter: ryan rawson
>            Assignee: ryan rawson
>             Fix For: 0.20.0
>
>         Attachments: ByteBloomFilter.java
>
>
> From the logfile run of TestBloomFilter with special SizeOf agent jar:
> Writing bloom filter for: hdfs://localhost:64003/user/ryan/testComputedParameters/1278366260/contents/6159869037185296839 for size: 100
> 2009-03-06 01:54:25,491 DEBUG [RegionServer:0.cacheFlusher] regionserver.StoreFile$StoreFileWriter(319): New bloom filter: vectorSize: 1175 hash_count: 5 numKeys: 100
> Serialized bloomfilter size: 160
> In memory bf size: 1248
> As we can see, the bit vector is 1175 bits, and the serialized size is fairly compact - 160 bytes.
> But the in-memory size is nearly 10x bigger than it has to be.  Looking in BloomFilter we see:
>   BitSet bits;
> is the only field.
> Clearly the BitSet is using 1 byte = 1 bit.  That is an 8x expansion over where we should be.
> Considering every HFile could potentially have a bloom filter, and bloom filters are more likely to have bit vector sizes of 10,000-100,000, we should do something about this - i.e., write our own bit-set that uses byte[] and bit ops.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

