hadoop-common-dev mailing list archives

From "Jim Kellerman (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-1757) [hbase] Bloomfilters: single argument constructor, use enum for bloom filter types
Date Wed, 22 Aug 2007 18:00:34 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-1757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12521877 ]

Jim Kellerman commented on HADOOP-1757:
---------------------------------------

> 1. Non-existence of a single argument constructor
>
> When creating an instance of the BloomFilterDescriptor class,
> I need to specify some options for the newly created bloom filter.
> There are three options: type, vectorSize, nbHash.
> I know that these options are important for
> the internal working of a bloom filter, but I cannot help
> but confess that I don't really understand
> what vectorSize and nbHash mean and how these two options
> affect the way in which a bloom filter works.
> As the user of a bloom filter, the only thing I am concerned with is
> the first option, the name of the bloom filter that I'd like to
> use for the column, and it would be nice if the other options
> are automatically decided and filled in.

There is no way to automatically determine the vector size and the number of hash functions to
use. In particular, bloom filters are very sensitive to the number of elements inserted into
them. For HBase, the number of entries depends on the size of the data stored in the column.
Currently the default region size is 64MB, so the number of entries is approximately
64MB / (average value size for the column).
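As a rough illustration of that estimate, here is a minimal Java sketch. The class and
method names are hypothetical (they are not part of HBase), and it assumes the whole
region holds values of a single column with a known average size.

```java
// Hypothetical helper, not an HBase class: estimates how many entries a
// bloom filter for one region would have to hold.
final class RegionEntryEstimate {
    /** The default region size mentioned above: 64MB. */
    static final long DEFAULT_REGION_BYTES = 64L * 1024 * 1024;

    /** Approximate entries = region size / average value size (at least 1). */
    static long estimateEntries(long regionBytes, long averageValueBytes) {
        if (regionBytes <= 0 || averageValueBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return Math.max(1L, regionBytes / averageValueBytes);
    }

    public static void main(String[] args) {
        // With 1KB values, a 64MB region holds roughly 65536 entries.
        System.out.println(estimateEntries(DEFAULT_REGION_BYTES, 1024)); // prints 65536
    }
}
```

The estimate changes by orders of magnitude with the value size, which is why a fixed
default would fit few columns well.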

If m denotes the number of bits in the Bloom filter (vectorSize), n denotes the number of
elements inserted into the Bloom filter and k represents the number of hash functions used
(nbHash), then according to Broder and Mitzenmacher,

( http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/BloomFilterSurvey.pdf )

the probability of false positives is minimized when k is approximately (m/n) ln(2).
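To make the relationship concrete, the following Java sketch computes k from m and n and
evaluates the usual approximation for the false positive probability, (1 - e^(-kn/m))^k.
The class and method names are made up for illustration and are not HBase API.

```java
// Hypothetical helper illustrating the k = (m/n) ln(2) rule; not HBase code.
final class BloomSizing {
    /** Number of hash functions that approximately minimizes false positives. */
    static int optimalHashCount(long vectorSizeBits, long entries) {
        if (vectorSizeBits <= 0 || entries <= 0) {
            throw new IllegalArgumentException("m and n must be positive");
        }
        long k = Math.round(((double) vectorSizeBits / entries) * Math.log(2));
        return (int) Math.max(1L, k); // always use at least one hash function
    }

    /** Standard approximation of the false positive probability. */
    static double falsePositiveRate(long vectorSizeBits, long entries, int hashCount) {
        double exponent = -((double) hashCount * entries) / vectorSizeBits;
        return Math.pow(1.0 - Math.exp(exponent), hashCount);
    }

    public static void main(String[] args) {
        long m = 1_000_000; // bits in the filter (vectorSize)
        long n = 100_000;   // elements inserted
        int k = optimalHashCount(m, n);
        System.out.println("k = " + k); // prints k = 7
        System.out.println("p = " + falsePositiveRate(m, n, k)); // roughly 0.008
    }
}
```

Note that k depends only on the ratio m/n, so the vector size still has to be chosen
separately, for example from a target false positive rate.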

So we could provide a constructor that takes two arguments:
- bloom filter type
- estimated number of entries

Would that be acceptable?
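
A minimal Java sketch of what such a constructor might look like follows. Everything in
it is an assumption made for illustration: the enum and class names, the enum constants,
and in particular the fixed bits-per-entry constant used to pick the vector size, which
the proposal above does not specify. The enum also shows how the reporter's second
request (string to type via valueOf) could be met.

```java
// Illustrative sketch only; names and the sizing policy are assumptions,
// not the actual BloomFilterDescriptor implementation.
enum SketchBloomFilterType { BLOOMFILTER, COUNTING_BLOOMFILTER, RETOUCHED_BLOOMFILTER }

final class SketchBloomFilterDescriptor {
    /** Assumed sizing policy: bits of vector per expected entry. */
    static final int ASSUMED_BITS_PER_ENTRY = 10;

    final SketchBloomFilterType type;
    final int vectorSize; // m
    final int nbHash;     // k

    /** Two-argument constructor: filter type and estimated number of entries. */
    SketchBloomFilterDescriptor(SketchBloomFilterType type, int numberOfEntries) {
        if (type == null) {
            throw new NullPointerException("type");
        }
        if (numberOfEntries <= 0) {
            throw new IllegalArgumentException("numberOfEntries must be positive");
        }
        long bits = (long) numberOfEntries * ASSUMED_BITS_PER_ENTRY;
        if (bits > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("too many entries for an int vector size");
        }
        this.type = type;
        this.vectorSize = (int) bits;
        // k is approximately (m/n) ln(2), with a floor of one hash function.
        long k = Math.round(((double) vectorSize / numberOfEntries) * Math.log(2));
        this.nbHash = (int) Math.max(1L, k);
    }

    public static void main(String[] args) {
        SketchBloomFilterType t = SketchBloomFilterType.valueOf("BLOOMFILTER");
        SketchBloomFilterDescriptor d = new SketchBloomFilterDescriptor(t, 65536);
        System.out.println(d.type + " m=" + d.vectorSize + " k=" + d.nbHash);
        // prints BLOOMFILTER m=655360 k=7
    }
}
```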



> [hbase] Bloomfilters: single argument constructor, use enum for bloom filter types
> ----------------------------------------------------------------------------------
>
>                 Key: HADOOP-1757
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1757
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: contrib/hbase
>    Affects Versions: 0.15.0
>            Reporter: Jim Kellerman
>            Assignee: Jim Kellerman
>            Priority: Minor
>             Fix For: 0.15.0
>
>
> On Thu, 2007-08-23 at 01:46 +0900, Inchul Song wrote:
> > Hi all,
> > 
> > When I create a column for an Hbase table, I have to create
> > an instance of the HColumnDescriptor class, and
> > pass over an instance of the BloomFilterDescriptor class
> > describing which bloom filter to use to the constructor
> > of the HColumnDescriptor class.
> > 
> > But there is some inconvenience in using the BloomFilterDescriptor class:
> > 
> > 1. Non-existence of a single argument constructor
> > 
> > When creating an instance of the BloomFilterDescriptor class,
> > I need to specify some options for the newly created bloom filter.
> > There are three options: type, vectorSize, nbHash.
> > I know that these options are important for
> > the internal working of a bloom filter, but I cannot help
> > but confess that I don't really understand
> > what vectorSize and nbHash mean and how these two options
> > affect the way in which a bloom filter works.
> > As the user of a bloom filter, the only thing I am concerned with is
> > the first option, the name of the bloom filter that I'd like to
> > use for the column, and it would be nice if the other options
> > are automatically decided and filled in.
> > 
> > So it would be nice if there is a constructor
> > with a single 'type' argument in the BloomFilterDescriptor class.
> > 
> > 2. Bloom filter types are defined as integers
> > 
> > Bloom filter types are not in an enumeration class.
> > Thus, when filling in the type option of the constructor
> > from a String value, I always have to write some translation
> > code from the string value to one of the integer values
> > representing bloom filter types.
> > 
> > If there is an enumeration class containing bloom filter types,
> > I can utilize the valueOf method of the enumeration class
> > to do this tedious job.
> > 
> > Thanks,
> > 
> > Song

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

