hadoop-common-issues mailing list archives

From "Yonatan Gottesman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11829) Improve the vector size of Bloom Filter from int to long, and storage from memory to disk
Date Thu, 25 May 2017 08:44:04 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16024406#comment-16024406 ]

Yonatan Gottesman commented on HADOOP-11829:
--------------------------------------------

So on each query, do you load only the parts of the bitset that you need and check those?


> Improve the vector size of Bloom Filter from int to long, and storage from memory to disk
> -----------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11829
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11829
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Hongbo Xu
>            Assignee: Hongbo Xu
>            Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> org.apache.hadoop.util.bloom.BloomFilter(int vectorSize, int nbHash, int hashType)
> This filter can hold almost 900 million objects when the false positive probability is
> 0.0001, and it needs 2.1G of RAM.
> In my project, I need to build a filter with a capacity of 2 billion, which needs 4.7G of
> RAM. The vector size is 38340233509, which is out of the range of int, and I do not have
> that much RAM. So I rebuilt a big Bloom filter whose vector size type is long and split the
> bit data into several files on disk, then distributed the files to the worker nodes. The
> performance is very good.
> I think I can contribute this code to Hadoop Common, and a 128-bit Hash function (MurmurHash)
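
The sizing in the description above can be checked with the standard Bloom filter formulas, m = ceil(-n * ln(p) / (ln 2)^2) and k = round(m / n * ln 2). The sketch below is only an illustration and is not the reporter's code, which is not shown in this message. The class name BigBloomSketch, the segment size of 2^30 bits per file, and the helper methods are all hypothetical. It shows why a long bit index is needed for 2 billion entries, and one possible way a long bit index could be mapped to a file segment and a byte within that file.

```java
public final class BigBloomSketch {
    // Hypothetical segment size: 2^30 bits (128 MiB) per on-disk file.
    public static final long SEGMENT_BITS = 1L << 30;

    /** Standard sizing formula: m = ceil(-n * ln(p) / (ln 2)^2), in bits. */
    public static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Standard formula for the number of hash functions: k = round(m / n * ln 2). */
    public static int optimalHashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /** Which segment file holds the given bit (assumed layout). */
    public static long segmentOf(long bitIndex) {
        return bitIndex / SEGMENT_BITS;
    }

    /** Byte offset of the given bit inside its segment file. */
    public static long byteOffsetInSegment(long bitIndex) {
        return (bitIndex % SEGMENT_BITS) >>> 3;
    }

    /** Mask selecting the given bit inside its byte. */
    public static int bitMask(long bitIndex) {
        return 1 << (int) (bitIndex & 7);
    }

    /** Number of segment files needed to hold m bits. */
    public static long segmentCount(long m) {
        return (m + SEGMENT_BITS - 1) / SEGMENT_BITS;
    }

    public static void main(String[] args) {
        long n = 2_000_000_000L; // capacity from the issue description
        double p = 0.0001;       // false positive probability from the description
        long m = optimalBits(n, p);

        System.out.println("bits needed:      " + m);
        System.out.println("fits in an int:   " + (m <= Integer.MAX_VALUE));
        System.out.println("hash functions:   " + optimalHashes(m, n));
        System.out.println("segment files:    " + segmentCount(m));

        long lastBit = m - 1;
        System.out.println("last bit segment: " + segmentOf(lastBit)
                + ", byte offset " + byteOffsetInSegment(lastBit));
    }
}
```

With a layout like this, a membership query touches at most k bytes, one per hash function, so only the segments that contain those bytes need to be read. Whether the proposed patch works this way is not stated in this message.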



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


