nutch-user mailing list archives

From: Markus Jelsma <markus.jel...@openindex.io>
Subject: Re: Very large filter lists
Date: Mon, 05 Dec 2011 17:37:25 GMT
We use Bloom filters as well, but instead of a domain filter, for which a 
Bloom filter would be a good choice, we have a subdomain normalizer: we need 
to look up a key and get something back.

Now, I've checked the code again, and both the normalizers and the filters are 
instantiated in each thread. This consumes significant additional heap space.

Are there any objections to sharing them between threads? I assume things 
will get a lot slower. Or could I just share the HashMap between instances? 
Suggestions?

This is about a custom fetcher that does parsing and outlink processing as 
well.
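
To make the sharing idea concrete, here is a rough sketch of what I have in 
mind: load the subdomain mappings once per JVM into a read-only map and let 
every fetcher thread call into it. The class, method and file names below are 
made up for illustration, not actual Nutch code, and the tab-separated rules 
file is just an assumption about the input format.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SharedSubdomainMap {

  // Loaded once and published via the class initializer; all fetcher threads
  // read the same immutable map, so no per-thread copies are kept on the heap.
  private static final Map<String, String> MAPPINGS = load("subdomain-rules.txt");

  private static Map<String, String> load(String path) {
    Map<String, String> m = new HashMap<String, String>();
    try (BufferedReader in = new BufferedReader(new FileReader(path))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) {
          m.put(parts[0], parts[1]);   // e.g. "foo.example.com <TAB> example.com"
        }
      }
    } catch (IOException e) {
      throw new RuntimeException("Cannot load subdomain rules", e);
    }
    return Collections.unmodifiableMap(m);
  }

  // Threads only ever call get(); a plain HashMap is safe to share as long as
  // it is never modified after construction.
  public static String normalize(String host) {
    String target = MAPPINGS.get(host);
    return target != null ? target : host;
  }
}

Since the map is never modified after construction, sharing it between 
threads should not need any locking and should not slow the look-ups down.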

On Wednesday 30 November 2011 22:41:58 Andrzej Bialecki wrote:
> There's an implementation of a Bloom filter in Hadoop. Since the number of 
> items is known in advance, it's possible to pick the right size of the 
> filter to keep the error rate at an acceptable level.
> 
> One trick that you may consider when using Bloom filters is to have an 
> additional list of exceptions, i.e. common items that give false 
> positives. If you properly balance the size of the filter and the size 
> of the exception list you can still keep the total size of the structure 
> down while improving the error rate.
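
For reference, the Hadoop Bloom filter Andrzej mentions could be wired up 
roughly as below, with a small exception set for known false positives. The 
class name, the sizing via the usual m = -n*ln(p)/(ln 2)^2 formula, and the 
UTF-8 encoding of keys are my own choices for this sketch, not anything 
prescribed by Hadoop.

import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class DomainBloomFilter {

  private final BloomFilter filter;
  private final Set<String> exceptions = new HashSet<String>();

  public DomainBloomFilter(Iterable<String> domains, int expectedCount,
                           double falsePositiveRate) {
    // Standard sizing: m bits and k hash functions for n expected items.
    int vectorSize = (int) Math.ceil(
        -expectedCount * Math.log(falsePositiveRate) / (Math.log(2) * Math.log(2)));
    int nbHash = Math.max(1,
        (int) Math.round((double) vectorSize / expectedCount * Math.log(2)));
    filter = new BloomFilter(vectorSize, nbHash, Hash.MURMUR_HASH);
    for (String d : domains) {
      filter.add(new Key(d.getBytes(StandardCharsets.UTF_8)));
    }
  }

  // Common items known to collide (false positives) go into the exception list.
  public void addException(String domain) {
    exceptions.add(domain);
  }

  public boolean contains(String domain) {
    if (exceptions.contains(domain)) {
      return false;
    }
    return filter.membershipTest(new Key(domain.getBytes(StandardCharsets.UTF_8)));
  }
}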

-- 
Markus Jelsma - CTO - Openindex
