hadoop-common-user mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: Best practice for in memory data?
Date Thu, 25 Jan 2007 20:04:42 GMT
Johan Oskarsson wrote:
> Hi.
>
> Currently some of my map reduce jobs need quick access to additional 
> data to check some input values in the map phase.
>
> This data is currently held in memory in a hashmap. It's very quick 
> but as each job starts several jvms the data will be held in memory 
> multiple times. It will also mean I have to increase the memory each 
> task uses. This in turn leads to out of memory problems if too many 
> memory intensive tasks are run resulting in the job being lost.
>
> One alternative would be to use a mapfile, but they're obviously much 
> slower. The solution I'm considering is to use a hashmap
> as the in memory cache and a mapfile as the underlying data source.
>
> I've read the javadoc on DistributedCache, but that seems to only deal 
> with distributing the actual data, not on how to do fast reading from it.
>
> Any advice on how to solve this problem?
> Would it be possible to somehow share a hashmap between tasks?

Here are two practical tips:

* Instead of a HashMap you could use a trie, e.g. a prefix trie (there 
is an implementation of this in Nutch). If the values are short I load 
them directly into the trie; if they are long I compute a hash (e.g. a 
minimal perfect hash) and store only the hashes in the trie.

* For some types of problems (mostly when the wanted values are a small 
subset of all possible input values) I use Bloom filters to quickly 
check whether the current value is likely to match, and then consult a 
trie, or in the worst case an external MapFile. However, if the current 
value is absent from the Bloom filter I don't have to check anything, 
which is a win.
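
The lookup order in the second tip can be sketched roughly as follows. This is only an illustration under some assumptions: the class and method names (TieredLookup, BloomFilter, lookup) are hypothetical and do not come from Hadoop or Nutch, a plain Map stands in for the trie, and a Function stands in for the external MapFile reader.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from Hadoop or Nutch.
class TieredLookup {

    /** Minimal Bloom filter over strings, using double hashing. */
    static final class BloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        BloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        // Second hash in the style of FNV-1a, applied per char.
        private static int fnv1a(String key) {
            int h = 0x811c9dc5;
            for (int i = 0; i < key.length(); i++) {
                h ^= key.charAt(i);
                h *= 0x01000193;
            }
            return h;
        }

        // i-th bit position for a key: h1 + i * h2, reduced to [0, numBits).
        private int index(String key, int i) {
            return Math.floorMod(key.hashCode() + i * fnv1a(key), numBits);
        }

        void add(String key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(key, i));
            }
        }

        // false means "definitely absent"; true means "possibly present".
        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    private final BloomFilter bloom;
    private final Map<String, String> inMemory;          // stand-in for the trie
    private final Function<String, String> slowStore;    // stand-in for a MapFile
    private int slowLookups = 0;

    TieredLookup(BloomFilter bloom, Map<String, String> inMemory,
                 Function<String, String> slowStore) {
        this.bloom = bloom;
        this.inMemory = inMemory;
        this.slowStore = slowStore;
    }

    /** Returns the value for key, or null if the key is unknown. */
    String lookup(String key) {
        if (!bloom.mightContain(key)) {
            return null;                 // absent from the filter: nothing to check
        }
        String value = inMemory.get(key);
        if (value != null) {
            return value;                // served from memory
        }
        slowLookups++;                   // worst case: go to the external store
        return slowStore.apply(key);
    }

    int getSlowLookups() {
        return slowLookups;
    }

    public static void main(String[] args) {
        Map<String, String> backing = new HashMap<>();
        backing.put("hot", "v1");
        backing.put("cold", "v2");

        BloomFilter bloom = new BloomFilter(1 << 16, 3);
        for (String k : backing.keySet()) {
            bloom.add(k);
        }

        Map<String, String> hot = new HashMap<>();
        hot.put("hot", "v1");

        TieredLookup t = new TieredLookup(bloom, hot, backing::get);
        System.out.println(t.lookup("hot"));
        System.out.println(t.lookup("cold"));
        System.out.println(t.getSlowLookups());
    }
}
```

The filter is built from every key in the full data set, so a negative answer is always safe to trust. A false positive only costs one extra lookup in the slow store, which then returns null.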

Hope this helps ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


