hadoop-common-user mailing list archives

From Johan Oskarsson <jo...@oskarsson.nu>
Subject Best practice for in memory data?
Date Sun, 21 Jan 2007 15:59:10 GMT

Currently some of my MapReduce jobs need quick access to additional 
data to check some input values in the map phase.

This data is currently held in memory in a HashMap. It's very quick, but 
since each job starts several JVMs, the data is held in memory multiple 
times. It also means I have to increase the memory each task uses, which 
in turn leads to out-of-memory problems when too many memory-intensive 
tasks run, resulting in the job being lost.

One alternative would be to use a MapFile, but that is obviously much 
slower. The solution I'm considering is a HashMap as the in-memory 
cache with a MapFile as the underlying data source.
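A minimal sketch of that layered lookup, assuming a bounded LRU cache in front of the slower store. The MapFile access is stubbed out here (the `loadFromBackingStore` method and the cache size are assumptions, not working Hadoop code); in the real task it would be something like a `MapFile.Reader` lookup:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded LRU cache in front of a slower backing store (e.g. a MapFile).
// LinkedHashMap with accessOrder=true plus removeEldestEntry gives LRU
// eviction, so memory use stays capped per task.
public class CachedLookup {
    private static final int MAX_ENTRIES = 10_000; // assumed cache size

    private final Map<String, String> cache =
        new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > MAX_ENTRIES; // evict least-recently-used entry
            }
        };

    public String get(String key) {
        String value = cache.get(key);
        if (value == null) {
            // Cache miss: fall through to the on-disk store.
            value = loadFromBackingStore(key);
            if (value != null) {
                cache.put(key, value);
            }
        }
        return value;
    }

    // Stub standing in for a MapFile.Reader lookup (hypothetical).
    private String loadFromBackingStore(String key) {
        return "value-for-" + key;
    }

    public static void main(String[] args) {
        CachedLookup lookup = new CachedLookup();
        System.out.println(lookup.get("k1")); // prints "value-for-k1"
    }
}
```

That way hot keys stay as fast as the current all-in-memory HashMap, while cold keys pay the MapFile seek cost only once per task.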

I've read the javadoc for DistributedCache, but that seems to deal only 
with distributing the actual data, not with how to read from it quickly.

Any advice on how to solve this problem?
Would it be possible to somehow share a hashmap between tasks?

A big thanks to the hadoop team for the hard work they put into this.

