hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: HashMap which can spill to disk for Hadoop?
Date Wed, 19 Dec 2007 23:04:55 GMT


You should also be able to get quite a bit of mileage out of special-purpose
HashMaps.  In general, Java's generic collections incur large to huge
penalties in certain special cases.  If you have one of these special cases,
or can put up with one, then you may be able to get an improvement of an
order of magnitude or more.

Take as an example a hashed set of integers.  A HashSet<Integer> will
consume about 40 bytes per entry, and each boxed Integer will consume
considerable space as well.  You can code a customized multiple-hash set for
integers only that consumes 4 * alpha bytes per entry, where alpha ~= 2.  If
you really just want a set of integers and are willing to accept small errors
in the count of distinct items, then such a set will suffice nicely.
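As a rough illustration (not code from any library; the class and method
names here are invented), an open-addressing set of primitive ints along
these lines might look like the following.  It stores each entry in 4 bytes
of a table kept at roughly half occupancy (alpha ~= 2), with no boxing and
no per-entry node objects:

```java
// Sketch of a special-purpose int hash set using open addressing with
// linear probing.  ~4 * alpha bytes per entry (alpha ~= 2) versus the
// ~40+ bytes per entry of HashSet<Integer>.  Illustrative only.
public class IntHashSet {
    // Sentinel for an empty slot; assumes this value is never stored.
    private static final int EMPTY = Integer.MIN_VALUE;
    private int[] table;
    private int size;

    public IntHashSet(int expected) {
        // Allocate a power-of-two table about twice the expected size.
        int cap = Integer.highestOneBit(Math.max(4, expected * 2) - 1) << 1;
        table = new int[cap];
        java.util.Arrays.fill(table, EMPTY);
    }

    private int slot(int x) {
        int h = x * 0x9E3779B9;          // cheap multiplicative mix
        return (h ^ (h >>> 16)) & (table.length - 1);
    }

    public boolean add(int x) {
        if (size * 2 >= table.length) grow();  // keep occupancy <= 1/2
        int i = slot(x);
        while (table[i] != EMPTY) {            // linear probing
            if (table[i] == x) return false;   // already present
            i = (i + 1) & (table.length - 1);
        }
        table[i] = x;
        size++;
        return true;
    }

    public boolean contains(int x) {
        int i = slot(x);
        while (table[i] != EMPTY) {
            if (table[i] == x) return true;
            i = (i + 1) & (table.length - 1);
        }
        return false;
    }

    public int size() { return size; }

    private void grow() {
        int[] old = table;
        table = new int[old.length * 2];
        java.util.Arrays.fill(table, EMPTY);
        size = 0;
        for (int v : old) if (v != EMPTY) add(v);  // rehash into new table
    }
}
```

The EMPTY sentinel is the usual trick for avoiding a separate occupancy
bitmap; it costs you the ability to store that one value.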

Now, a factor of 15 improvement isn't anything to sneeze at, but it isn't
the same as unlimited.
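If small errors in the distinct count really are acceptable, you can go
further than an exact set and keep only a fixed-size bitmap (linear
counting).  A hypothetical sketch, with invented names, assuming the number
of distinct values stays well below the bitmap size:

```java
// Sketch of linear counting: hash each value to one bit of a fixed-size
// bitmap, then estimate the distinct count from the fraction of bits
// still zero.  Memory is m bits total, regardless of the input size.
import java.util.BitSet;

public class LinearCounter {
    private final BitSet bits;
    private final int m;   // bitmap size in bits

    public LinearCounter(int m) {
        this.m = m;
        this.bits = new BitSet(m);
    }

    public void add(int x) {
        int h = x * 0x9E3779B9;                  // cheap multiplicative mix
        bits.set(((h ^ (h >>> 16)) & 0x7FFFFFFF) % m);
    }

    // Estimate of the number of distinct values added so far:
    // n ~= -m * ln(zeros / m), which corrects for hash collisions.
    public long estimate() {
        int zeros = m - bits.cardinality();
        if (zeros == 0) return m;                // bitmap saturated
        return Math.round(-m * Math.log((double) zeros / m));
    }
}
```

At a few percent relative error, this kind of estimator handles distinct
counts far beyond what any in-memory exact set could.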


On 12/19/07 11:58 AM, "C G" <parallelguy@yahoo.com> wrote:

> Hi All:
>    
>   The aggregation classes in Hadoop use a HashMap to hold unique values in
> memory when computing unique counts, etc.  I ran into a situation on 32-node
> grid (4G memory/node) where a single node runs out of memory within the reduce
> phase trying to manage a very large HashMap.  This was disappointing because
> the dataset is only 44M rows (4G) of data.  This is a scenario where I am
> counting unique values associated with various events, where the total number
> of events is very small and the number of unique values is very high.  Since
> the event IDs serve as keys and the number of distinct event IDs is small,
> there is consequently a small number of reducers running, and each reducer
> is expected to manage a very large HashMap of unique values.
>    
>   It looks like I need to build my own unique aggregator, so I am looking for
> an implementation of HashMap which can spill to disk as needed.  I've
> considered using BDB as a backing store, and I've looked into Derby's
> BackingStoreHashtable as well.
>    
>   For the present time I can restructure my data in an attempt to get more
> reducers to run, but I can see in the very near future where even that will
> run out of memory.
>    
>   Any thoughts, comments, or flames?
>    
>   Thanks,
>   C G
>    
> 

