hadoop-common-user mailing list archives

From Jim Kellerman <...@powerset.com>
Subject RE: HashMap which can spill to disk for Hadoop?
Date Wed, 19 Dec 2007 20:08:42 GMT
Have you looked at org.apache.hadoop.io.MapWritable?

Jim Kellerman, Senior Engineer; Powerset

> -----Original Message-----
> From: C G [mailto:parallelguy@yahoo.com]
> Sent: Wednesday, December 19, 2007 11:59 AM
> To: hadoop-user@lucene.apache.org
> Subject: HashMap which can spill to disk for Hadoop?
> Hi All:
>   The aggregation classes in Hadoop use a HashMap to hold
> unique values in memory when computing unique counts, etc.  I
> ran into a situation on a 32-node grid (4G memory/node) where a
> single node runs out of memory within the reduce phase trying
> to manage a very large HashMap.  This was disappointing
> because the dataset is only 44M rows (4G) of data.  This is a
> scenario where I am counting unique values associated with
> various events, where the total number of events is very
> small and the number of unique values is very high.  Since
> the event IDs serve as keys and the number of distinct event
> IDs is small, there is a consequently small number of
> reducers running, where each reducer is expected to manage a
> very large HashMap of unique values.
>   It looks like I need to build my own unique aggregator, so
> I am looking for an implementation of HashMap which can spill
> to disk as needed.  I've considered using BDB as a backing
> store, and I've looked into Derby's BackingStoreHashtable as well.
>   For the present time I can restructure my data in an
> attempt to get more reducers to run, but I can see in the
> very near future where even that will run out of memory.
>   Any thoughts, comments, or flames?
>   Thanks,
>   C G
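
For illustration, here is a minimal sketch of the kind of structure the question describes: a unique-value counter that keeps a bounded set in memory and spills sorted runs to temporary files, then merges the runs to count distinct values. It is not a Hadoop class, and it is neither BDB nor Derby's BackingStoreHashtable. The class name SpillingUniqueCounter, the maxInMemory threshold, and the one-value-per-line file format are all assumptions made for the example, and values are assumed to contain no line breaks.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeSet;

/** Hypothetical helper: counts distinct strings, spilling sorted runs to disk. */
class SpillingUniqueCounter {
    private final int maxInMemory;                       // assumed tuning knob
    private final TreeSet<String> buffer = new TreeSet<>();
    private final List<Path> runs = new ArrayList<>();

    SpillingUniqueCounter(int maxInMemory) {
        this.maxInMemory = maxInMemory;
    }

    /** Adds a value; assumes the value contains no line breaks. */
    void add(String value) {
        buffer.add(value);
        if (buffer.size() >= maxInMemory) {
            spill();
        }
    }

    /** Writes the in-memory set, already sorted and de-duplicated, to a temp file. */
    private void spill() {
        try {
            Path run = Files.createTempFile("unique-run-", ".txt");
            run.toFile().deleteOnExit();
            try (BufferedWriter w = Files.newBufferedWriter(run, StandardCharsets.UTF_8)) {
                for (String s : buffer) {
                    w.write(s);
                    w.newLine();
                }
            }
            runs.add(run);
            buffer.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Merges all sorted runs and counts values that differ from their predecessor. */
    long countUnique() {
        if (runs.isEmpty()) {
            return buffer.size();                        // nothing was ever spilled
        }
        if (!buffer.isEmpty()) {
            spill();                                     // treat the remainder as a final run
        }
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        try {
            for (Path run : runs) {
                Cursor c = new Cursor(Files.newBufferedReader(run, StandardCharsets.UTF_8));
                if (c.advance()) {
                    heap.add(c);
                }
            }
            long unique = 0;
            String previous = null;
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();                  // smallest head across all runs
                if (!c.head.equals(previous)) {
                    unique++;
                    previous = c.head;
                }
                if (c.advance()) {
                    heap.add(c);
                }
            }
            return unique;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** One open run file plus the line currently at its front. */
    private static final class Cursor {
        final BufferedReader reader;
        String head;

        Cursor(BufferedReader reader) {
            this.reader = reader;
        }

        boolean advance() throws IOException {
            head = reader.readLine();
            if (head == null) {
                reader.close();
                return false;
            }
            return true;
        }
    }
}
```

Memory use is bounded by maxInMemory values plus one buffered line per run file, at the cost of extra disk I/O and a merge pass. A reducer could call add for each value it receives and countUnique once at the end, though how this fits into the aggregation classes is left open here.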
