hadoop-mapreduce-user mailing list archives

From Chase Bradford <chase.bradf...@gmail.com>
Subject Re: How to architecture large initialization in map task
Date Mon, 13 Sep 2010 20:43:03 GMT
 On 09/10/2010 02:18 AM, Angus Helm wrote:
>
> Hi all, I have a task which involves loading a large amount of data
> from a database and then using that data to process a large number of
> small files. I'm trying to split up the file processing via MapReduce,
> so that each processing task runs as a map. However, the "loading from
> a database" part takes a long time and does not need to be done for
> each map task. Preferably it would be done once on each task node, and
> the task node would then run all of its map tasks against the same
> data set. Currently the load from the database takes several minutes
> while the processing of the files takes a few seconds, so loading the
> data for every task makes the job orders of magnitude slower.
>
> My question is whether there is a well-known best practice for doing
> something like this.

If it's just the raw setup of internal data structures that's slow
(e.g. reading into a hashmap), and not fetching from the actual source
(i.e. DistributedCache wouldn't be any better), then there are a few
things you can do to reduce the number of spawned tasks.
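For context, the expensive load normally lives in the mapper's
configure() method, so it runs once per map task rather than once per
record; that's why cutting the number of tasks cuts the total load
time. A minimal sketch against the old mapred API (the class name,
"my.jdbc.url" property, and "reference_table" query are made up, not
from the thread):

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      // Built once per map task in configure(), then reused for every record.
      private final Map<String, String> lookup = new HashMap<String, String>();

      @Override
      public void configure(JobConf job) {
        try {
          // Hypothetical connection string and query -- substitute your own.
          Connection conn = DriverManager.getConnection(job.get("my.jdbc.url"));
          Statement stmt = conn.createStatement();
          ResultSet rs = stmt.executeQuery("SELECT k, v FROM reference_table");
          while (rs.next()) {
            lookup.put(rs.getString(1), rs.getString(2));
          }
          conn.close();
        } catch (Exception e) {
          throw new RuntimeException("failed to load reference data", e);
        }
      }

      @Override
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // Per-record work is cheap: it only consults the in-memory map.
        String enriched = lookup.get(value.toString());
        out.collect(value, new Text(enriched == null ? "" : enriched));
      }
    }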

The first one I'd recommend is the CombineFileInputFormat
http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html

It will send multiple files or blocks to a single task, which should
alleviate some of the overhead.
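With the mapred API linked above, CombineFileInputFormat is abstract,
so you subclass it and supply a record reader that reads one file of
the combined split at a time. Roughly like this (untested sketch; the
class names and the 128 MB cap are mine, not from the thread):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
    import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
    import org.apache.hadoop.mapred.lib.CombineFileSplit;

    // Packs many small files into each split so fewer map tasks (and
    // fewer expensive initializations) get spawned.
    public class CombinedTextInputFormat
        extends CombineFileInputFormat<LongWritable, Text> {

      public CombinedTextInputFormat() {
        // Cap the split size so some parallelism remains; 128 MB is arbitrary.
        setMaxSplitSize(128 * 1024 * 1024);
      }

      @Override
      @SuppressWarnings("unchecked")
      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new CombineFileRecordReader<LongWritable, Text>(
            job, (CombineFileSplit) split, reporter,
            (Class) SingleFileRecordReader.class);
      }

      // Reads one file of the combined split with an ordinary LineRecordReader.
      public static class SingleFileRecordReader
          implements RecordReader<LongWritable, Text> {
        private final LineRecordReader delegate;

        public SingleFileRecordReader(CombineFileSplit split, Configuration conf,
                                      Reporter reporter, Integer index)
            throws IOException {
          FileSplit fileSplit = new FileSplit(split.getPath(index),
              split.getOffset(index), split.getLength(index), split.getLocations());
          delegate = new LineRecordReader(conf, fileSplit);
        }

        public boolean next(LongWritable key, Text value) throws IOException {
          return delegate.next(key, value);
        }
        public LongWritable createKey() { return delegate.createKey(); }
        public Text createValue() { return delegate.createValue(); }
        public long getPos() throws IOException { return delegate.getPos(); }
        public float getProgress() throws IOException { return delegate.getProgress(); }
        public void close() throws IOException { delegate.close(); }
      }
    }

In the driver you'd then point the job at it with
conf.setInputFormat(CombinedTextInputFormat.class) as usual.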

If that's not an option, then you could try combining your files into a
few files with large blocks by running a simple job with an
IdentityMapper and IdentityReducer before your real job.
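That consolidation pass can be as small as the sketch below (my code,
not from the thread; note that with the default TextInputFormat the map
key is the byte offset, so output records are reordered and written as
key/value pairs -- pick input/output formats that match your data):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class ConsolidateFiles {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ConsolidateFiles.class);
        conf.setJobName("consolidate-small-files");

        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        // A small reducer count controls how many output files you end up with.
        conf.setNumReduceTasks(4);

        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
      }
    }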

Chase
