hadoop-mapreduce-user mailing list archives

From Anthony Urso <antho...@cs.ucla.edu>
Subject Re: Sharing data in a mapper for all values
Date Tue, 01 Nov 2011 03:52:20 GMT

If you have keyed both the big blob and the input files similarly, and
you can output both streams to HDFS sorted by key, then you can
reformulate this whole process as a map-side join.  It will be a lot
simpler and more efficient than scanning the whole blob for each input.
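
To make the idea concrete, here is a minimal, framework-free sketch of the merge step that a map-side join relies on. It is not Hadoop code: the class name MergeJoinSketch, the join method, and the String[] {key, value} layout are all made up for illustration. It assumes both inputs are already sorted by key and that keys in the blob are unique. In an actual job, Hadoop's CompositeInputFormat is the usual way to set this up, and it additionally expects both datasets to be partitioned identically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergeJoinSketch {

    /**
     * Joins two lists of {key, value} pairs in a single pass.
     * Assumptions (hypothetical, for illustration): both lists are sorted
     * by key, and each key appears at most once in the blob.
     */
    public static List<String> join(List<String[]> blob, List<String[]> input) {
        List<String> out = new ArrayList<>();
        int i = 0; // position in the blob
        int j = 0; // position in the input
        while (i < blob.size() && j < input.size()) {
            int cmp = blob.get(i)[0].compareTo(input.get(j)[0]);
            if (cmp < 0) {
                i++; // blob key is behind: advance the blob
            } else if (cmp > 0) {
                j++; // input key has no match in the blob: skip it
            } else {
                // Keys match: emit key, blob value, input value.
                out.add(blob.get(i)[0] + "\t" + blob.get(i)[1] + "\t" + input.get(j)[1]);
                j++; // the input may repeat a key, so only the input advances
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> blob = Arrays.asList(
                new String[] {"a", "A1"},
                new String[] {"c", "C1"},
                new String[] {"d", "D1"});
        List<String[]> input = Arrays.asList(
                new String[] {"a", "x"},
                new String[] {"b", "y"},
                new String[] {"c", "z"},
                new String[] {"c", "w"});
        for (String line : join(blob, input)) {
            System.out.println(line);
        }
    }
}
```

Each side is read once, front to back, so the cost is proportional to the combined size of the two inputs rather than blob size times number of inputs.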

Also, do whatever loading you have to do in the constructor or the
configure method to save a lot of repetition.
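
As a rough illustration of the load-once pattern, here is a small sketch with the Hadoop types stripped out so it runs on its own. The class LoadOnceSketch and its methods are hypothetical stand-ins: setup plays the role of configure(JobConf) in the old API (or Mapper.setup(Context) in the new one), and map plays the role of the per-record map call. It assumes the blob is tab-separated key/value lines and fits in memory; in a real job you would read it from HDFS (for example with FileSystem.open) instead of passing the lines in.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class LoadOnceSketch {

    // Held for the lifetime of the mapper instance and reused by every map() call.
    private final Map<String, String> blob = new HashMap<>();
    private int loadCount = 0;

    /**
     * Stand-in for configure()/setup(): the framework calls this once per
     * task, before any records arrive. Lines are assumed to be "key\tvalue".
     */
    public void setup(Iterable<String> blobLines) {
        for (String line : blobLines) {
            String[] kv = line.split("\t", 2);
            if (kv.length == 2) {
                blob.put(kv[0], kv[1]);
            }
        }
        loadCount++;
    }

    /**
     * Stand-in for map(): called once per input record. It only looks up
     * the already-loaded data and never re-reads the blob.
     * Returns null when the key is not present.
     */
    public String map(String key) {
        return blob.get(key);
    }

    /** Number of times the blob was loaded; stays at 1 for a single setup() call. */
    public int getLoadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        LoadOnceSketch mapper = new LoadOnceSketch();
        mapper.setup(Arrays.asList("k1\tv1", "k2\tv2")); // once per task
        for (String key : new String[] {"k1", "k2", "k3"}) { // once per record
            System.out.println(key + " -> " + mapper.map(key));
        }
    }
}
```

The point is only where the loading happens: anything done in map() is repeated for every record, while anything done in setup is paid once per task.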

Hope this helps,

On Mon, Oct 31, 2011 at 4:45 PM, Arko Provo Mukherjee
<arkoprovomukherjee@gmail.com> wrote:
> Hello,
> I have a situation where I am reading a big file from HDFS and then
> comparing all the data in that file with each input to the mapper.
> Now since my mapper is trying to read the entire HDFS file for each of its
> inputs, the amount of data it is having to read and keep in memory is
> becoming large (file size * no. of inputs to the mapper).
> Can we somehow avoid this by loading the file once for each mapper, such that
> the mapper can reuse the loaded file for each of the inputs that it
> receives?
> If this can be done, then for each mapper, I can just load the file once and
> then the mapper can use it for the entire slice of data that it receives.
> Thanks a lot in advance!
> Warm regards
> Arko
