hadoop-mapreduce-user mailing list archives

From Robert Evans <ev...@yahoo-inc.com>
Subject Re: How to send objects to map task?
Date Wed, 28 Sep 2011 15:03:20 GMT
Pig does serialize some classes out to the jobConf (I believe it is a Writable with base64
encoding to turn the bytes into chars).  This has been problematic in the past because
resource limits are placed on the jobConf so that it does not use up too much memory on the
JobTracker.  If it is just a small amount of data, then the jobConf is probably the simplest
place to put it (see the sketch below).
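
The idea looks roughly like this (a minimal sketch, not Pig's actual code; the class name
and conf key are made up):

import java.io.*;
import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;

public class ConfSerDe {
    // Serialize the Writable to bytes, base64 the bytes, store the string in the conf.
    public static void writeToConf(Configuration conf, String key, Writable w)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        w.write(new DataOutputStream(bytes));
        conf.set(key, new String(Base64.encodeBase64(bytes.toByteArray()), "UTF-8"));
    }

    // In the task: decode the string and rebuild the object with readFields().
    public static void readFromConf(Configuration conf, String key, Writable w)
            throws IOException {
        byte[] bytes = Base64.decodeBase64(conf.get(key).getBytes("UTF-8"));
        w.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
    }
}

Since readFields() can recurse into nested Writables, this also covers the recursive
serialization you mentioned.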
If it starts to get large, then I would suggest that you write it out to HDFS with a high
replication factor and send it through the distributed cache; the jobConf itself is just a
file written to HDFS that is sent through the distributed cache to be processed.  A sketch
of that approach follows as well.
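
For the distributed cache route, the client-side and task-side steps look something like
this (again just a sketch; the path and replication factor are arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;

public class ShipViaCache {
    // Client side: write the object to HDFS with a high replication factor,
    // then register the file with the distributed cache.
    public static void ship(Configuration conf, Writable w) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/side-data.bin");   // any HDFS path will do
        FSDataOutputStream out = fs.create(path, true, 4096,
                (short) 10,                           // high replication
                fs.getDefaultBlockSize());
        w.write(out);
        out.close();
        DistributedCache.addCacheFile(path.toUri(), conf);
    }

    // Task side (e.g. in configure()/setup()): read the local copy back.
    public static void readBack(Configuration conf, Writable w) throws Exception {
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        FSDataInputStream in = FileSystem.getLocal(conf).open(cached[0]);
        w.readFields(in);
        in.close();
    }
}

The high replication factor just keeps hundreds of tasks from hammering the same three
datanodes when they all open the file at startup.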

--Bobby Evans

On 9/27/11 5:42 PM, "Zhiwei Xiao" <zwxiao@gmail.com> wrote:


My application needs to send some objects to map tasks, which specify how to process the input
records. I know I can transfer them as strings via the configuration file, but I would prefer
to leverage Hadoop's Writable interface, since the objects require recursive serialization.

I tried to create a subclass of FileSplit to convey the data, but in the end I found it
inelegant to implement: the FileSplits are initialized in getSplits() of the InputFormat,
while the only way to initialize the InputFormat is via setConf(). So I would end up
implementing three new subclasses with the same custom fields: FileSplit, InputFormat, and Configuration.

Another approach may be to write these objects to a file on HDFS or to the DistributedCache.

I just wonder: is there a better way to do this?

Thank you.
Zhiwei Xiao
