hadoop-common-user mailing list archives

From Owen O'Malley <owen.omal...@gmail.com>
Subject Re: Passing information to Map Reduce
Date Sat, 14 Aug 2010 00:55:05 GMT
Use Sequence Files if the objects are Writable. Otherwise, you can use Java serialization.
I'm working on a patch to allow Protocol Buffers, Thrift, Writables, Java serialization, and
Avro in Sequence Files. 
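The Writable contract Owen refers to boils down to a pair of write/readFields methods over DataOutput/DataInput. Here is a minimal sketch of that pattern using only the JDK; the PageView class is hypothetical, and real Hadoop code would implement org.apache.hadoop.io.Writable and write through a SequenceFile.Writer:

```java
import java.io.*;

// Hypothetical value class following the Writable pattern:
// serialize fields with write(DataOutput), restore with readFields(DataInput).
class PageView {
    String url;
    long hits;

    void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(hits);
    }

    void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        hits = in.readLong();
    }
}

public class WritableSketch {
    public static void main(String[] args) throws IOException {
        PageView original = new PageView();
        original.url = "/index.html";
        original.hits = 42;

        // Round-trip through a byte buffer, as the framework would over the wire.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buf));

        PageView copy = new PageView();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));

        System.out.println(copy.url + " " + copy.hits);
    }
}
```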

-- Owen

On Aug 13, 2010, at 17:41, Pete Tyler <peteralantyler@gmail.com> wrote:

> Distributed cache looks hopeful. However, at first glance it looks good for distributing
> files but not instance data. Ideally I'm looking for something similar to, say, objects being
> passed between client and server by RMI.
> -Pete
> On Aug 13, 2010, at 3:15 PM, Owen O'Malley <omalley@apache.org> wrote:
>> On Aug 13, 2010, at 12:55 PM, Pete Tyler wrote:
>>> I have only found two options, neither of which I really like,
>>> 1. Encode information in the job name string - a bit hokey and limited to strings
>> I'd state this as: encode the information into a string and add it to the JobConf.
>> Look at the Base64 class if you want to Base64-encode your data. This is easiest, but causes
>> problems if the JobConf gets much above 2 MB or so.
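Owen's suggestion above can be sketched with the JDK's own ObjectOutputStream and Base64 classes. A plain Map stands in for the JobConf here, and the property name my.job.params is made up; real code would call conf.set(...) in the driver and conf.get(...) inside the task:

```java
import java.io.*;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class ConfSmuggle {
    // Serialize any Serializable object to a Base64 string that can be
    // stored as a JobConf/Configuration property.
    static String encode(Serializable obj) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(obj);
        }
        return Base64.getEncoder().encodeToString(buf.toByteArray());
    }

    static Object decode(String s) throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(s);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // A HashMap stands in for the JobConf.
        Map<String, String> conf = new HashMap<>();

        HashMap<String, Integer> params = new HashMap<>();
        params.put("threshold", 10);
        conf.put("my.job.params", encode(params));

        // Task side: pull the string back out and deserialize it.
        @SuppressWarnings("unchecked")
        HashMap<String, Integer> restored =
            (HashMap<String, Integer>) decode(conf.get("my.job.params"));
        System.out.println(restored.get("threshold"));
    }
}
```

Note that java.util.Base64 only exists on Java 8 and later; on the Hadoop versions current at the time of this thread, the equivalent would be the Base64 codec from Apache Commons.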
>>> 2. Persist the information, which changes from job to job - if every map task
>>> and every reduce task has to read one piece of specific, persisted data that may be stored
>>> on another node, won't this have significant performance implications?
>> This is generally the preferred strategy. In particular, the framework supports the
>> "distributed cache" which will cause files from HDFS to be downloaded to each node before
>> the tasks run. The files will only be downloaded once for each node. Files in the distributed
>> cache can be a couple of GB without huge performance problems.
>> -- Owen
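For the distributed-cache route described above, the driver-side and task-side calls look roughly like this. This is a sketch against the Hadoop 0.20-era org.apache.hadoop.filecache.DistributedCache API, assuming Hadoop on the classpath; the HDFS path is hypothetical:

```java
// Driver side: register an HDFS file with the job before submitting it.
DistributedCache.addCacheFile(new URI("/shared/lookup.dat"), conf);

// Task side (e.g. in a Mapper's configure/setup method):
Path[] cached = DistributedCache.getLocalCacheFiles(conf);
// cached[0] now points at a node-local copy of /shared/lookup.dat,
// downloaded once per node rather than once per task.
```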
