hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: What happens if the caller mutates key or value after calling OutputCollector.collect(key,value) ?
Date Mon, 13 Feb 2012 05:53:10 GMT
Mike,

On Mon, Feb 13, 2012 at 10:31 AM, Mike Spreitzer <mspreitz@us.ibm.com> wrote:
> Ah, so my question was not clear.  Let me try again.  Suppose my map or
> reduce method invokes output.collect(key, value) and then, after that
> returns, mutates (side effects) either the key or the value --- what
> happens?  Is this specified, forbidden, or unspecified?  Is there a general
> answer, or is this up to the OutputFormat?

This is up to your RecordWriter. If it accumulates the passed key/value
objects as references and serializes them later, you will run into
issues doing what you describe. In that case the writer should clone
each object when it receives it.

But all of the Hadoop-provided implementations serialize, or create a
new object, immediately upon the call, and hence remain unaffected by
later mutations of the key/value objects at the source.
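
To make the difference concrete, here is a minimal sketch in plain Java
with no Hadoop dependency. StringBuilder stands in for a mutable
Writable, and both writer classes are hypothetical illustrations, not
Hadoop APIs. One writer keeps references and serializes at flush time;
the other copies at call time.

```java
import java.util.ArrayList;
import java.util.List;

public class BufferingDemo {
    // Hypothetical writer that stores references and serializes later.
    static class ReferenceBufferingWriter {
        private final List<StringBuilder> buffered = new ArrayList<>();

        void write(StringBuilder value) {
            buffered.add(value); // keeps the caller's object, not a copy
        }

        List<String> flush() {
            List<String> out = new ArrayList<>();
            for (StringBuilder sb : buffered) {
                out.add(sb.toString()); // sees whatever the object holds now
            }
            return out;
        }
    }

    // Hypothetical writer that copies the contents at call time.
    static class CopyingWriter {
        private final List<String> buffered = new ArrayList<>();

        void write(StringBuilder value) {
            buffered.add(value.toString()); // immutable snapshot
        }

        List<String> flush() {
            return new ArrayList<>(buffered);
        }
    }

    static List<String> runReferenceBuffering() {
        ReferenceBufferingWriter writer = new ReferenceBufferingWriter();
        StringBuilder value = new StringBuilder("first");
        writer.write(value);
        value.setLength(0);      // caller reuses the same object
        value.append("second");
        writer.write(value);
        return writer.flush();
    }

    static List<String> runCopying() {
        CopyingWriter writer = new CopyingWriter();
        StringBuilder value = new StringBuilder("first");
        writer.write(value);
        value.setLength(0);
        value.append("second");
        writer.write(value);
        return writer.flush();
    }

    public static void main(String[] args) {
        System.out.println(runReferenceBuffering()); // [second, second]
        System.out.println(runCopying());            // [first, second]
    }
}
```

The reference-buffering writer loses the first record because both
list entries point at the same object.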

> I think you are saying that for SequenceFileOutputFormat and
> TextOutputFormat, the scenario I outlined is allowed and the original values
> appear in the output.  Have I got that right?  If so, what remains to be
> answered then is whether this is something I have to answer for each
> OutputFormat or there is a general rule here.

OutputFormat and RecordWriter are merely interfaces that prepare the
output and connect you to the FS. You are free to write their
implementations as you want to -- the framework does not get in your
way here, so there is no general rule it enforces for you.

> Suppose, in particular, that my map or reduce method calls
> output.collect(key,value) several times in series --- each time passing the
> same object reference for key, and each time passing the same object
> reference for value, but modifying those objects between calls on
> output.collect.  I would like to know if this is a supported scenario, with
> the semantics that what is output is the contents of the key and value
> objects at the moment output.collect(key,value) is called.

My first answer above covers this - with the Hadoop-provided record
writers, every call gets serialized with the values passed in at that
moment - so yes, you can call that a supported scenario.
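
As a rough sketch of that reuse pattern, again in plain Java with no
Hadoop dependency: SerializingCollector below is a hypothetical
stand-in for a collector whose writer serializes on every call, and
StringBuilder and int[] stand in for mutable key and value types. The
same two objects are passed on every call and modified in between.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {
    // Hypothetical collector: writes bytes immediately on each collect().
    static class SerializingCollector {
        private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        private final DataOutputStream out = new DataOutputStream(bytes);
        private int records = 0;

        void collect(StringBuilder key, int[] value) throws IOException {
            out.writeUTF(key.toString()); // contents at call time
            out.writeInt(value[0]);
            records++;
        }

        // Decode everything written so far as "key=value" strings.
        List<String> readBack() throws IOException {
            out.flush();
            DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()));
            List<String> result = new ArrayList<>();
            for (int i = 0; i < records; i++) {
                String k = in.readUTF();
                int v = in.readInt();
                result.add(k + "=" + v);
            }
            return result;
        }
    }

    static List<String> run() {
        try {
            SerializingCollector output = new SerializingCollector();
            StringBuilder key = new StringBuilder(); // reused for every record
            int[] value = new int[1];                // reused for every record
            String[] words = {"a", "b", "c"};
            for (int i = 0; i < words.length; i++) {
                key.setLength(0);
                key.append(words[i]);
                value[0] = i + 1;
                output.collect(key, value);
            }
            return output.readBack();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // [a=1, b=2, c=3]
    }
}
```

Each record reflects the key and value contents at the moment of the
call, because nothing is retained except the bytes already written.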

> Have I overlooked some documentation that answers my question?

Perhaps. Could you please identify the spot where docs can carry this
for programmers (RecordWriter API) and file an issue on JIRA with that
knowledge (and possibly a patch)? Thanks! :)

-- 
Harsh J
Customer Ops. Engineer
Cloudera | http://tiny.cloudera.com/about
