hadoop-mapreduce-issues mailing list archives

From "Harsh J (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2001) Enhancement to SequenceFileOutputFormat to allow user to set MetaData
Date Mon, 07 May 2012 07:16:19 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13269420#comment-13269420 ]

Harsh J commented on MAPREDUCE-2001:

bq. The users would then, in their mapper or reducer configure or setup method, call SequenceFileOutputFormat.setMetadata
with the appropriate metadata object that they would create.

The problem exposed by this approach hits upon a possible inconsistency/bug in the framework:

| Record Writer Instantiation | Old API | New API |
| Map Task | Before Mapper | After Mapper |
| Reduce Task | After Reducer | After Reducer |

See MapTask.java/ReduceTask.java in 1.x for instance, methods run{Old/New}{Mapper/Reducer}.
This has been so for a very long time, and I think changing it may break behavior
for several users out there, including some of the code I've written at my former workplace.
Though yeah, it's highly strange that no spec doc exists for this; we ought to have one via another

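Hence the mapper.configure() approach with a static method would unfortunately fail on
old API runs for map-only jobs: there the record writer is instantiated before the mapper,
so metadata set in configure() arrives too late.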
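
To make the ordering problem concrete, here is a toy model in plain Java. It is not Hadoop code: ToyRecordWriter, ToyMapper and the static field are hypothetical stand-ins, and the sketch assumes the writer reads the static metadata once, at construction time.

```java
// Toy model only: none of these classes exist in Hadoop.
final class WriterOrderingSketch {
    // Stand-in for metadata set through a static setter (assumption).
    static String staticMetadata = null;

    // Assumption: the writer captures the metadata when it is constructed.
    static final class ToyRecordWriter {
        final String metadata;
        ToyRecordWriter() {
            this.metadata = staticMetadata;
        }
    }

    // Stand-in for a user mapper that sets metadata in configure().
    static final class ToyMapper {
        void configure() {
            staticMetadata = "owner=alice";
        }
    }

    // Ordering from the table for an old API map task: writer first, mapper second.
    static String writerBeforeMapper() {
        staticMetadata = null;
        ToyRecordWriter writer = new ToyRecordWriter();
        new ToyMapper().configure();
        return writer.metadata; // the writer never saw the metadata
    }

    // Opposite ordering: mapper configured first, writer created afterwards.
    static String writerAfterMapper() {
        staticMetadata = null;
        new ToyMapper().configure();
        return new ToyRecordWriter().metadata;
    }

    public static void main(String[] args) {
        System.out.println("writer before mapper: " + writerBeforeMapper());
        System.out.println("writer after mapper:  " + writerAfterMapper());
    }
}
```

In the first ordering the writer ends up with no metadata, which is the failure mode described above for old API map-only jobs.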

bq. Then we make the SequenceFileOutputFormat JobConfigurable so that ReflectionUtils.newInstance
will call configure on it and load the metadata.

I imagine this working much better. New API users may still be able to
sneak in changes per map/reduce task; otherwise (on the old API) they rely on the driver to
provide these.

bq. I think we should avoid users having to subclass SequenceFileOutputFormat. Thoughts?

Agreed, given your new approach via jobconf. Let's also make sure we serialize with base64
encoding or so, to allow for special chars in metadata if users so wish (because job.xml
dislikes special chars).
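
A minimal sketch of what such a base64 round trip could look like, using only the JDK. The key name, the class, and the use of a plain Map and java.util.Properties in place of SequenceFile metadata and the jobconf are all assumptions made for illustration; this is not the format used by the attached patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper: encodes metadata pairs into one config-safe string.
final class MetadataConfCodec {
    // Hypothetical property name, not an actual Hadoop key.
    static final String KEY = "example.seqfile.metadata";

    // Each key and value is base64-encoded; ':' and ',' are safe separators
    // because neither character occurs in the standard base64 alphabet.
    static String encode(Map<String, String> metadata) {
        Base64.Encoder enc = Base64.getEncoder();
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (sb.length() > 0) {
                sb.append(',');
            }
            sb.append(enc.encodeToString(e.getKey().getBytes(StandardCharsets.UTF_8)))
              .append(':')
              .append(enc.encodeToString(e.getValue().getBytes(StandardCharsets.UTF_8)));
        }
        return sb.toString();
    }

    static Map<String, String> decode(String encoded) {
        Map<String, String> out = new TreeMap<>();
        if (encoded == null || encoded.isEmpty()) {
            return out;
        }
        Base64.Decoder dec = Base64.getDecoder();
        for (String pair : encoded.split(",")) {
            // Limit -1 keeps a trailing empty string, so empty values survive.
            String[] kv = pair.split(":", -1);
            String key = new String(dec.decode(kv[0]), StandardCharsets.UTF_8);
            String value = new String(dec.decode(kv[1]), StandardCharsets.UTF_8);
            out.put(key, value);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new TreeMap<>();
        metadata.put("owner", "a<b>&\"c\"");

        // Properties stands in for the job configuration here.
        Properties conf = new Properties();
        conf.setProperty(KEY, encode(metadata));

        System.out.println(decode(conf.getProperty(KEY)).equals(metadata));
    }
}
```

The driver would store the encoded string under the key, and the output format would decode it when it is configured, so characters such as < and & never appear raw in job.xml.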
> Enhancement to SequenceFileOutputFormat to allow user to set MetaData
> ---------------------------------------------------------------------
>                 Key: MAPREDUCE-2001
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2001
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 0.20.2
>            Reporter: David Rosenstrauch
>            Priority: Minor
>         Attachments: MAPREDUCE-2001.patch
> The org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat class currently does
not provide a way for the user to pass in a MetaData object to be written to the SequenceFile.
> Currently the only way for a developer to implement this functionality appears to be to
create a subclass which overrides the SequenceFileOutputFormat's getRecordWriter() method,
which is a bit of a kludge.
> This seems to be a common enough request to warrant a fix of some sort.  (It's already
been brought up twice in the past year:  http://www.mail-archive.com/common-user@hadoop.apache.org/msg02198.html
and http://www.mail-archive.com/mapreduce-user@hadoop.apache.org/msg00904.html)
> A couple of possible solutions:
> 1) provide a static method SequenceFileOutputFormat.setMetaData(Job, MetaData)
> 2) Provide a (non-static) setMetaData() method on the SequenceFileOutputFormat class.
 The user would create a subclass of SequenceFileOutputFormat which, say, implements Configurable.
 Then in the setConf() method, the user could create the MetaData object (using data from
the Configuration), and then call setMetaData.  The SequenceFileOutputFormat would then use
this MetaData object when creating the SequenceFile.  (Note that the user would have to create
a subclass of SequenceFileOutputFormat to make this solution work.)
