hadoop-mapreduce-issues mailing list archives

From "Jay Booth (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-326) The lowest level map-reduce APIs should be byte oriented
Date Thu, 04 Feb 2010 21:05:29 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12829779#action_12829779 ]

Jay Booth commented on MAPREDUCE-326:

This sounds awesome.

Is this roughly the workflow you're envisioning?

1)  Kernel starts the map process and sends Split information via stdin
2)  Framework reads the Split info, uses it to instantiate the InputFormat and the userland
Mapper class, and runs Map with output going to stdout
3)  Kernel routes the map output to the appropriate partitions
4)  Kernel executes the shuffle; framework or kernel does the sort (TBD?  Maybe Kernel defaults
to byte[] comparison but allows Framework to override?)
5)  Kernel starts the reduce process; framework reads some sort of ReduceSplit with partition
info and creates the userland Reducer
6)  Framework executes the userland Reducer and pipes its output through the kernel to the
reduce output location
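
To make the map-side steps and the default sort concrete, here is a minimal sketch in Python.
Everything in it is an assumption for illustration, not something the proposal specifies: the
length-prefixed record framing, the CRC32 partitioner, and the names write_record, read_records,
partition, kernel_shuffle and word_mapper are all hypothetical. The "kernel" side only ever
sees byte strings, and it sorts by raw byte comparison unless the framework supplies its own
sort key.

```python
import struct
import zlib
from typing import BinaryIO, Callable, Iterator, List, Tuple

Record = Tuple[bytes, bytes]  # (key bytes, value bytes); the kernel never decodes them


def write_record(out: BinaryIO, key: bytes, value: bytes) -> None:
    """Framework side: emit one record as two 4-byte big-endian lengths plus payload."""
    out.write(struct.pack(">II", len(key), len(value)))
    out.write(key)
    out.write(value)


def read_records(src: BinaryIO) -> Iterator[Record]:
    """Kernel side: read framed records until the stream is exhausted."""
    while True:
        header = src.read(8)
        if not header:
            return
        if len(header) < 8:
            raise EOFError("truncated record header")
        key_len, value_len = struct.unpack(">II", header)
        key = src.read(key_len)
        value = src.read(value_len)
        if len(key) < key_len or len(value) < value_len:
            raise EOFError("truncated record body")
        yield key, value


def partition(key: bytes, num_partitions: int) -> int:
    """Assumed partitioner: a stable hash of the raw key bytes."""
    return zlib.crc32(key) % num_partitions


def kernel_shuffle(
    src: BinaryIO,
    num_partitions: int,
    sort_key: Callable[[bytes], object] = lambda k: k,
) -> List[List[Record]]:
    """Route records to partitions, then sort each partition by key.

    The default sort_key compares raw bytes lexicographically; a framework
    could pass its own function to override the ordering.
    """
    parts: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in read_records(src):
        parts[partition(key, num_partitions)].append((key, value))
    for records in parts:
        records.sort(key=lambda kv: sort_key(kv[0]))
    return parts


def word_mapper(split: bytes, out: BinaryIO) -> None:
    """Hypothetical userland mapper: emit (word, b"1") for each word in the split."""
    for word in split.split():
        write_record(out, word, b"1")
```

In a real task the mapper would write to stdout and the kernel would read the other end of
the pipe; for a quick check, an io.BytesIO buffer can stand in for both. Passing something
like sort_key=bytes.lower would be one way for a framework to ask for a case-insensitive
ordering instead of the byte[] default.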

Is that more or less accurate?  I think it'd be awesome.  Being able to run tasks in different
languages is going to become more and more important.  JRuby and Clojure are good to go right
now as far as a DFSClient goes, and other languages are doable via dfs -cat, as Doug said.
This would be huge for us: it would reduce development time and make the logic of our MR jobs
more accessible to business people.

> The lowest level map-reduce APIs should be byte oriented
> --------------------------------------------------------
>                 Key: MAPREDUCE-326
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-326
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: eric baldeschwieler
> As discussed here:
> https://issues.apache.org/jira/browse/HADOOP-1986#action_12551237
> The templates, serializers and other complexities that allow map-reduce to use arbitrary
> types complicate the design and lead to lots of object creates and other overhead that a byte
> oriented design would not suffer.  I believe the lowest level implementation of hadoop map-reduce
> should have byte string oriented APIs (for keys and values).  This API would be more performant,
> simpler and more easily cross language.
> The existing API could be maintained as a thin layer on top of the leaner API.
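
One way to picture the "thin layer on top" idea is the Python sketch below. It is only an
illustration under stated assumptions: Codec, TEXT, UINT64, typed_mapper and count_words are
hypothetical names, not part of any Hadoop API. The low-level contract is a function from
(key bytes, value bytes) to an iterator of (key bytes, value bytes); the typed layer just
decodes on the way in and encodes on the way out.

```python
import struct
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, Tuple

# Assumed low-level contract: bytes in, bytes out.
BytesMapper = Callable[[bytes, bytes], Iterator[Tuple[bytes, bytes]]]


@dataclass(frozen=True)
class Codec:
    """A pair of functions converting between a typed value and its bytes."""
    encode: Callable[[Any], bytes]
    decode: Callable[[bytes], Any]


# Two example codecs. Big-endian unsigned integers keep numeric order
# consistent with raw byte comparison for non-negative values.
TEXT = Codec(lambda s: s.encode("utf-8"), lambda b: b.decode("utf-8"))
UINT64 = Codec(lambda n: struct.pack(">Q", n), lambda b: struct.unpack(">Q", b)[0])


def typed_mapper(
    fn: Callable[[Any, Any], Iterable[Tuple[Any, Any]]],
    in_key: Codec,
    in_value: Codec,
    out_key: Codec,
    out_value: Codec,
) -> BytesMapper:
    """Wrap a typed map function so it satisfies the byte-level contract."""
    def run(key_bytes: bytes, value_bytes: bytes) -> Iterator[Tuple[bytes, bytes]]:
        for key, value in fn(in_key.decode(key_bytes), in_value.decode(value_bytes)):
            yield out_key.encode(key), out_value.encode(value)
    return run


def count_words(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """Typed userland mapper: emit (word, 1) for each word on the line."""
    for word in line.split():
        yield word, 1


# The byte-level engine only ever sees this wrapped function.
byte_level_count_words = typed_mapper(count_words, UINT64, TEXT, TEXT, UINT64)
```

A job that wants to avoid the decode and encode cost could implement the BytesMapper shape
directly and skip the wrapper, which is the kind of saving the issue description is after.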

