hadoop-mapreduce-issues mailing list archives

From "Jay Booth (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-326) The lowest level map-reduce APIs should be byte oriented
Date Tue, 16 Feb 2010 19:21:28 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834395#action_12834395 ]

Jay Booth commented on MAPREDUCE-326:
-------------------------------------

Ok, maybe I got ahead of myself :)

Basically, I see this:
{quote}
public abstract void map(TaskSplitIndex splitIndex,
    RawMapOutputCollector collector, RawMapContext context)
    throws IOException, InterruptedException;
{quote}

as meaning "Your tasks just have to worry about an input split and then blasting their output
to the framework as bytes". It wouldn't be too far a leap from there to write a runtime
for mapred in other languages. Anyway, what was most salient to me personally was that
there'd now be an API level where fetching/gathering your InputSplit could be handled at the
job/framework/mapper level. If I want to do that now, I have to write a set of control files
and throw data locality out the window.  Decoupling the lower-level APIs from a specific
serialization framework would seem to be a win as well: AvroMapReduce, or whatever it's
called, could be built right alongside the existing WritableMapReduce, which would
seem to make more sense than building one on top of the other.
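
To make the decoupling concrete, here is a rough sketch of a typed layer sitting on top of a bytes-only collector. None of these types come from the attached patch: RawCollector, Serializer, TypedCollector and the two serializers are hypothetical stand-ins, and the sketch drops the checked exceptions from the real signature to stay short.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are taken from the MAPREDUCE-326 patch.
class RawApiSketch {

    /** Stand-in for a byte-oriented collector: keys and values are opaque byte arrays. */
    interface RawCollector {
        void collect(byte[] key, byte[] value);
    }

    /** Pluggable serialization; a Writable-style or Avro-style layer would each supply its own. */
    interface Serializer<T> {
        byte[] toBytes(T value);
    }

    /** Thin typed layer: serializes, then hands plain bytes to the raw collector. */
    static final class TypedCollector<K, V> {
        private final RawCollector raw;
        private final Serializer<K> keySerializer;
        private final Serializer<V> valueSerializer;

        TypedCollector(RawCollector raw, Serializer<K> keySerializer, Serializer<V> valueSerializer) {
            this.raw = raw;
            this.keySerializer = keySerializer;
            this.valueSerializer = valueSerializer;
        }

        void collect(K key, V value) {
            raw.collect(keySerializer.toBytes(key), valueSerializer.toBytes(value));
        }
    }

    /** UTF-8 bytes for strings. */
    static final Serializer<String> UTF8 = s -> s.getBytes(StandardCharsets.UTF_8);

    /** Four-byte big-endian encoding for ints. */
    static final Serializer<Integer> INT_BE = i -> ByteBuffer.allocate(4).putInt(i).array();

    /** Word-count style map step: emits (word, 1) pairs, which reach the sink as raw bytes. */
    static List<byte[][]> mapLine(String line) {
        List<byte[][]> sink = new ArrayList<>();
        RawCollector raw = (k, v) -> sink.add(new byte[][] {k, v});
        TypedCollector<String, Integer> out = new TypedCollector<>(raw, UTF8, INT_BE);
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.collect(word, 1);
            }
        }
        return sink;
    }
}
```

The raw collector never sees a type, so swapping UTF8 or INT_BE for a different serializer changes nothing underneath it. That is the sense in which two serialization layers could sit side by side instead of one on top of the other.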

If I understand the current proposal correctly, we could have a join where one mapper class
is pulling a big select statement from a DB, another is crunching some big compressed sequence
files, and another is pulling in a bunch of tiny Hive partitions using CombineFileInput, without
them stepping all over each other and creating "last one wins" configuration conditions.
This is theoretically doable under the current framework but it involves a lot of shoehorning,
so that's the itch it would be scratching for me.  Making the framework serialization-agnostic
would, I think, be an even bigger win. Writables are clean and light but they're not the be-all
and end-all of serialization.
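
As a toy illustration of the "last one wins" problem (not how Hadoop's Configuration or the proposed API actually work), the sketch below contrasts one job-wide settings map with one settings map per input. The key "input.source", the input names and every class here are made up for the example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the key "input.source" and all names are invented for illustration.
class PerInputConfigSketch {

    /** One logical input of the join plus the settings its mapper needs. */
    static final class InputSpec {
        final String name;
        final Map<String, String> settings;

        InputSpec(String name, String source) {
            this.name = name;
            this.settings = Collections.singletonMap("input.source", source);
        }
    }

    /** The three inputs of the join described above. */
    static List<InputSpec> demoInputs() {
        return Arrays.asList(
            new InputSpec("db", "select-statement"),
            new InputSpec("seqfiles", "compressed-sequence-files"),
            new InputSpec("hive", "combined-small-partitions"));
    }

    /** Single job-wide map: every input writes the same key, so the last one wins. */
    static Map<String, String> sharedConfig(List<InputSpec> inputs) {
        Map<String, String> conf = new LinkedHashMap<>();
        for (InputSpec input : inputs) {
            conf.putAll(input.settings);
        }
        return conf;
    }

    /** One settings map per input, so no input can overwrite another's values. */
    static Map<String, Map<String, String>> scopedConfig(List<InputSpec> inputs) {
        Map<String, Map<String, String>> conf = new LinkedHashMap<>();
        for (InputSpec input : inputs) {
            conf.put(input.name, input.settings);
        }
        return conf;
    }
}
```

With the shared map only the Hive input's setting survives. With the scoped map each mapper can look up its own entry by input name.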

I guess I just see the proposed binary-level framework as a "gateway condition" to a whole
bunch of wins.  As long as everything is directly tied to Hadoop Writables in Java, there's
only going to be so far we can go beyond the basic wordcount program.  If we have a common,
robust, low-level binary API that's exposed for all to use, we could rapidly see framework
implementations in a few languages, more flexible input methods, different serialization formats,
non-mapreduce distributed computing ("just distribute these runnables across the cluster and
tell me when they're done"), etc.  The immediate goal of having Avro talk directly to bytes,
instead of Avro talking to Writables that talk to bytes, seems to be a decent enough short-term
win to justify the work, IMO, especially when you consider the long-term flexibility.

> The lowest level map-reduce APIs should be byte oriented
> --------------------------------------------------------
>
>                 Key: MAPREDUCE-326
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-326
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: eric baldeschwieler
>         Attachments: MAPREDUCE-326-api.patch, MAPREDUCE-326.pdf
>
>
> As discussed here:
> https://issues.apache.org/jira/browse/HADOOP-1986#action_12551237
> The templates, serializers and other complexities that allow map-reduce to use arbitrary
> types complicate the design and lead to lots of object creates and other overhead that a byte
> oriented design would not suffer.  I believe the lowest level implementation of hadoop map-reduce
> should have byte string oriented APIs (for keys and values).  This API would be more performant,
> simpler and more easily cross language.
> The existing API could be maintained as a thin layer on top of the leaner API.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

