hadoop-mapreduce-issues mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-326) The lowest level map-reduce APIs should be byte oriented
Date Mon, 15 Feb 2010 18:17:28 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12833907#action_12833907 ]

Doug Cutting commented on MAPREDUCE-326:

Chris> If the goal is a faster binary API, then this should use NIO primitives [ ... ]

Perhaps this might instead look something like:

import java.nio.channels.WritableByteChannel;

interface RawMapper {
  void map(Split split, RawMapOutput output);
}

interface RawMapOutput {
  // records are written as contiguous byte ranges here
  WritableByteChannel getChannel();
  // call with the position of each record in the data written
  void addRecord(long start, long length);
  // utility to help keep track of bytes written
  long getBytesWritten();
}

The goal is to permit the kernel to identify record boundaries (so that it can compare, sort
and transmit records) while minimizing per-record data copying.  Getting this
API right without benchmarking might prove difficult.  We should benchmark it under various
scenarios: a key/value pair of Writable instances, line-based data from a text file, and
length-delimited raw binary data.
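
To make the intended call pattern concrete, here is a minimal sketch of how a mapper might drive such an output for the line-based scenario. It is an illustration only, not part of any patch on this issue: InMemoryRawMapOutput, LineMapperSketch, writeLines, recordCount and record are hypothetical names, the in-memory buffer stands in for whatever the framework would really provide, and the RawMapOutput interface is adapted from the one proposed above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Adapted from the interface proposed in this comment.
interface RawMapOutput {
  WritableByteChannel getChannel();
  void addRecord(long start, long length);
  long getBytesWritten();
}

// Hypothetical in-memory implementation, for illustration only.
class InMemoryRawMapOutput implements RawMapOutput {
  private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
  private final WritableByteChannel sink = Channels.newChannel(buffer);
  private final List<long[]> records = new ArrayList<>();
  private long bytesWritten = 0;

  // Wraps the sink so that every byte written is counted.
  private final WritableByteChannel counting = new WritableByteChannel() {
    @Override public int write(ByteBuffer src) throws IOException {
      int n = sink.write(src);
      bytesWritten += n;
      return n;
    }
    @Override public boolean isOpen() { return sink.isOpen(); }
    @Override public void close() throws IOException { sink.close(); }
  };

  @Override public WritableByteChannel getChannel() { return counting; }

  @Override public void addRecord(long start, long length) {
    // A record must lie entirely within the bytes already written.
    if (start < 0 || length < 0 || start + length > bytesWritten) {
      throw new IllegalArgumentException("record outside written data");
    }
    records.add(new long[] {start, length});
  }

  @Override public long getBytesWritten() { return bytesWritten; }

  int recordCount() { return records.size(); }

  // Copies one record back out; a real framework would avoid this copy.
  byte[] record(int i) {
    long[] r = records.get(i);
    return Arrays.copyOfRange(buffer.toByteArray(), (int) r[0], (int) (r[0] + r[1]));
  }
}

// Hypothetical mapper body: each line of text becomes one record.
class LineMapperSketch {
  static void writeLines(String text, RawMapOutput out) {
    try {
      for (String line : text.split("\n")) {
        long start = out.getBytesWritten();
        ByteBuffer bytes = StandardCharsets.UTF_8.encode(line);
        while (bytes.hasRemaining()) {
          out.getChannel().write(bytes);
        }
        // Tell the framework where the record just written begins and ends.
        out.addRecord(start, out.getBytesWritten() - start);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    InMemoryRawMapOutput out = new InMemoryRawMapOutput();
    writeLines("first line\nsecond line", out);
    System.out.println(out.recordCount() + " records, " + out.getBytesWritten() + " bytes");
  }
}
```

The mapper writes bytes once, into the channel, and reports only offsets and lengths afterwards, so the framework can locate records without a per-record object. Whether this actually reduces copying in practice is exactly what the proposed benchmarks would need to show.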

Chris> Better pipes/streaming workflows are explicitly considered in MAPREDUCE-1183; one
can imagine an implementation of the MapTask or ReduceTask loading its user code in an implementation
written in the native language.

Can you please elaborate?  I don't see the words "pipes" or "streaming" mentioned in that
issue.  How does one load Python, Ruby, C++, etc. into Java?  MAPREDUCE-1183 seems to me just
to be a different way to encapsulate configuration data, grouping it per extension point rather
than centralizing it in the job config.

> The lowest level map-reduce APIs should be byte oriented
> --------------------------------------------------------
>                 Key: MAPREDUCE-326
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-326
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: eric baldeschwieler
>         Attachments: MAPREDUCE-326-api.patch, MAPREDUCE-326.pdf
> As discussed here:
> https://issues.apache.org/jira/browse/HADOOP-1986#action_12551237
> The templates, serializers and other complexities that allow map-reduce to use arbitrary
> types complicate the design and lead to lots of object creates and other overhead that a byte
> oriented design would not suffer.  I believe the lowest level implementation of hadoop map-reduce
> should have byte string oriented APIs (for keys and values).  This API would be more performant,
> simpler and more easily cross language.
> The existing API could be maintained as a thin layer on top of the leaner API.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
