hadoop-mapreduce-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6538) Deprecate hadoop-pipes
Date Fri, 06 Nov 2015 22:43:11 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994608#comment-14994608 ]

Colin Patrick McCabe commented on MAPREDUCE-6538:
-------------------------------------------------

bq. \[The Java client APIs provide significant advantages that neither streaming nor pipes
provide\]... is a false statement. Partitioning, for example, can't be done natively in streaming
code but can in pipes. In streaming, you can only provide a Java class.

I agree that supporting partitioning is an advantage of pipes that streaming doesn't have.
There are still advantages that the Java API has over both, which is the point I was making.
I also don't see a fundamental reason why streaming couldn't be extended to support partitioning,
which would be beneficial to languages like Python that can't use pipes.
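
For readers unfamiliar with the term, partitioning is the step that decides which reducer
receives each key. A minimal sketch of the idea in Python follows. The function name
{{partition}} and the use of CRC32 are illustrative assumptions, not an existing streaming or
pipes API; the sketch only shows the general hash-modulo technique.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Map a key to a reducer index in the range [0, num_reducers).

    Hypothetical helper for illustration only. CRC32 is used because it is
    stable across processes, unlike Python's built-in hash() for strings,
    which is randomized per interpreter run by default.
    """
    if num_reducers <= 0:
        raise ValueError("num_reducers must be positive")
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Group a few sample keys by the reducer they would be routed to.
buckets = {}
for k in ["apple", "banana", "cherry", "apple"]:
    buckets.setdefault(partition(k, 3), []).append(k)
```

Equal keys always land in the same bucket, which is the property a reducer relies on.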

bq. Correct. Because if the MR code is being written in C++, why would one use the less functional
streaming API? If one believes that MR jobs consist of nothing but reading and writing KVs
I could see that, but there's a lot more going on under the hood in more advanced jobs. That
functionality is just flat-out not available in streaming.

I would personally prefer either to use a JVM language or to work with the simple and clean
stdin/stdout paradigm of streaming, rather than deal with pipes.
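
As an illustration of that stdin/stdout paradigm, here is a minimal sketch of a word-count
style streaming mapper in Python. It assumes the common streaming convention of one record
per line with the key and value separated by a tab; the names {{map_line}} and {{run}} are
made up for this example.

```python
import sys


def map_line(line):
    """Return one "word<TAB>1" record for each whitespace-separated word."""
    return [f"{word}\t1" for word in line.split()]


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Read input lines from stdin and write mapped records to stdout."""
    for line in stdin:
        for record in map_line(line):
            stdout.write(record + "\n")
```

A real mapper script would call {{run()}} at the bottom of the file. The streams are
parameters here only so that the logic can be exercised without a running job.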

There is a lot of technical debt in pipes.  It is hardcoded to output log messages to stderr
using {{fprintf}}.  Keys and values need to be serialized to C++ {{std::string}} objects.
It doesn't follow the same coding style as the other C++ code in Hadoop.  It builds at {{\-O0}}
and doesn't generate a {{.so}}, just a {{.a}}.  There is no unit test suite, no definition of
what the API is or how it's allowed to change over time, and very little documentation.

[~aw], since you are committed to keeping pipes around, can you please file follow-on JIRAs
for fixing these issues and link them to this JIRA?  I will close this as WONTFIX.  We can
always revisit this later if things change.

> Deprecate hadoop-pipes
> ----------------------
>
>                 Key: MAPREDUCE-6538
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6538
>             Project: Hadoop Map/Reduce
>          Issue Type: Wish
>          Components: pipes
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>            Priority: Minor
>
> Development appears to have stopped on hadoop-pipes upstream for the last few years,
> aside from very basic maintenance.  Hadoop streaming seems to be a better alternative, since
> it supports more programming languages and is better implemented.
> There were no responses to a message on the mailing list asking for users of Hadoop pipes...
> and in my experience, I have never seen anyone use this.  We should remove it to reduce our
> maintenance burden and build times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
