Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Date: Fri, 10 Jun 2016 01:16:20 +0000 (UTC)
From: "Daniel Templeton (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Message-ID: <JIRA.12976720.1465366356000.47748.1465521380939@Atlassian.JIRA>
In-Reply-To: <JIRA.12976720.1465366356000@Atlassian.JIRA>
References: <JIRA.12976720.1465366356000@Atlassian.JIRA> <JIRA.12976720.1465366356209@arcas>
Subject: [jira] [Commented] (MAPREDUCE-6712) Support grouping values for
 reducer on java-side
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Fri, 10 Jun 2016 01:16:22 -0000


    [ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323690#comment-15323690 ] 

Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

For C++ apps, there's Hadoop Pipes, which more closely models Java MapReduce.  For Python, I strongly recommend taking a look at PySpark.  Hadoop Streaming is not intended to be high performance.  The general argument for using Streaming is that the time spent writing a Java MapReduce job would exceed the time lost by using Streaming.

I don't see any reasonable way to resolve this issue.  If you include all values for a key in a single line, there's a strong chance the reducer will run out of memory trying to read it.  The only way I can see it working is with typedbytes, or with regular strings using some unambiguous value separator.  You'd have to require that the reducer read the list of values one at a time rather than reading the entire line.  That seems like a pretty strict requirement and not something we'd want to enable in the general platform, especially when there is a clear and well-tested workaround: Java MapReduce.
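
To make that requirement concrete, here is a rough Python sketch of a reducer that never holds a whole line in memory.  It assumes a purely hypothetical grouped format that Streaming does not produce today: one line per key, laid out as key, TAB, then values joined by the 0x01 character, with integer values.  The names {{tokens}} and {{sum_per_key}} are made up for illustration.

```python
import io
import sys


def tokens(stream, key_sep="\t", value_sep="\x01", chunk_size=4096):
    """Yield ("key", text) and ("value", text) events from a grouped stream.

    Assumed (hypothetical) format: key<TAB>v1<0x01>v2<0x01>...<newline>.
    Only one bounded chunk plus the token being built is held at a time.
    """
    buf = []
    in_key = True
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for ch in chunk:
            if in_key and ch == key_sep:
                yield "key", "".join(buf)
                buf = []
                in_key = False
            elif not in_key and ch == value_sep:
                yield "value", "".join(buf)
                buf = []
            elif ch == "\n":
                if not in_key:
                    yield "value", "".join(buf)
                elif buf:
                    # A line with a key but no values.
                    yield "key", "".join(buf)
                buf = []
                in_key = True
            else:
                buf.append(ch)
    if buf:
        # Final token when the stream does not end with a newline.
        yield ("key" if in_key else "value"), "".join(buf)


def sum_per_key(stream, chunk_size=4096):
    """Sum integer values per key, keeping only a running total."""
    key, total = None, 0
    for kind, text in tokens(stream, chunk_size=chunk_size):
        if kind == "key":
            if key is not None:
                yield key, total
            key, total = text, 0
        else:
            total += int(text)
    if key is not None:
        yield key, total


if __name__ == "__main__":
    for k, v in sum_per_key(sys.stdin):
        print(f"{k}\t{v}")
```

The sketch only works if the separator can never appear inside a value, and every reducer would have to be written in this token-at-a-time style, which is the strict requirement described above.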

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
>                 Key: MAPREDUCE-6712
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: He Tianyi
>            Priority: Minor
>
> In Hadoop Streaming, with TextInputWriter, the reducer program receives lines from {{stdin}}, each representing a (k, v) tuple, in which values with an identical key are not grouped.
> This brings some inefficiency, especially for interpreter-based runtimes (e.g. CPython), coming from:
> A. the user program has to compare each key with the previous one (but on the Java side, records already come to the reducer in groups),
> B. the user program has to perform {{read}}, then {{find}} or {{split}}, on each record, even if there are multiple values with an identical key,
> C. if the key is long, this apparently introduces inefficiency for caching.
> Presumably we need another InputWriter. But this is not enough, since the {{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not {{writeValues}}. We could compare keys in a custom InputWriter and group them, but this is also inefficient. Some other changes are also needed.
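
For reference, the per-record work described in points A and B above can be sketched in Python roughly as follows.  This is a minimal illustration, assuming tab-separated key/value lines and integer values; {{parse}} and {{reduce_stream}} are hypothetical names, not part of Hadoop Streaming.

```python
import sys
from itertools import groupby


def parse(line, sep="\t"):
    """Split one 'key<sep>value' line; this runs once per record (point B)."""
    key, _, value = line.rstrip("\n").partition(sep)
    return key, value


def reduce_stream(lines, sep="\t"):
    """Yield (key, sum of values) for input lines already sorted by key.

    groupby compares each key with the previous one (point A), repeating
    in the user program the grouping that the Java side already did.
    """
    records = (parse(line, sep) for line in lines)
    for key, group in groupby(records, key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)


if __name__ == "__main__":
    for k, total in reduce_stream(sys.stdin):
        print(f"{k}\t{total}")
```

Note that every line repeats the full key, so a long key is re-read and re-compared for each value (point C).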

