hadoop-mapreduce-issues mailing list archives

From "Daniel Templeton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side
Date Fri, 10 Jun 2016 01:16:20 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323690#comment-15323690 ]

Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

For C++ apps, there's Hadoop Pipes, which more closely models Java MapReduce.  For Python,
I strongly recommend taking a look at PySpark.  Hadoop Streaming is not intended to be high
performance.  The general argument for using Streaming is that the time spent writing
a Java MapReduce job would be more than the time lost by using Streaming.

I don't see a way to resolve this issue in any reasonable way.  If you include all values
for a key in a single line, you have a strong chance of running the reducer out of memory
trying to read it.  The only way I can see it working is in the case of typedbytes or with
regular strings using some unambiguous value separator. You'd have to require that the reducer
read the list of values one at a time rather than reading the entire line.  That seems like
a pretty strict requirement and not something we'd want to enable in the general platform,
especially when there is a clear and well-tested workaround: Java MapReduce.
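
To make that requirement concrete, here is a minimal Python sketch of a reducer that reads
grouped values one at a time instead of reading the entire line.  The record layout is purely
an assumption for illustration (key, TAB, values separated by the byte 0x01, newline); Hadoop
Streaming does not emit this format, and the names read_until, iter_values and sum_per_key
are hypothetical.

```python
import io

# Assumed, hypothetical record layout: key <TAB> v1 <0x01> v2 <0x01> ... vN <NEWLINE>
KEY_SEP = "\t"
VALUE_SEP = "\x01"
RECORD_END = "\n"


def read_until(stream, stops):
    """Read characters until one of `stops` or EOF.

    Returns (text, stop_char); stop_char is "" at EOF.
    """
    chars = []
    while True:
        ch = stream.read(1)
        if ch == "" or ch in stops:
            return "".join(chars), ch
        chars.append(ch)


def iter_values(stream):
    """Yield the values of the current record one at a time.

    Only one value is held in memory, never the whole line.
    """
    while True:
        value, stop = read_until(stream, (VALUE_SEP, RECORD_END))
        yield value
        if stop != VALUE_SEP:  # end of record or EOF
            return


def sum_per_key(stream):
    """Example reducer: sum the integer values of each key."""
    results = []
    while True:
        key, stop = read_until(stream, (KEY_SEP,))
        if stop == "":  # EOF, no further records
            return results
        results.append((key, sum(int(v) for v in iter_values(stream))))
```

A real reducer would pass sys.stdin instead of an in-memory stream.  Reading one character
at a time keeps the sketch short but is slow; a buffered reader would be needed in practice.
Note that the sketch only works if the separator can never appear inside a value, which is
the "unambiguous value separator" condition above.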

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
>                 Key: MAPREDUCE-6712
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: He Tianyi
>            Priority: Minor
>
> In Hadoop Streaming, with TextInputWriter, the reducer program receives each line from
> {{stdin}} as a single (k, v) tuple; values with an identical key are not grouped.
> This introduces some inefficiency, especially for interpreter-based runtimes (e.g. CPython),
> coming from:
> A. the user program has to compare each key with the previous one (but on the Java side,
> records already come to the reducer in groups),
> B. the user program has to perform {{read}}, then {{find}} or {{split}} on each record, even
> if there are multiple values with an identical key,
> C. if the key is long, this apparently also introduces inefficiency for caching.
> I suppose we need another InputWriter. But this is not enough, since the {{InputWriter}}
> interface defines {{writeKey}} and {{writeValue}}, not {{writeValues}}. We could compare keys
> in a custom InputWriter and group them, but this is also inefficient. Some other changes are
> also needed.
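
For reference, the per-record work described in points A and B looks roughly like the
following Python sketch of a Streaming reducer.  It assumes the default tab-separated
"key<TAB>value" lines sorted by key, and integer values; the function names parse and
reduce_sums are hypothetical.

```python
import io
from itertools import groupby


def parse(lines):
    """Split each 'key<TAB>value' line; this runs once per record (point B)."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def reduce_sums(lines):
    """Regroup consecutive records that share a key and sum their values.

    groupby compares each key with the previous one (point A), repeating on the
    script side the grouping that the Java side has already done.
    """
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)
```

In an actual job the reducer would iterate over sys.stdin and print one "key<TAB>sum" line
per group.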



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


