From: "Daniel Templeton (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Date: Fri, 10 Jun 2016 01:16:20 +0000 (UTC)
Subject: [jira] [Commented] (MAPREDUCE-6712) Support grouping values for reducer on java-side

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323690#comment-15323690 ]

Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

For C++ apps, there's Hadoop Pipes, which more closely
models Java MapReduce. For Python, I strongly recommend taking a look at pyspark.

Hadoop Streaming is not intended to be high performance. The general argument for the use of Streaming is that the time spent writing a Java MapReduce job would be more than the time lost by using Streaming.

I don't see a way to resolve this issue in any reasonable way. If you include all values for a key in a single line, you have a strong chance of running the reducer out of memory trying to read it. The only way I can see it working is in the case of typedbytes or with regular strings using some unambiguous value separator. You'd have to require that the reducer read the list of values one at a time rather than reading the entire line. That seems like a pretty strict requirement and not something we'd want to enable in the general platform, especially when there is a clear and well-tested workaround: Java MapReduce.

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
>                 Key: MAPREDUCE-6712
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: He Tianyi
>            Priority: Minor
>
> In hadoop streaming, with TextInputWriter, the reducer program receives each line representing a (k, v) tuple from {{stdin}}, in which values with an identical key are not grouped.
> This brings some inefficiency, especially for interpreter-based runtimes (e.g. cpython), coming from:
> A. the user program has to compare each key with the previous one (but on the java side, records already come to the reducer in groups),
> B. the user program has to perform {{read}}, then {{find}} or {{split}} on each record, even if there are multiple values with an identical key,
> C. if the key is long, this apparently introduces inefficiency for caching.
> Suppose we need another InputWriter.
> But this is not enough, since the {{InputWriter}} interface defines {{writeKey}} and {{writeValue}}, not {{writeValues}}. Although we could compare keys in a custom InputWriter and group them, this is also inefficient. Some other changes are also needed.
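
For illustration only, the key comparison described in point A, combined with reading values one at a time as suggested in the comment, might look roughly like this on the user side of a Streaming reducer. This is a minimal sketch, not part of Hadoop: the function names ({{parse_line}}, {{grouped}}, {{count_per_key}}, {{main}}) are hypothetical, and it assumes tab-separated text records that arrive sorted by key.

```python
import sys
from itertools import groupby


def parse_line(line, sep="\t"):
    """Split one text record into (key, value); value is '' if sep is absent."""
    key, _, value = line.rstrip("\n").partition(sep)
    return key, value


def grouped(lines, sep="\t"):
    """Yield (key, values) for each run of consecutive records with equal keys.

    Assumes records arrive sorted by key, so equal keys are adjacent.
    `values` is a lazy iterator: one value is read at a time, and the
    full list of values for a key is never held in memory.
    """
    records = (parse_line(line, sep) for line in lines)
    for key, group in groupby(records, key=lambda kv: kv[0]):
        yield key, (value for _, value in group)


def count_per_key(lines, sep="\t"):
    """Example reduce step (hypothetical): count the values seen for each key."""
    for key, values in grouped(lines, sep):
        yield f"{key}{sep}{sum(1 for _ in values)}"


def main():
    # In a real reducer script, call main() at the bottom of the file.
    for out_line in count_per_key(sys.stdin):
        print(out_line)
```

The per-record {{split}} from point B is still paid here; the sketch only hides the key comparison behind an iterator and does not remove the cost the issue describes.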