From: Christoph Schmitz
To: mapreduce-user@hadoop.apache.org
Date: Wed, 20 Apr 2011 10:42:10 +0200
Subject: Out-of-band writing from mapper

Hi,

I need to process data in a Java MR job (using Hadoop 0.20.1) such that the largest part of the data is handled in the mapper only (i.e. a simple per-record transformation with no need for sort + shuffle), while some small pieces have to be passed on to the reducer. The mapper-only part of the data is so large (about six orders of magnitude larger than the rest) that I want to avoid sorting and shuffling it just to pass it through an identity reducer.

My question is: is there any mechanism that helps me write to some designated place in HDFS from the mapper, in a way that is recognized by the framework (i.e. one that deals with aborted tasks, speculative execution etc.)?

I was thinking along the lines of what is described in the FAQ here:

http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F

The FAQ explains that for reducers there is support for special per-task output directories that are recognized by the framework, but it seems (I tried it out) that this is not supported for mappers.

Thanks and best regards,
Christoph
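
For illustration, the behaviour asked for above (per-task output that survives only if the task attempt succeeds) can be sketched in plain Java on a local file system. This is a minimal sketch of the general idea, not Hadoop code: the class name TaskAttemptOutput, its methods, the attempt IDs and the use of a "_temporary" subdirectory are all hypothetical names chosen for the example, and the sketch assumes that attempts commit one after another rather than concurrently.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical illustration of per-attempt output directories; not a Hadoop class. */
final class TaskAttemptOutput {
    private final Path outputDir;
    private final Path attemptDir;

    TaskAttemptOutput(Path outputDir, String attemptId) throws IOException {
        this.outputDir = outputDir;
        // Each attempt gets a private scratch directory below the job output.
        this.attemptDir = outputDir.resolve("_temporary").resolve(attemptId);
        Files.createDirectories(attemptDir);
    }

    /** Writes a side file that stays invisible in outputDir until commit(). */
    void write(String fileName, String content) throws IOException {
        Files.write(attemptDir.resolve(fileName), content.getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Promotes this attempt's files into outputDir. If another attempt of the
     * same task (e.g. a speculative one) has already promoted a file with the
     * same name, this attempt is discarded instead and false is returned.
     */
    boolean commit() throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(attemptDir)) {
            files = listing.collect(Collectors.toList());
        }
        for (Path file : files) {
            if (Files.exists(outputDir.resolve(file.getFileName()))) {
                abort();
                return false;
            }
        }
        for (Path file : files) {
            Files.move(file, outputDir.resolve(file.getFileName()));
        }
        Files.delete(attemptDir); // empty after the moves
        return true;
    }

    /** Discards everything a failed or superfluous attempt has written. */
    void abort() throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(attemptDir)) {
            // Reverse order deletes files before the directory that holds them.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("job-output");
        TaskAttemptOutput original = new TaskAttemptOutput(out, "attempt_m_000000_0");
        TaskAttemptOutput speculative = new TaskAttemptOutput(out, "attempt_m_000000_1");
        original.write("part-m-00000", "record\n");
        speculative.write("part-m-00000", "record\n");
        System.out.println(original.commit());    // true
        System.out.println(speculative.commit()); // false
    }
}
```

In this sketch, only files from a committed attempt ever appear in the output directory, so an aborted or duplicate attempt leaves nothing behind. What I am looking for is the equivalent of this bookkeeping done by the framework for map tasks.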