From: Christoph Schmitz <Christoph.Schmitz@1und1.de>
To: "'mapreduce-user@hadoop.apache.org'" <mapreduce-user@hadoop.apache.org>
Date: Wed, 20 Apr 2011 10:42:10 +0200
Subject: Out-of-band writing from mapper

Hi,

I need to process data in a Java MR job (using Hadoop 0.20.1) such that
the largest part of the data is handled in the mapper only (i.e. a
simple per-record transformation that needs no sort + shuffle), while
some small pieces have to be passed on to the reducer. The mapper-only
part of the data is so large (about six orders of magnitude larger than
the rest) that I want to avoid sorting and shuffling it just to pass it
through an identity reducer.

My question is: is there any mechanism to assist me in writing to some
designated place in the HDFS from the mapper, in a way that is
recognized by the framework (i.e. dealing with aborted tasks,
speculative execution etc.)?

I was thinking along the lines of what is described in the FAQ here:

http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F

The FAQ explains that for reducers, there is support for special
per-task output directories that are recognized by the framework, but
it seems (I tried it out) that this is not supported for mappers.
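
To make clear what behaviour I am after, here is a minimal sketch of
the per-attempt directory idea as I understand it from the FAQ: each
task attempt writes into its own scratch directory below the job output
directory, and only the files of a successful attempt are promoted. This
is not Hadoop API code. It models the idea on the local filesystem with
java.nio.file, the "_temporary/_<attempt id>" layout is my reading of
the FAQ, and the class and method names (TaskAttemptDirSketch,
attemptDir, writeSideFile, commit, abort) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Local-filesystem model of per-task-attempt output directories.
// All names here are hypothetical; nothing in this class is Hadoop API.
final class TaskAttemptDirSketch {

    // Scratch directory private to one task attempt (assumed layout).
    static Path attemptDir(Path outputDir, String attemptId) {
        return outputDir.resolve("_temporary").resolve("_" + attemptId);
    }

    // A task attempt writes its side file into its own scratch directory,
    // so concurrent (speculative) attempts never touch the same path.
    static Path writeSideFile(Path outputDir, String attemptId,
                              String name, String data) throws IOException {
        Path dir = Files.createDirectories(attemptDir(outputDir, attemptId));
        return Files.write(dir.resolve(name), data.getBytes(StandardCharsets.UTF_8));
    }

    // On success, promote every file of the attempt into the output
    // directory, then remove the now empty scratch directory.
    static void commit(Path outputDir, String attemptId) throws IOException {
        List<Path> files;
        try (Stream<Path> s = Files.list(attemptDir(outputDir, attemptId))) {
            files = s.collect(Collectors.toList());
        }
        for (Path f : files) {
            Files.move(f, outputDir.resolve(f.getFileName()));
        }
        abort(outputDir, attemptId);
    }

    // On failure (or for the losing speculative attempt), discard the
    // scratch directory and everything in it.
    static void abort(Path outputDir, String attemptId) throws IOException {
        Path dir = attemptDir(outputDir, attemptId);
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> s = Files.walk(dir)) {
            // Reverse order lists children before their parent directory.
            paths = s.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("job-output");
        // Two speculative attempts of the same map task write the same file name.
        writeSideFile(out, "attempt_m_000000_0", "side-00000", "transformed records\n");
        writeSideFile(out, "attempt_m_000000_1", "side-00000", "transformed records\n");
        commit(out, "attempt_m_000000_0"); // the winner is promoted
        abort(out, "attempt_m_000000_1");  // the loser is discarded
        System.out.println(Files.exists(out.resolve("side-00000")));
    }
}
```

What I am looking for is a way to have the framework perform the
commit/abort step for files written by map tasks, rather than having to
do this bookkeeping myself.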

Thanks and best regards,

Christoph