hadoop-mapreduce-user mailing list archives

From Christoph Schmitz <Christoph.Schm...@1und1.de>
Subject Out-of-band writing from mapper
Date Wed, 20 Apr 2011 08:42:10 GMT

I need to process data in a Java MapReduce job (Hadoop 0.20.1) such that the largest part
of the data is handled in the mapper only (a simple per-record transformation that needs
no sort and shuffle), while a few small pieces have to be passed on to the reducer.
The mapper-only part of the data is so large (about six orders of magnitude larger than the
rest) that I want to avoid sorting and shuffling it just to pass it through an identity
reducer.

My question is: is there any mechanism that helps me write to a designated place in HDFS
from the mapper, in a way that is recognized by the framework (i.e. one that copes with
aborted tasks, speculative execution, etc.)?

I was thinking along the lines of what is described in the FAQ here:


The FAQ explains that for reducers, there is support for special per-task output directories
that are recognized by the framework, but it seems (I tried it out) that this is not supported
for mappers.
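
The behaviour in question can be illustrated with a minimal sketch in plain Java. It does not
use any Hadoop API and runs against the local file system; the class name, method names and
directory layout (SideFileSketch, writeSideFile, commit, abort, "_temporary/<attemptId>") are
hypothetical and chosen only for illustration. The idea is that every task attempt writes into
its own private directory, and only a successful attempt has its files promoted to the final
output location, so output from aborted or redundant speculative attempts never becomes visible.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SideFileSketch {

    // Hypothetical layout: <outputDir>/_temporary/<attemptId>/ is private to one attempt.
    public static Path attemptDir(Path outputDir, String attemptId) {
        return outputDir.resolve("_temporary").resolve(attemptId);
    }

    // A task attempt writes its side file into its own directory only.
    public static Path writeSideFile(Path outputDir, String attemptId,
                                     String name, String data) throws IOException {
        Path dir = attemptDir(outputDir, attemptId);
        Files.createDirectories(dir);
        Path file = dir.resolve(name);
        Files.write(file, data.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    // On success, move the attempt's files into the final output directory.
    public static void commit(Path outputDir, String attemptId) throws IOException {
        Path dir = attemptDir(outputDir, attemptId);
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.collect(Collectors.toList());
        }
        for (Path f : files) {
            Files.move(f, outputDir.resolve(f.getFileName()),
                       StandardCopyOption.REPLACE_EXISTING);
        }
        Files.delete(dir);
    }

    // On failure (or for a losing speculative attempt), discard everything it wrote.
    public static void abort(Path outputDir, String attemptId) throws IOException {
        Path dir = attemptDir(outputDir, attemptId);
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order deletes files before the directory that contains them.
            walk.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("job-output");

        // First attempt fails halfway: its partial output is thrown away.
        writeSideFile(out, "attempt_0", "part-m-00000", "partial");
        abort(out, "attempt_0");

        // Second attempt succeeds: its output is promoted.
        writeSideFile(out, "attempt_1", "part-m-00000", "transformed records");
        commit(out, "attempt_1");

        System.out.println(new String(
                Files.readAllBytes(out.resolve("part-m-00000")), StandardCharsets.UTF_8));
    }
}
```

What I am looking for is a way to have the framework do the commit and abort steps for files
written from a mapper, rather than having to implement them by hand as above.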

Thanks and best regards,

