hadoop-mapreduce-user mailing list archives

From Christoph Schmitz <Christoph.Schm...@1und1.de>
Subject Out-of-band writing from mapper
Date Wed, 20 Apr 2011 08:42:10 GMT
Hi,

I need to process data in a Java MR job (using Hadoop 0.20.1) such that the largest part
of the data is handled in the mapper only (i.e. a simple per-record transformation with no
need for sort + shuffle), while some small pieces have to be passed on to the reducer.
The mapper-only part of the data is so large (about six orders of magnitude larger than the
rest) that I want to avoid the cost of sorting and shuffling it just to pass it through an
identity reducer.

My question is: is there any mechanism that helps me write to some designated place in
HDFS from the mapper, in a way that is recognized by the framework (i.e. one that copes
with aborted tasks, speculative execution, etc.)?

I was thinking along the lines of what is described in the FAQ here:

http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F

The FAQ explains that for reducers, there is support for special per-task output directories
that are recognized by the framework, but when I tried it out, this did not seem to be
supported for mappers.
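
To make the requirement concrete, here is a minimal sketch of the per-attempt directory pattern I have in mind. It runs on the local filesystem with plain java.nio and uses no Hadoop API at all. The class name SideFileCommit, the _temporary/<attemptId> layout and the attempt ids are assumptions made up for illustration, not what the framework actually does internally.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Local-filesystem illustration of per-attempt work directories (not Hadoop code). */
public class SideFileCommit {

    /** Hypothetical layout: output/_temporary/attemptId holds uncommitted files. */
    public static Path workDir(Path output, String attemptId) {
        return output.resolve("_temporary").resolve(attemptId);
    }

    /** Each task attempt writes only inside its own work directory. */
    public static Path writeSideFile(Path output, String attemptId, String name, String data)
            throws IOException {
        Path dir = Files.createDirectories(workDir(output, attemptId));
        return Files.write(dir.resolve(name), data.getBytes(StandardCharsets.UTF_8));
    }

    /** Snapshot the directory listing so files can be moved or deleted safely. */
    private static List<Path> listFiles(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.collect(Collectors.toList());
        }
    }

    /** Commit: move the finished files of one attempt into the final output directory. */
    public static void commit(Path output, String attemptId) throws IOException {
        Path dir = workDir(output, attemptId);
        for (Path f : listFiles(dir)) {
            Files.move(f, output.resolve(f.getFileName()), StandardCopyOption.ATOMIC_MOVE);
        }
        Files.delete(dir);
    }

    /** Abort: discard everything a failed or redundant attempt wrote. */
    public static void abort(Path output, String attemptId) throws IOException {
        Path dir = workDir(output, attemptId);
        if (!Files.exists(dir)) {
            return;
        }
        for (Path f : listFiles(dir)) {
            Files.delete(f);
        }
        Files.delete(dir);
    }

    public static void main(String[] args) throws IOException {
        Path output = Files.createTempDirectory("job-output");
        // Two speculative attempts of the same task write the same file name.
        writeSideFile(output, "attempt_0", "part-00000", "from attempt 0");
        writeSideFile(output, "attempt_1", "part-00000", "from attempt 1");
        // Only one attempt is committed; the other is cleaned up.
        commit(output, "attempt_0");
        abort(output, "attempt_1");
        System.out.println(Files.exists(output.resolve("part-00000")));
    }
}
```

The point is that an attempt's files only become visible in the final output directory once that attempt is committed, so a killed or speculative duplicate attempt leaves nothing behind. What I am looking for is a way to get this behaviour from the framework for files written in the mapper, rather than re-implementing it myself.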

Thanks and best regards,

Christoph

