hadoop-common-user mailing list archives

From Stuart White <stuart.whi...@gmail.com>
Subject Release batched-up output records at end-of-job?
Date Tue, 17 Mar 2009 12:12:50 GMT
I have a mapred job that simply performs data transformations in its
Mapper.  I don't need sorting or reduction, so I don't use a Reducer.

Without getting too detailed, the nature of my processing is such that
it is much more efficient if I can process blocks of records at a
time.  So, what I'd like to do is, in my Mapper's map() function,
simply add each incoming record to a list; once that list reaches a
certain size, process the batched-up records and then call
output.collect() multiple times to release the output records, one
per input record.
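The batching pattern I have in mind looks roughly like this, sketched independently of the Hadoop API (the class and method names below are illustrative stand-ins, not Hadoop types; processBlock plays the role of the per-block work plus the output.collect() calls):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Accumulates records and processes them a block at a time.
// flush() handles any final partial block at end-of-job.
class RecordBatcher {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private final Consumer<List<String>> processBlock; // stand-in for block processing + output.collect()

    RecordBatcher(int batchSize, Consumer<List<String>> processBlock) {
        this.batchSize = batchSize;
        this.processBlock = processBlock;
    }

    // Called once per input record, analogous to Mapper#map().
    void add(String record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Would need to be called at end-of-job (the crux of my question)
    // to release the remaining partial block, regardless of its size.
    void flush() {
        if (!buffer.isEmpty()) {
            processBlock.accept(new ArrayList<>(buffer));
            buffer.clear();
        }
    }
}
```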

At the end of the job, my Mappers will have partially full blocks of
records.  I'd like to go ahead and process these blocks at end-of-job,
regardless of their sizes, and release the corresponding output
records.
How can I accomplish this?  In my Mapper#map(), I have no way of
knowing whether a record is the final record.  The only end-of-job
hook I'm aware of is for my Mapper to override
MapReduceBase#close(), but inside that method there is no
OutputCollector available.

Is it possible to batch-up records, and at end-of-job, process and
release any final partial blocks?
