hadoop-mapreduce-user mailing list archives

From "Berry, Matt" <mwbe...@amazon.com>
Subject OutputFormat Theory Question
Date Thu, 19 Jul 2012 16:22:12 GMT
From what I gather about how MapReduce operates, there isn't really any functional difference
between a single OutputFormat object being initialized on a central node and each reducer
task initializing its own OutputFormat object. What I would like to know, however, is what ordering
relationship holds between the records that are passed to the OutputFormat from the reducers. Take
the case of a sorting MapReduce job, where the mapper and reducer are both identity functions. In this
setup, I would expect the records passed to the OutputFormat from the reducer to be
sorted and to arrive in order.

A simplified version of my use case is to sort a large number of records, and then write all
the ones that start with A to a file named A, those that start with B to B, etc. Because each file
can only be opened for writing once, it is very important in this use case to know whether the
records arrive at the OutputFormat in order, so that I know it is safe to close file A when I encounter
a record that belongs in B.
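
To make the constraint concrete, here is a minimal sketch of the writing logic I have in mind, in plain Java with no Hadoop dependency. The class name LetterBucketWriter and its methods are hypothetical, the "files" are in-memory lists standing in for real write-once files, and records are assumed to be non-empty strings. It only illustrates why the arrival order matters; it is not an actual OutputFormat implementation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the per-letter writing logic; not a Hadoop class.
class LetterBucketWriter {
    // Names of files that have already been closed. Reopening one is an error,
    // mirroring the "each file can only be opened for writing once" constraint.
    private final Set<String> closed = new HashSet<>();
    // In-memory stand-in for the files: file name -> records written to it.
    private final Map<String, List<String>> files = new LinkedHashMap<>();
    // Name of the file currently open, or null if none is open.
    private String current = null;

    // Writes one record to the file named after its first character.
    // Assumes the record is non-empty.
    void write(String record) {
        String name = record.substring(0, 1);
        if (!name.equals(current)) {
            // The first letter changed: close the current file for good.
            if (current != null) {
                closed.add(current);
            }
            if (closed.contains(name)) {
                throw new IllegalStateException(
                    "file " + name + " was already closed; records arrived out of order");
            }
            files.put(name, new ArrayList<>());
            current = name;
        }
        files.get(name).add(record);
    }

    // Closes whichever file is still open at the end of the input.
    void close() {
        if (current != null) {
            closed.add(current);
            current = null;
        }
    }

    Map<String, List<String>> contents() {
        return files;
    }

    public static void main(String[] args) {
        LetterBucketWriter writer = new LetterBucketWriter();
        // Sorted input: each file is opened once and closed once.
        for (String record : new String[] {"Apple", "Avocado", "Banana", "Cherry"}) {
            writer.write(record);
        }
        writer.close();
        System.out.println(writer.contents());
    }
}
```

With sorted input this opens each file exactly once. If an A record showed up after a B record, write would throw, which is exactly the situation I need to rule out.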

Matthew Berry
