hadoop-mapreduce-user mailing list archives

From Shrijeet Paliwal <shrij...@rocketfuel.com>
Subject Re: M/R file gather/scatter issue
Date Wed, 08 Dec 2010 00:02:01 GMT
Hmm, how about modifying the key that you emit from the mapper to
include some *additional* information (like the source filename) that
tells the reducer where each record came from?
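
A minimal sketch of that idea, written in plain Java with no Hadoop dependency so it runs on its own. Everything named here (TaggedKeyDemo, TaggedKey, tag, partition, scatter) is hypothetical, as are the String keys and the "<type>.seq" file naming. In a real job the key class would presumably implement WritableComparable and the partitioning logic would live in a custom Partitioner; how the mapper learns the file name from a CombineFileSplit is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

public class TaggedKeyDemo {

    /** Composite key: the original record key plus a tag naming the source file type. */
    static final class TaggedKey implements Comparable<TaggedKey> {
        final String key;        // original key (assumed to be a String here)
        final String sourceType; // tag derived from the source file name

        TaggedKey(String key, String sourceType) {
            this.key = key;
            this.sourceType = sourceType;
        }

        /** Sort by original key first, then by source type. */
        @Override
        public int compareTo(TaggedKey other) {
            int c = key.compareTo(other.key);
            return c != 0 ? c : sourceType.compareTo(other.sourceType);
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof TaggedKey)) return false;
            TaggedKey t = (TaggedKey) o;
            return key.equals(t.key) && sourceType.equals(t.sourceType);
        }

        @Override
        public int hashCode() {
            return Objects.hash(key, sourceType);
        }

        @Override
        public String toString() {
            return key + "@" + sourceType;
        }
    }

    /** Mapper side: build the composite key from the record key and its file name. */
    static TaggedKey tag(String key, String fileName) {
        // Assumption: files are named "<type>.seq"; strip the extension to get the type.
        int dot = fileName.lastIndexOf('.');
        String type = dot > 0 ? fileName.substring(0, dot) : fileName;
        return new TaggedKey(key, type);
    }

    /** Partition on the original key only, so the tag does not change which reducer is used. */
    static int partition(TaggedKey k, int numReducers) {
        return (k.key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Reducer side: route each record to an output bucket chosen by its tag. */
    static Map<String, List<String>> scatter(List<TaggedKey> sortedKeys) {
        Map<String, List<String>> outputs = new TreeMap<>();
        for (TaggedKey k : sortedKeys) {
            outputs.computeIfAbsent(k.sourceType, t -> new ArrayList<>()).add(k.key);
        }
        return outputs;
    }

    public static void main(String[] args) {
        List<TaggedKey> keys = new ArrayList<>(Arrays.asList(
                tag("user42", "views.seq"),
                tag("user17", "clicks.seq"),
                tag("user42", "clicks.seq")));
        Collections.sort(keys); // stands in for the shuffle/sort phase
        System.out.println(scatter(keys));
    }
}
```

Because the partition function ignores the tag, records with the same original key still land on the same reducer whichever file they came from, and the reducer can read the tag to decide how to deserialize the value and which output to write to.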


On Tue, Dec 7, 2010 at 3:43 PM, David Rosenstrauch <darose@darose.net> wrote:
> Having an issue with some SequenceFiles that I generated, and I'm trying to
> write an M/R job to fix them.
> Situation is roughly this:
> I have a bunch of directories in HDFS, each of which contains a set of 7
> sequence files.  Each sequence file is of a different "type", but the key
> type is the same across all of the sequence files.  The value types - which
> are compressed - are also the same when in compressed form (i.e.,
> BytesWritable), though the different record types are obviously different
> when uncompressed.
> I want to write a job to fix some problems in the files.  My thinking is
> that I can feed all the data from all the files into a M/R job (i.e.,
> gather), re-sort/partition the data properly, perform some additional
> cleanup/fixup in the reducer, and then write the data back out to a new set
> of files (i.e., scatter).
> Been digging through the APIs, and it looks like CombineFileInputFormat /
> CombineFileRecordReader might be the way to go here.  It'd let me merge the
> whole load of data from each of the (small) files into one M/R job in an
> efficient way.
> Sorting would then occur by key, as would partitioning, so I'm still good so
> far.
> Problem, however, is when I get to the reducer.  The reducer needs to know
> which type of file data (i.e., which type of source file) a record came from
> so that it can a) uncompress/deserialize the data correctly, and b) scatter
> it out to the correct type of output file.
> I'm not entirely clear how to make this happen.  It seems like the source
> file information (which looks like it might exist on the CombineFileSplit)
> is no longer available by the time it gets to the reducer.  And if the
> reducer doesn't know which file a given record came from, it won't know how
> to process it properly.
> Can anyone lend some suggestions on how to code this solution?  Am I on the
> right track with the CombineFileInputFormat / CombineFileRecordReader
> approach?  If so, then how might I make the reducer code aware of the source
> of the record(s) it's currently processing?
> TIA!
> DR
