hadoop-mapreduce-user mailing list archives

From David Rosenstrauch <dar...@darose.net>
Subject Re: M/R file gather/scatter issue
Date Wed, 08 Dec 2010 22:21:44 GMT
Seems like CombineFileInputFormat.createPool() might help here.  But I'm 
a little unclear on usage.  That method is protected ... and so then I 
guess only accessible by subclasses?

Can anyone advise on usage here?
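
A minimal sketch of the subclassing pattern the question points at: a protected method can be called from a subclass, typically from its constructor. The classes below (`PoolingInputFormat`, `TypedInputFormat`, `poolFor`) are hypothetical stand-ins written for illustration, not the Hadoop API; the assumption is that the real `createPool()` is meant to be invoked the same way from a subclass of `CombineFileInputFormat`, with path filters deciding which files belong to which pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Self-contained model of the "protected createPool" pattern.
// Every class here is a hypothetical stand-in, NOT the Hadoop API.
class PoolSketch {

    // Stand-in for the framework base class: createPool is protected,
    // so only subclasses can register pools.
    static abstract class PoolingInputFormat {
        private final List<Predicate<String>> pools = new ArrayList<>();

        protected void createPool(Predicate<String> filter) {
            pools.add(filter);
        }

        // Index of the first pool whose filter accepts the path, or -1 if none does.
        int poolFor(String path) {
            for (int i = 0; i < pools.size(); i++) {
                if (pools.get(i).test(path)) {
                    return i;
                }
            }
            return -1;
        }
    }

    // Subclass that registers one pool per file type from its constructor.
    static class TypedInputFormat extends PoolingInputFormat {
        TypedInputFormat(List<String> typeSuffixes) {
            for (String suffix : typeSuffixes) {
                createPool(path -> path.endsWith(suffix));
            }
        }
    }

    public static void main(String[] args) {
        TypedInputFormat format =
                new TypedInputFormat(List.of("typeA.seq", "typeB.seq"));
        System.out.println(format.poolFor("/data/dir1/typeA.seq"));
        System.out.println(format.poolFor("/data/dir2/typeB.seq"));
        System.out.println(format.poolFor("/data/dir2/other.txt"));
    }
}
```

With one pool per file type, files of different types would never be assigned to the same pool. The file names `typeA.seq` and `typeB.seq` are made up for the example.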



On 12/08/2010 11:25 AM, David Rosenstrauch wrote:
> Bit of a snag here:
> Since I'm thinking this app needs to use CombineFileInputFormat (since
> lots of small files) this throws a wrench into these plans a bit.
> CombineFileInputFormat creates CombineFileSplit's, not FileSplit's. And
> CombineFileSplit only contains a list of all the file paths whose data
> is included in the split, but no way to identify which file path a
> particular record came from.
> Any workaround here?
> Thanks,
> DR
> On 12/07/2010 11:08 PM, David Rosenstrauch wrote:
>> Thanks for the suggestion Shrijeet.
>> Same thought occurred to me on the way home from work after I sent this
>> mail. Not sure why, but my brain was kinda locked onto the concept of
>> the mapper being a no-op in this situation. Obviously doesn't have to be.
>> Let me try hacking this together and see how it goes. Thanks again much
>> for helping clarify my thinking.
>> DR
>> On 12/07/2010 07:02 PM, Shrijeet Paliwal wrote:
>>> ammm, how about modifying the key that you collect in the mapper to
>>> include some *additional* information (like filename) to hint reducer
>>> about records origin?
>>> -Shrijeet
>>> On Tue, Dec 7, 2010 at 3:43 PM, David Rosenstrauch <darose@darose.net> wrote:
>>>> Having an issue with some SequenceFiles that I generated, and I'm
>>>> trying to write a M/R job to fix them.
>>>> Situation is roughly this:
>>>> I have a bunch of directories in HDFS, each of which contains a set
>>>> of 7 sequence files. Each sequence file is of a different "type",
>>>> but the key type is the same across all of the sequence files. The
>>>> value types - which are compressed - are also the same when in
>>>> compressed form (i.e., BytesWritable), though the different record
>>>> types are obviously different when uncompressed.
>>>> I want to write a job to fix some problems in the files. My thinking
>>>> is that I can feed all the data from all the files into a M/R job
>>>> (i.e., gather), re-sort/partition the data properly, perform some
>>>> additional cleanup/fixup in the reducer, and then write the data
>>>> back out to a new set of files (i.e., scatter).
>>>> Been digging through the API's, and it looks like
>>>> CombineFileInputFormat / CombineFileRecordReader might be the way to
>>>> go here. It'd let me merge the whole load of data from each of the
>>>> (small) files into one M/R job in an efficient way.
>>>> Sorting would then occur by key, as would partitioning, so I'm still
>>>> good so far.
>>>> Problem, however, is when I get to the reducer. The reducer needs to
>>>> know which type of file data (i.e., which type of source file) a
>>>> record came from so that it can a) uncompress/deserialize the data
>>>> correctly, and b) scatter it out to the correct type of output file.
>>>> I'm not entirely clear how to make this happen. It seems like the
>>>> source file information (which looks like it might exist on the
>>>> CombineFileSplit) is no longer available by the time it gets to the
>>>> reducer. And if the reducer doesn't know which file a given record
>>>> came from, it won't know how to process it properly.
>>>> Can anyone lend some suggestions on how to code this solution? Am I
>>>> on the right track with the CombineFileInputFormat /
>>>> CombineFileRecordReader approach? If so, then how might I make the
>>>> reducer code aware of the source of the record(s) it's currently
>>>> processing?
>>>> TIA!
>>>> DR
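
The suggestion in the thread (put the source file name into the map output key so the reducer can tell where a record came from) can be sketched in plain Java, without Hadoop on the classpath. Everything below is an illustrative assumption: the `TaggedKeySketch` class, the tab separator, and the use of the file name as the "type" are not from the thread. The sketch also assumes keys and file names contain no tab character. In a real job the source path would have to be supplied by whatever reads each file within the combined split; the thread does not say how.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of tagging a key with its source file type.
// Hypothetical helper names; not part of any Hadoop API.
class TaggedKeySketch {
    static final char SEP = '\t';

    // Use the file name (text after the last '/') as the record "type".
    static String typeOf(String sourcePath) {
        return sourcePath.substring(sourcePath.lastIndexOf('/') + 1);
    }

    // Map side: append the source type to the natural key.
    static String tag(String key, String sourcePath) {
        return key + SEP + typeOf(sourcePath);
    }

    // Reduce side: recover the two parts of the composite key.
    static String naturalKey(String tagged) {
        return tagged.substring(0, tagged.lastIndexOf(SEP));
    }

    static String sourceType(String tagged) {
        return tagged.substring(tagged.lastIndexOf(SEP) + 1);
    }

    // Partition on the natural key only, so every type of record for a
    // given key still lands in the same partition.
    static int partition(String tagged, int numPartitions) {
        return (naturalKey(tagged).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // "Scatter": group natural keys by the type of file they came from.
    static Map<String, List<String>> scatter(List<String> taggedKeys) {
        Map<String, List<String>> byType = new TreeMap<>();
        for (String tagged : taggedKeys) {
            byType.computeIfAbsent(sourceType(tagged), t -> new ArrayList<>())
                  .add(naturalKey(tagged));
        }
        return byType;
    }

    public static void main(String[] args) {
        List<String> tagged = List.of(
                tag("k1", "/data/dir1/typeA.seq"),
                tag("k2", "/data/dir1/typeB.seq"),
                tag("k3", "/data/dir2/typeA.seq"));
        System.out.println(scatter(tagged));
    }
}
```

Partitioning on the natural key alone keeps the original sort/partition plan intact, while the tag travels with each record to tell the reducer how to deserialize it and which output it belongs to.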