hadoop-mapreduce-user mailing list archives

From David Rosenstrauch <dar...@darose.net>
Subject Re: M/R file gather/scatter issue
Date Wed, 08 Dec 2010 16:25:24 GMT
Bit of a snag here:

Since I'm thinking this app needs to use CombineFileInputFormat (because 
there are lots of small files), this throws a wrench into the plan a bit. 
CombineFileInputFormat creates CombineFileSplits, not FileSplits, and a 
CombineFileSplit only contains the list of all the file paths whose data 
is included in the split - there's no way to identify which file path a 
particular record came from.

Any workaround here?
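
The only workaround I can see so far is to push the tagging down into the
record reader: if I'm reading the javadocs right, CombineFileRecordReader
instantiates one underlying reader per path in the CombineFileSplit and
hands it its index into the split, so the reader at least knows which file
it's reading. A rough, untested sketch of what I have in mind (new-API
classes; the class name and the crude tab-delimited key tagging are just
placeholders, and this would need adjusting for the old mapred API):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader;

/**
 * Reads one file of a CombineFileSplit and remembers which file that was.
 * CombineFileRecordReader instantiates this class reflectively via the
 * (CombineFileSplit, TaskAttemptContext, Integer) constructor, passing
 * the index of "our" file within the split.
 */
public class PathTaggingSequenceFileReader
    extends RecordReader<Text, BytesWritable> {

  private final SequenceFileRecordReader<Writable, BytesWritable> delegate =
      new SequenceFileRecordReader<Writable, BytesWritable>();
  private final Path path;
  private final FileSplit fileSplit;
  private final Text taggedKey = new Text();

  public PathTaggingSequenceFileReader(CombineFileSplit split,
      TaskAttemptContext context, Integer index) {
    // The index tells us which of the split's paths this reader handles.
    this.path = split.getPath(index);
    this.fileSplit = new FileSplit(path, split.getOffset(index),
        split.getLength(index), null);
  }

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    // Ignore the CombineFileSplit passed in here; initialize the delegate
    // on just our single file.
    delegate.initialize(fileSplit, context);
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    return delegate.nextKeyValue();
  }

  @Override
  public Text getCurrentKey() throws IOException, InterruptedException {
    // Prefix the original key with the source file name so downstream
    // code can tell where the record came from.  (Assumes the real key
    // type stringifies sensibly; a composite Writable would be cleaner.)
    taggedKey.set(path.getName() + "\t" + delegate.getCurrentKey().toString());
    return taggedKey;
  }

  @Override
  public BytesWritable getCurrentValue()
      throws IOException, InterruptedException {
    return delegate.getCurrentValue();
  }

  @Override
  public float getProgress() throws IOException, InterruptedException {
    return delegate.getProgress();
  }

  @Override
  public void close() throws IOException {
    delegate.close();
  }
}

The matching input format would just subclass CombineFileInputFormat and
return a CombineFileRecordReader wrapping this reader class. Tagging the
key this way does change partitioning and sort order, though, so I'd
probably also need a custom partitioner (and maybe a grouping comparator)
keyed on the original part of the key, secondary-sort style.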

Thanks,

DR

On 12/07/2010 11:08 PM, David Rosenstrauch wrote:
> Thanks for the suggestion Shrijeet.
>
> Same thought occurred to me on the way home from work after I sent this
> mail. Not sure why, but my brain was kinda locked onto the concept of
> the mapper being a no-op in this situation. Obviously doesn't have to be.
>
> Let me try hacking this together and see how it goes. Thanks again much
> for helping clarify my thinking.
>
> DR
>
> On 12/07/2010 07:02 PM, Shrijeet Paliwal wrote:
>> Hmm, how about modifying the key that you collect in the mapper to
>> include some *additional* information (like the filename) to hint the
>> reducer about each record's origin? (See the sketch below the quoted
>> thread.)
>>
>> -Shrijeet
>>
>> On Tue, Dec 7, 2010 at 3:43 PM, David Rosenstrauch <darose@darose.net> wrote:
>>> Having an issue with some SequenceFiles that I generated, and I'm
>>> trying to write an M/R job to fix them.
>>>
>>> Situation is roughly this:
>>>
>>> I have a bunch of directories in HDFS, each of which contains a set
>>> of 7 sequence files. Each sequence file is of a different "type", but
>>> the key type is the same across all of the sequence files. The value
>>> types - which are compressed - are also the same when in compressed
>>> form (i.e., BytesWritable), though the different record types are
>>> obviously different when uncompressed.
>>>
>>> I want to write a job to fix some problems in the files. My thinking
>>> is that I can feed all the data from all the files into an M/R job
>>> (i.e., gather), re-sort/partition the data properly, perform some
>>> additional cleanup/fixup in the reducer, and then write the data back
>>> out to a new set of files (i.e., scatter).
>>>
>>>
>>> Been digging through the APIs, and it looks like
>>> CombineFileInputFormat / CombineFileRecordReader might be the way to
>>> go here. It'd let me merge the whole load of data from each of the
>>> (small) files into one M/R job in an efficient way.
>>>
>>> Sorting would then occur by key, as would partitioning, so I'm still
>>> good so far.
>>>
>>> The problem, however, is when I get to the reducer. The reducer needs
>>> to know which type of file data (i.e., which type of source file) a
>>> record came from so that it can a) uncompress/deserialize the data
>>> correctly, and b) scatter it out to the correct type of output file.
>>> (A sketch of this scatter step is below the quoted thread.)
>>>
>>> I'm not entirely clear how to make this happen. It seems like the
>>> source file information (which looks like it might exist on the
>>> CombineFileSplit) is no longer available by the time it gets to the
>>> reducer. And if the reducer doesn't know which file a given record
>>> came from, it won't know how to process it properly.
>>>
>>> Can anyone lend some suggestions on how to code this solution? Am I
>>> on the right track with the CombineFileInputFormat /
>>> CombineFileRecordReader approach? If so, then how might I make the
>>> reducer code aware of the source of the record(s) it's currently
>>> processing?
>>>
>>> TIA!
>>>
>>> DR
>>>
>
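
For what it's worth, in the plain-FileSplit world Shrijeet's key-tagging
suggestion might look roughly like the following (untested; I'm guessing
Text for the sequence files' key type, and with CombineFileInputFormat
the file name would have to come from the record reader instead, as in
the sketch near the top of this mail):

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TaggingMapper
    extends Mapper<Text, BytesWritable, Text, BytesWritable> {

  private String fileName;
  private final Text taggedKey = new Text();

  @Override
  protected void setup(Context context) {
    // With ordinary FileSplits the source file is right on the split.
    fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
  }

  @Override
  protected void map(Text key, BytesWritable value, Context context)
      throws IOException, InterruptedException {
    // Prefix the key with the source file name so the reducer can tell
    // the record types apart.
    taggedKey.set(fileName + "\t" + key.toString());
    context.write(taggedKey, value);
  }
}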
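
And on the scatter side of my original question - routing each record
type back out to its own set of files - MultipleOutputs looks like the
natural fit. Another rough, untested sketch (the named-output names and
the file-name-to-output mapping are placeholders, and this assumes a
release that has the new-API MultipleOutputs; older releases have an
org.apache.hadoop.mapred.lib equivalent with a slightly different
interface):

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ScatterReducer
    extends Reducer<Text, BytesWritable, Text, BytesWritable> {

  private MultipleOutputs<Text, BytesWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<Text, BytesWritable>(context);
  }

  @Override
  protected void reduce(Text taggedKey, Iterable<BytesWritable> values,
      Context context) throws IOException, InterruptedException {
    // Split the source-file tag back off the key.
    String[] parts = taggedKey.toString().split("\t", 2);
    String sourceFile = parts[0];
    Text originalKey = new Text(parts[1]);

    for (BytesWritable value : values) {
      // Per-type uncompress/fixup would happen here, driven by sourceFile.
      // Then route the record to the named output for its type.
      mos.write(namedOutputFor(sourceFile), originalKey, value);
    }
  }

  private String namedOutputFor(String sourceFile) {
    // Placeholder: map a source file name onto one of the 7 named outputs
    // registered in the driver with MultipleOutputs.addNamedOutput().
    return sourceFile.replaceAll("[^A-Za-z0-9]", "").toLowerCase();
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close();
  }
}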

