hadoop-common-user mailing list archives

From Ali Safdar Kureishy <safdar.kurei...@gmail.com>
Subject Re: Reading data output by MapFileOutputFormat
Date Mon, 23 Apr 2012 13:11:03 GMT
Thanks Harsh! This is very helpful.


On Mon, Apr 23, 2012 at 2:08 PM, Harsh J <harsh@cloudera.com> wrote:
> Ali,
> MapFiles are explained at
> http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
> - Please give it a read and it should answer half your questions. In
> short, a MapFile is two files: a raw SequenceFile holding the data and
> an index file built on top of it.
> The reason MR does not provide a MapFileInputFormat is that you don't
> need to use the index file in MR jobs (no lookups for input-driven
> jobs). Hence the SequenceFileInputFormat suffices to read the data (it
> ignores the index file and reads only the sequence file that carries
> the data).
> If you wish to make use of MapFile's index abilities for lookups/etc.,
> use the MapFile.Reader class directly in your implementation.
> On Mon, Apr 23, 2012 at 4:23 PM, Ali Safdar Kureishy
> <safdar.kureishy@gmail.com> wrote:
>> Hi,
>> If I use a *MapFileOutputFormat* to output some data, I see that each
>> reducer's output is a folder ("part-00000", for example), and inside that
>> folder are two files: "data" and "index".
>> However, there is no corresponding MapFileInputFormat, to read back this
>> folder ("part-00000"). Instead, *SequenceFileInputFormat* seems to read the
>> data. So, I have some questions:
>> - does SequenceFileInputFormat actually read *all* the data that was output
>> by MapFileOutputFormat? Or is some relationship data between the data and
>> index files lost in this process that would have been better handled by
>> another InputFormat class? In other words, is SequenceFileInputFormat the
>> right InputFormat to read data written by MapFileOutputFormat?
>> - how is it that SequenceFileInputFormat works to read outputs from
>> *both* MapFileOutputFormat and SequenceFileOutputFormat? That would
>> imply that MapFileOutputFormat and SequenceFileOutputFormat output the
>> same data, OR that SequenceFileInputFormat internally handles both
>> differently. What is the reality?
>> Thanks,
>> Safdar
> --
> Harsh J
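
The data/index split described above can be illustrated with a small conceptual model. This is only a sketch using plain Java collections, not Hadoop's MapFile code: the class name MapFileSketch, its methods, and the indexInterval parameter are all hypothetical, and an in-memory list and TreeMap stand in for the "data" and "index" files. It shows why a sequential reader can ignore the index entirely, while a keyed lookup uses the index to skip ahead before scanning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual model only (hypothetical names); not Hadoop's MapFile implementation.
final class MapFileSketch {
    // Stand-in for the "data" file: key/value records stored in key order.
    private final List<String[]> data = new ArrayList<>();
    // Stand-in for the "index" file: every Nth key mapped to its position in data.
    private final TreeMap<String, Integer> index = new TreeMap<>();
    private final int indexInterval;

    MapFileSketch(int indexInterval) {
        if (indexInterval < 1) {
            throw new IllegalArgumentException("indexInterval must be >= 1");
        }
        this.indexInterval = indexInterval;
    }

    // Appends a record; this sketch requires strictly increasing keys.
    void append(String key, String value) {
        if (!data.isEmpty() && data.get(data.size() - 1)[0].compareTo(key) >= 0) {
            throw new IllegalArgumentException("keys must be appended in increasing order");
        }
        if (data.size() % indexInterval == 0) {
            index.put(key, data.size()); // only a fraction of keys is indexed
        }
        data.add(new String[] {key, value});
    }

    // Sequential read, as an input-driven job would do: the index is never consulted.
    List<String> scanValues() {
        List<String> values = new ArrayList<>();
        for (String[] record : data) {
            values.add(record[1]);
        }
        return values;
    }

    // Keyed lookup: find the nearest indexed key at or before the target, then scan forward.
    String get(String key) {
        Map.Entry<String, Integer> start = index.floorEntry(key);
        if (start == null) {
            return null; // target sorts before the first record
        }
        for (int i = start.getValue(); i < data.size(); i++) {
            int cmp = data.get(i)[0].compareTo(key);
            if (cmp == 0) {
                return data.get(i)[1];
            }
            if (cmp > 0) {
                break; // passed the position where the key would be
            }
        }
        return null;
    }

    int indexSize() {
        return index.size();
    }
}
```

In this model scanValues() corresponds to what SequenceFileInputFormat does with the "data" file, and get() corresponds to the kind of lookup for which the thread recommends using MapFile.Reader directly.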
