hadoop-mapreduce-user mailing list archives

From Jay Vyas <jayunit...@gmail.com>
Subject Re: What if file format is dependent upon first few lines?
Date Fri, 28 Feb 2014 02:32:58 GMT
-- method 1 --

You could, I think, just extend FileInputFormat and override isSplitable()
to return false.  Then each file won't be broken up into separate splits;
it is processed as a whole by a single mapper.  This is probably the easiest
thing to do, but if you have huge files it won't perform very well, since
nothing inside a file is processed in parallel.

-- method 2 --

You can use Harsh's suggestion (thanks for that idea, I didn't know it).

1) In the mapper's setup method, you can get the file path using:

((FileSplit) context.getInputSplit()).getPath();

2) Then, still in the mapper's "setup" method, you should be able to open a
file input stream on that path and call "seek(0)" to read the file header,
as Harsh says.

3) When you process the header, store the result in a field of the mapper
class (a variable local to setup would be gone by the time map() runs).
The map method can then read that field and proceed.
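
A rough sketch of the seek(0) idea, in case it helps.  It uses only plain
Java on a local file: RandomAccessFile stands in for the stream a mapper
would open on the split's path (in Hadoop I believe that would be the
FSDataInputStream returned by FileSystem.open(path), which also has seek()).
The class and method names are made up for illustration, and the
"skip the first line unless the split starts at 0" rule is my assumption
about how line-based splits are usually handled, not something taken from
this thread.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical local-file stand-in for the mapper logic described above.
final class HeaderAwareSplitReader {

    // Step 2: seek(0), read the leading '#' lines, return the names on "#Fields:".
    static List<String> readFieldNames(RandomAccessFile in) throws IOException {
        in.seek(0);
        String line;
        while ((line = in.readLine()) != null && line.startsWith("#")) {
            if (line.startsWith("#Fields:")) {
                String names = line.substring("#Fields:".length()).trim();
                return Arrays.asList(names.split("\\s+"));
            }
        }
        throw new IOException("no #Fields: line found in the file header");
    }

    // Pair each whitespace-separated value with the field name at the same position.
    static Map<String, String> toRecord(List<String> fields, String line) {
        String[] values = line.trim().split("\\s+");
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < fields.size() && i < values.length; i++) {
            record.put(fields.get(i), values[i]);
        }
        return record;
    }

    // Read the header first, then seek(start) and parse the lines owned by this split.
    static List<Map<String, String>> readSplit(String path, long start, long length)
            throws IOException {
        List<Map<String, String>> records = new ArrayList<>();
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            List<String> fields = readFieldNames(in); // header is outside most splits
            long end = start + length;
            in.seek(start);
            if (start > 0) {
                in.readLine(); // assumed convention: the previous split owns this line
            }
            // A line belongs to this split if it begins at or before the split end.
            while (in.getFilePointer() <= end) {
                String line = in.readLine();
                if (line == null) {
                    break; // end of file
                }
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // header or blank line, not a log record
                }
                records.add(toRecord(fields, line));
            }
        }
        return records;
    }
}
```

With this convention, readSplit(path, 0, n) and readSplit(path, n, size - n)
should between them return every record exactly once for any offset n, and
both can parse their lines because both read the header first.  In a real
mapper the field list would be kept in a member variable set in setup().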




On Thu, Feb 27, 2014 at 9:09 PM, Fengyun RAO <raofengyun@gmail.com> wrote:

> thanks, Harsh.
>
> could you specify more detail, or give some links or an example where I
> can start?
>
>
>
> 2014-02-27 22:17 GMT+08:00 Harsh J <harsh@cloudera.com>:
>
>> A mapper's record reader implementation need not be restricted to
>> strictly only the input split boundary. It is a loose relationship -
>> you can always seek(0), read the lines you need to prepare, then
>> seek(offset) and continue reading.
>>
>> Apache Avro (http://avro.apache.org) has a similar format - header
>> contains the schema a reader needs to work.
>>
>> On Thu, Feb 27, 2014 at 1:59 AM, Fengyun RAO <raofengyun@gmail.com>
>> wrote:
>> > Below is a fake sample of Microsoft IIS log:
>> > #Software: Microsoft Internet Information Services 7.5
>> > #Version: 1.0
>> > #Date: 2013-07-04 20:00:00
>> > #Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
>> > 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390
>> > 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200 0 0 390
>> > 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200 0 0 390
>> > ...
>> >
>> > The first four lines describe the file format, which is required to
>> > parse each log line. This means a log file can NOT simply be split;
>> > otherwise the second split would lose the "file format" information.
>> >
>> > How could each mapper get the first few lines in the file?
>>
>>
>>
>> --
>> Harsh J
>>
>
>


-- 
Jay Vyas
http://jayunit100.blogspot.com
