hadoop-common-user mailing list archives

From Fengyun RAO <raofeng...@gmail.com>
Subject Re: What if file format is dependent upon first few lines?
Date Fri, 28 Feb 2014 13:28:02 GMT
Thanks, Jay, it really helps.

2014-02-28 10:32 GMT+08:00 Jay Vyas <jayunit100@gmail.com>:

> -- method 1 --
>
> You could, I think, just extend FileInputFormat and override isSplitable()
> to return false.  Then each file won't be broken up into separate splits;
> it will be processed as a whole by a single mapper.  This is probably the
> easiest thing to do, but if you have huge files, it won't perform very well.
>
> -- method 2 --
>
> You can use Harsh's suggestion (thanks for that idea, I didn't know it).
>
> 1) In the setup method of a mapper, you can get the file path using
>
> ((FileSplit) context.getInputSplit()).getPath();
>
>
> 2) Then, in the mapper's "setup" method, you should be able to open a file
> input stream and call "seek(0)" to read the file header, as Harsh says.
>
> 3) Once you have processed the header in the setup method, you can store
> the result in a field of the mapper, and the map method can read from that
> field and proceed.
>
>
>
>
> On Thu, Feb 27, 2014 at 9:09 PM, Fengyun RAO <raofengyun@gmail.com> wrote:
>
>> Thanks, Harsh.
>>
>> Could you give more detail, or some links or an example where I
>> can start?
>>
>>
>>
>> 2014-02-27 22:17 GMT+08:00 Harsh J <harsh@cloudera.com>:
>>
>>> A mapper's record reader implementation need not be restricted to
>>> strictly only the input split boundary. It is a loose relationship -
>>> you can always seek(0), read the lines you need to prepare, then
>>> seek(offset) and continue reading.
>>>
>>> Apache Avro (http://avro.apache.org) has a similar format - header
>>> contains the schema a reader needs to work.
>>>
>>> On Thu, Feb 27, 2014 at 1:59 AM, Fengyun RAO <raofengyun@gmail.com>
>>> wrote:
>>> > Below is a fake sample of Microsoft IIS log:
>>> > #Software: Microsoft Internet Information Services 7.5
>>> > #Version: 1.0
>>> > #Date: 2013-07-04 20:00:00
>>> > #Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
>>> > 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390
>>> > 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200 0 0 390
>>> > 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200 0 0 390
>>> > ...
>>> >
>>> > The first four lines describe the file format, which is required to
>>> > parse each log line. This means the log file could NOT simply be split;
>>> > otherwise the second split would lose the "file format" information.
>>> >
>>> > How could each mapper get the first few lines in the file?
>>>
>>>
>>>
>>> --
>>> Harsh J
>>>
>>
>>
>
>
> --
> Jay Vyas
> http://jayunit100.blogspot.com
>
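
The seek(0)-then-seek(offset) idea that Harsh and Jay describe above can be sketched with plain java.io, without a Hadoop dependency. This is only an illustration under stated assumptions: the class and method names (HeaderThenSplit, readHeader, readSplit) are hypothetical, header lines are assumed to start with '#', and the file is assumed to be ASCII. In a real mapper or record reader the corresponding pieces would presumably be the stream returned by FileSystem.open(path), its seek() method, and the start and length reported by the FileSplit.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: shows how a reader for ANY split
// can still recover the header that lives at the start of the file.
final class HeaderThenSplit {

    /** Seeks to byte 0 and collects the leading lines that start with '#'. */
    static List<String> readHeader(RandomAccessFile file) throws IOException {
        List<String> header = new ArrayList<>();
        file.seek(0);
        String line;
        while ((line = file.readLine()) != null && line.startsWith("#")) {
            header.add(line);
        }
        return header;
    }

    /** Returns the data lines whose first byte lies in [start, start + length). */
    static List<String> readSplit(RandomAccessFile file, long start, long length)
            throws IOException {
        List<String> records = new ArrayList<>();
        long end = start + length;
        if (start == 0) {
            file.seek(0);
        } else {
            // Step back one byte and discard through the next newline, so a line
            // cut by the split boundary is left to the previous split.
            file.seek(start - 1);
            file.readLine();
        }
        while (file.getFilePointer() < end) {
            String line = file.readLine();
            if (line == null) {
                break; // end of file
            }
            if (!line.startsWith("#")) { // header lines are not records
                records.add(line);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("iis-sample", ".log");
        String content = String.join("\n",
                "#Version: 1.0",
                "#Fields: date time c-ip",
                "2013-07-04 20:00:00 2.2.2.2",
                "2013-07-04 20:00:00 3.3.3.3",
                "");
        Files.write(tmp, content.getBytes(StandardCharsets.US_ASCII));
        try (RandomAccessFile file = new RandomAccessFile(tmp.toFile(), "r")) {
            long size = file.length();
            long mid = size / 2; // pretend the second split starts mid-file
            System.out.println("header:  " + readHeader(file));
            System.out.println("split 2: " + readSplit(file, mid, size - mid));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

The boundary rule used here (a line belongs to the split that contains its first byte) is one reasonable convention chosen for the sketch; it is not claimed to match Hadoop's own line reader byte for byte.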

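Once a mapper has the header lines, the "#Fields:" directive from the sample IIS log quoted above is what tells it how to interpret each record. The following is a rough sketch of that step only, assuming fields are separated by whitespace and that values contain no embedded spaces (as in the sample). The class and method names (IisFieldParser, fieldNames, parseRecord) are made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns the "#Fields:" header directive into column names
// and uses them to label the values of one log line.
final class IisFieldParser {
    private static final String FIELDS_PREFIX = "#Fields:";

    /** Finds the "#Fields:" line in the header and returns the field names in order. */
    static List<String> fieldNames(List<String> headerLines) {
        for (String line : headerLines) {
            if (line.startsWith(FIELDS_PREFIX)) {
                String names = line.substring(FIELDS_PREFIX.length()).trim();
                return Arrays.asList(names.split("\\s+"));
            }
        }
        throw new IllegalArgumentException("header has no #Fields directive");
    }

    /** Maps each field name to the value at the same position in the record. */
    static Map<String, String> parseRecord(List<String> names, String record) {
        String[] values = record.trim().split("\\s+");
        if (values.length != names.size()) {
            throw new IllegalArgumentException(
                    "expected " + names.size() + " fields but found " + values.length);
        }
        Map<String, String> row = new LinkedHashMap<>(); // keeps header order
        for (int i = 0; i < values.length; i++) {
            row.put(names.get(i), values[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
                "#Software: Microsoft Internet Information Services 7.5",
                "#Version: 1.0",
                "#Date: 2013-07-04 20:00:00",
                "#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port "
                        + "cs-username c-ip cs(User-Agent) sc-status sc-substatus "
                        + "sc-win32-status time-taken");
        List<String> names = fieldNames(header);
        Map<String, String> row = parseRecord(names,
                "2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 "
                        + "someuserAgent 200 0 0 390");
        System.out.println(row);
    }
}
```

In a mapper, the list returned by fieldNames would be computed once in setup and kept in a field, so that map can call parseRecord for every line of its split.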