hadoop-hdfs-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: What if file format is dependent upon first few lines?
Date Thu, 27 Feb 2014 14:17:20 GMT
A mapper's record reader implementation need not confine itself
strictly to the input split boundaries. The relationship is a loose
one: you can always seek(0), read the header lines you need to
prepare, then seek(offset) to the start of your split and continue
reading from there.
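
As a rough illustration of that seek pattern, here is a minimal sketch
in Python using plain file I/O rather than Hadoop's actual RecordReader
API. The function names (read_header_fields, read_split) and the
start/length arguments are hypothetical; it assumes the header is a run
of '#' lines at the top of the file and that a line belongs to the
split containing its first byte.

```python
import io

def read_header_fields(f):
    """Seek to byte 0 and return the column names from the '#Fields:' line."""
    f.seek(0)
    fields = []
    while True:
        line = f.readline()
        if not line.startswith(b"#"):
            break  # first non-header line (or EOF): header is over
        if line.startswith(b"#Fields:"):
            fields = line[len(b"#Fields:"):].decode("ascii").split()
    return fields

def read_split(f, start, length):
    """Parse the records whose first byte lies in [start, start + length)."""
    fields = read_header_fields(f)   # step 1: seek(0), read the header
    end = start + length
    if start > 0:
        # step 2: seek back to the split. Starting one byte early and
        # discarding through the next newline drops a partial line, but
        # keeps a line that begins exactly at `start`.
        f.seek(start - 1)
        f.readline()
    else:
        f.seek(0)
    records = []
    while f.tell() < end:
        line = f.readline()
        if not line:
            break                    # end of file
        if line.startswith(b"#") or not line.strip():
            continue                 # skip header and blank lines
        values = line.decode("ascii").split()
        records.append(dict(zip(fields, values)))
    return records

if __name__ == "__main__":
    sample = io.BytesIO(
        b"#Version: 1.0\n"
        b"#Fields: date time c-ip sc-status\n"
        b"2013-07-04 20:00:00 2.2.2.2 200\n"
        b"2013-07-04 20:00:00 3.3.3.3 200\n"
    )
    # A "second split" that starts in the middle of the file still
    # sees the field names, because the header is re-read from byte 0.
    print(read_split(sample, 60, 100))
```

In a real job the same idea would live inside a custom record reader's
initialization step, with the file opened through the Hadoop FileSystem
API instead of a local file object.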

Apache Avro (http://avro.apache.org) uses a similar layout: the file
header contains the schema a reader needs in order to work.

On Thu, Feb 27, 2014 at 1:59 AM, Fengyun RAO <raofengyun@gmail.com> wrote:
> Below is a fake sample of a Microsoft IIS log:
> #Software: Microsoft Internet Information Services 7.5
> #Version: 1.0
> #Date: 2013-07-04 20:00:00
> #Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port
> cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status
> time-taken
> 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200
> 0 0 390
> 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200
> 0 0 390
> 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200
> 0 0 390
> ...
>
> The first four lines describe the file format, which is needed to parse each
> log line. This means the log file could NOT simply be split; otherwise the
> second split would lose the "file format" information.
>
> How could each mapper get the first few lines in the file?



-- 
Harsh J
