hadoop-user mailing list archives

From Fengyun RAO <raofeng...@gmail.com>
Subject Re: What if file format is dependent upon first few lines?
Date Fri, 28 Feb 2014 02:09:02 GMT
Thanks, Harsh.

Could you give more detail, or point me to some links or an example I can
start from?



2014-02-27 22:17 GMT+08:00 Harsh J <harsh@cloudera.com>:

> A mapper's record reader implementation need not be restricted to
> the input split boundary. It is a loose relationship -
> you can always seek(0), read the lines you need to prepare, then
> seek(offset) and continue reading.
>
> Apache Avro (http://avro.apache.org) has a similar format - header
> contains the schema a reader needs to work.
>
> On Thu, Feb 27, 2014 at 1:59 AM, Fengyun RAO <raofengyun@gmail.com> wrote:
> > Below is a fake sample of Microsoft IIS log:
> > #Software: Microsoft Internet Information Services 7.5
> > #Version: 1.0
> > #Date: 2013-07-04 20:00:00
> > #Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
> > 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390
> > 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200 0 0 390
> > 2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200 0 0 390
> > ...
> >
> > The first four lines describe the file format, which is required to parse
> > each log line. This means the log file can NOT simply be split; otherwise
> > the second split would lose the "file format" information.
> >
> > How could each mapper get the first few lines in the file?
>
>
>
> --
> Harsh J
>
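
The seek(0) / seek(offset) idea quoted above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not code from
the thread: it uses plain java.io.RandomAccessFile as a stand-in for the
stream a Hadoop record reader would hold, the class name
HeaderAwareSplitReader and its methods are hypothetical, and the rule "a line
belongs to the split in which it starts" is one possible convention. In a real
job the same steps would presumably live in a custom RecordReader.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class: a plain-JDK stand-in for a custom Hadoop RecordReader.
final class HeaderAwareSplitReader {

    // The fake IIS log from the original question (ASCII only).
    static final String SAMPLE = String.join("\n",
        "#Software: Microsoft Internet Information Services 7.5",
        "#Version: 1.0",
        "#Date: 2013-07-04 20:00:00",
        "#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port "
            + "cs-username c-ip cs(User-Agent) sc-status sc-substatus "
            + "sc-win32-status time-taken",
        "2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390",
        "2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200 0 0 390",
        "2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200 0 0 390")
        + "\n";

    // Step 1: seek(0) and read the leading '#' lines to find the field names.
    static List<String> readFields(RandomAccessFile in) throws IOException {
        in.seek(0);
        String line;
        while ((line = in.readLine()) != null && line.startsWith("#")) {
            if (line.startsWith("#Fields:")) {
                String names = line.substring("#Fields:".length()).trim();
                return Arrays.asList(names.split("\\s+"));
            }
        }
        throw new IOException("no #Fields line in header");
    }

    // Step 2: seek to the split and parse every line that STARTS in
    // [start, start + length), using the field names from the header.
    static List<Map<String, String>> readSplit(Path file, long start, long length)
            throws IOException {
        List<Map<String, String>> records = new ArrayList<>();
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            List<String> fields = readFields(in);
            long end = start + length;
            if (start == 0) {
                in.seek(0);
            } else {
                // Back up one byte and discard through the next newline, so a
                // line that began in the previous split is not read twice.
                in.seek(start - 1);
                in.readLine();
            }
            while (in.getFilePointer() < end) {
                String line = in.readLine();
                if (line == null) {
                    break; // end of file
                }
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // header or blank line, not a record
                }
                String[] values = line.trim().split("\\s+");
                Map<String, String> record = new LinkedHashMap<>();
                for (int i = 0; i < Math.min(fields.size(), values.length); i++) {
                    record.put(fields.get(i), values[i]);
                }
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("iis-sample", ".log");
        Files.write(file, SAMPLE.getBytes(StandardCharsets.US_ASCII));
        long size = Files.size(file);
        long mid = size / 2; // arbitrary split point for the demo
        System.out.println("split 1: " + readSplit(file, 0, mid));
        System.out.println("split 2: " + readSplit(file, mid, size - mid));
        Files.delete(file);
    }
}
```

Every split pays for one extra read of the header, which is small compared
with the split itself. Note that RandomAccessFile.readLine treats each byte as
one character, so this sketch is only safe for ASCII logs like the sample.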
