hadoop-hdfs-user mailing list archives

From Fengyun RAO <raofeng...@gmail.com>
Subject What if file format is dependent upon first few lines?
Date Thu, 27 Feb 2014 09:59:38 GMT
Below is a fake sample of a Microsoft IIS log:
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200 0 0 390
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200 0 0 390
...

The first four lines describe the file format, which is required to parse
each log line. This means the log file can NOT simply be split; otherwise
the second split would lose the "file format" information.
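
To make the dependency concrete, here is a minimal Python sketch (plain
Python, not Hadoop code; the function names parse_fields_directive and
parse_log_line are made up for illustration). It assumes values are
separated by whitespace and contain no embedded spaces, as in the sample
above:

```python
def parse_fields_directive(header_lines):
    """Return the column names from the '#Fields:' directive, or None."""
    for line in header_lines:
        if line.startswith("#Fields:"):
            return line[len("#Fields:"):].split()
    return None


def parse_log_line(line, fields):
    """Map one whitespace-separated log line onto the column names."""
    values = line.split()
    if len(values) != len(fields):
        raise ValueError(
            "expected %d values, got %d" % (len(fields), len(values))
        )
    return dict(zip(fields, values))


# The four header lines from the sample above.
header = [
    "#Software: Microsoft Internet Information Services 7.5",
    "#Version: 1.0",
    "#Date: 2013-07-04 20:00:00",
    "#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port "
    "cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status "
    "time-taken",
]

fields = parse_fields_directive(header)
record = parse_log_line(
    "2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 "
    "someuserAgent 200 0 0 390",
    fields,
)
```

A reader that starts in the middle of the file never sees the "#Fields:"
line, so it has no fields list to pass to parse_log_line.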

How could each mapper get the first few lines in the file?
