hadoop-common-user mailing list archives

From java8964 <java8...@hotmail.com>
Subject RE: What if file format is dependent upon first few lines?
Date Thu, 27 Feb 2014 14:17:08 GMT
If the file is big enough that you want to split it for parallel processing, then one
option is this: in your mapper, get the full file path from the InputSplit, open that file
yourself (with the file path you can read from the beginning), read the first 4 lines, and
process the current split based on their content.
I believe files in HDFS support concurrent reads without any problem.
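A minimal sketch of the header-reading step, assuming the directive lines always start with `#` and sit at the very top of the file. `HeaderPeek` is a hypothetical helper, not a Hadoop class. In a mapper's `setup()` the `Reader` would presumably wrap the stream returned by `FileSystem.open(path)`, with `path` taken from `((FileSplit) context.getInputSplit()).getPath()`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop.
final class HeaderPeek {
    // Reads the leading '#' directive lines from the start of a file and
    // stops at the first line that is not a directive (or at end of input).
    static List<String> readHeader(Reader startOfFile) {
        List<String> header = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(startOfFile)) {
            String line;
            while ((line = in.readLine()) != null && line.startsWith("#")) {
                header.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return header;
    }
}
```

Each mapper would call this once before processing its records. The mapper whose split starts at offset 0 also receives the header lines as ordinary records, so it should skip lines beginning with `#`.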
Yong

Date: Thu, 27 Feb 2014 17:59:38 +0800
Subject: What if file format is dependent upon first few lines?
From: raofengyun@gmail.com
To: user@hadoop.apache.org

Below is a fake sample of Microsoft IIS log:
#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200 0 0 390
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200 0 0 390
...
The first four lines describe the file format, which is required to parse each log line. This
means the log file can NOT simply be split; otherwise the second split would lose the "file
format" information.
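
To illustrate the dependency, here is a rough sketch that parses a data line using the names from the `#Fields:` directive. `IisLine` and its methods are hypothetical names made up for this example. It assumes fields are separated by whitespace and contain no embedded spaces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only.
final class IisLine {
    // "#Fields: date time s-ip ..." -> ["date", "time", "s-ip", ...]
    static String[] fieldNames(String fieldsDirective) {
        String prefix = "#Fields:";
        if (!fieldsDirective.startsWith(prefix)) {
            throw new IllegalArgumentException("not a #Fields directive: " + fieldsDirective);
        }
        return fieldsDirective.substring(prefix.length()).trim().split("\\s+");
    }

    // Pairs each whitespace-separated value with the field name at the same position.
    static Map<String, String> parse(String[] names, String line) {
        String[] values = line.trim().split("\\s+");
        if (values.length != names.length) {
            throw new IllegalArgumentException(
                "expected " + names.length + " fields, got " + values.length);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            record.put(names[i], values[i]);
        }
        return record;
    }
}
```

Without the `#Fields:` line there is no way to know which column is `c-ip` or `time-taken`, which is why a mapper working on a later split needs the header as well.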

How could each mapper get the first few lines in the file?