Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of java8964@hotmail.com
 designates 65.54.61.101 as permitted sender)
Message-ID: <SNT149-W60D1606D31178C9CDD6D04D0830@phx.gbl>
Content-Type: multipart/alternative;
	boundary="_5a94bbf3-5095-411f-babc-3daf59e1c406_"
From: java8964 <java8964@hotmail.com>
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Subject: RE: What if file format is dependent upon first few lines?
Date: Thu, 27 Feb 2014 09:17:08 -0500
Importance: Normal
In-Reply-To: 
 <CAGSyEuC+=Dc5jfP=7ZdL2b=yp+s5vpkEw9X8-bCZDCNSuFKwyw@mail.gmail.com>
References: 
 <CAGSyEuC+=Dc5jfP=7ZdL2b=yp+s5vpkEw9X8-bCZDCNSuFKwyw@mail.gmail.com>
MIME-Version: 1.0

--_5a94bbf3-5095-411f-babc-3daf59e1c406_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

If the file is big enough and you want to split it for parallel
processing, one option is this: in your mapper, you can always get the
full file path from the InputSplit. Open the file (having the path means
you can read it from the beginning), read the first 4 lines, and process
the current split based on their content.

I believe files in HDFS support concurrent reads without any problem.

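A minimal sketch of that idea, assuming the header fits in the leading
'#' lines and records are space-separated. The names IisHeaderSketch,
readFieldNames and parseRecord are hypothetical, not from any library.
The Hadoop calls appear only in comments, so the sketch runs without a
cluster; in a real job the reader would be opened in the mapper's setup().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only.
class IisHeaderSketch {

    /**
     * Reads the leading '#' directive lines from the start of the file and
     * returns the column names listed after "#Fields:" (empty if absent).
     */
    static List<String> readFieldNames(BufferedReader fromFileStart) {
        try {
            String line;
            while ((line = fromFileStart.readLine()) != null && line.startsWith("#")) {
                if (line.startsWith("#Fields:")) {
                    String names = line.substring("#Fields:".length()).trim();
                    if (names.isEmpty()) {
                        return new ArrayList<>();
                    }
                    return Arrays.asList(names.split("\\s+"));
                }
            }
            return new ArrayList<>();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Pairs each space-separated value of a record line with its column name. */
    static Map<String, String> parseRecord(List<String> fieldNames, String record) {
        String[] values = record.trim().split("\\s+");
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.size() && i < values.length; i++) {
            row.put(fieldNames.get(i), values[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        // In a Hadoop mapper's setup(), the reader would instead come from the
        // file that the current split belongs to, roughly:
        //   Path path = ((FileSplit) context.getInputSplit()).getPath();
        //   FileSystem fs = path.getFileSystem(context.getConfiguration());
        //   new BufferedReader(new InputStreamReader(fs.open(path)))
        String header =
            "#Software: Microsoft Internet Information Services 7.5\n"
            + "#Version: 1.0\n"
            + "#Date: 2013-07-04 20:00:00\n"
            + "#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port "
            + "cs-username c-ip cs(User-Agent) sc-status sc-substatus "
            + "sc-win32-status time-taken\n";
        List<String> fields = readFieldNames(new BufferedReader(new StringReader(header)));

        // A record as a later split would see it, with no header in sight.
        String record =
            "2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390";
        System.out.println(parseRecord(fields, record));
    }
}
```

Each mapper would do the header read once in setup() and keep the field
list for every map() call. The mapper handling the first split will also
see the '#' lines as ordinary input, so map() should skip lines that
start with '#'.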
Yong

Date: Thu, 27 Feb 2014 17:59:38 +0800
Subject: What if file format is dependent upon first few lines?
From: raofengyun@gmail.com
To: user@hadoop.apache.org

Below is a fake sample of a Microsoft IIS log:

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-07-04 20:00:00
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 2.2.2.2 someuserAgent 200 0 0 390
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 3.3.3.3 someuserAgent 200 0 0 390
2013-07-04 20:00:00 1.1.1.1 GET /test.gif xxx 80 - 4.4.4.4 someuserAgent 200 0 0 390
...

The first four lines describe the file format, which is required to parse
each log line. This means the log file cannot simply be split; otherwise
the second split would lose the "file format" information.

How could each mapper get the first few lines of the file?

--_5a94bbf3-5095-411f-babc-3daf59e1c406_--