Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of matthias.scherer@1und1.de
 designates 212.227.126.204 as permitted sender)
From: Matthias Scherer <matthias.scherer@1und1.de>
To: "user@hadoop.apache.org" <user@hadoop.apache.org>
Date: Thu, 18 Apr 2013 21:34:54 +0200
Subject: How to process only input files containing 100% valid rows
Thread-Topic: How to process only input files containing 100% valid rows
Thread-Index: Ac48a80eVcra85XuSEautt1oLVi3eg==
Message-ID: 
 <91223B22862E2F4FBB9965783FF861430D8E335693@EXCHANGE07.webde.local>
Accept-Language: de-DE
Content-Language: de-DE
acceptlanguage: de-DE
Content-Type: multipart/alternative;
	boundary="_000_91223B22862E2F4FBB9965783FF861430D8E335693EXCHANGE07web_"
MIME-Version: 1.0

--_000_91223B22862E2F4FBB9965783FF861430D8E335693EXCHANGE07web_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

Hi all,

In my MapReduce job, I would like to process only input files that
consist entirely of valid rows. If a map task processing one input split
of a file detects an invalid row, the whole file should be "marked" as
invalid and not processed at all. That file will then be cleansed by
another process and picked up again as input to the next run of my
MapReduce job.

My first idea was to increment a counter in the mapper whenever it
detects an invalid line, using the name of the file (derived from the
input split) as the counter name. In addition, I would put the input
filename into the map output value (which is already a MapWritable, so
adding the filename is no problem). In the reducer I could then filter
out any rows belonging to files for which a counter was written in the
mapper.
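
To make the intended semantics concrete, here is a minimal plain-Java
sketch of the "mark the file, then filter by filename" idea. It does not
use the Hadoop API at all, and the class name ValidFileFilter, the
isValid rule (exactly three comma-separated fields) and the file names
are assumptions made up purely for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ValidFileFilter {

    // Hypothetical validity rule, for illustration only:
    // a row is valid if it has exactly three comma-separated fields.
    static boolean isValid(String row) {
        return row.split(",", -1).length == 3;
    }

    // "Map" side: collect the names of all files that contain
    // at least one invalid row.
    static Set<String> invalidFiles(Map<String, List<String>> rowsByFile) {
        Set<String> invalid = new TreeSet<>();
        for (Map.Entry<String, List<String>> e : rowsByFile.entrySet()) {
            for (String row : e.getValue()) {
                if (!isValid(row)) {
                    invalid.add(e.getKey());
                    break; // one bad row is enough to mark the file
                }
            }
        }
        return invalid;
    }

    // "Reduce" side: keep only rows whose source file was never marked.
    static List<String> rowsFromValidFiles(
            Map<String, List<String>> rowsByFile) {
        Set<String> invalid = invalidFiles(rowsByFile);
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : rowsByFile.entrySet()) {
            if (!invalid.contains(e.getKey())) {
                kept.addAll(e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("a.csv", Arrays.asList("1,x,y", "2,x,y"));
        input.put("b.csv", Arrays.asList("3,x,y", "broken"));
        System.out.println(invalidFiles(input));
        System.out.println(rowsFromValidFiles(input));
    }
}
```

The sketch has the complete set of invalid file names in hand before it
filters any row. Whether the reducers of a real job can rely on the same
thing is exactly what I am unsure about.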

Each job has several thousand input files, so in the worst case there
could be as many counters marking invalid input files. Is this a
feasible approach? Does the framework guarantee that all counters
written in the mappers are synchronized (visible) in the reducers? And
could this number of counters lead to an OutOfMemoryError (OOME) in the
JobTracker?

Are there better approaches? I could also process the files using a
non-splittable input format. Is there a way to reject the rows already
emitted by a map task processing an input split?

Thanks,
Matthias


--_000_91223B22862E2F4FBB9965783FF861430D8E335693EXCHANGE07web_--