Mailing-List: contact user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: domain of nitinpawar432@gmail.com
 designates 209.85.217.180 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <91223B22862E2F4FBB9965783FF861430D8E335742@EXCHANGE07.webde.local>
References: 
 <91223B22862E2F4FBB9965783FF861430D8E335693@EXCHANGE07.webde.local>
	<CALEj8eOe-kN8+ZSOX+NEvLmipZE=r-2RdSC7qwo0tssPnX13KQ@mail.gmail.com>
	<91223B22862E2F4FBB9965783FF861430D8E335742@EXCHANGE07.webde.local>
Date: Fri, 19 Apr 2013 15:46:36 +0530
Message-ID: 
 <CAORpBsj9jWwx5MOYEhTMU+bE6kNzuC3v9YN_PARp5jpQW639VQ@mail.gmail.com>
Subject: Re: AW: How to process only input files containing 100% valid rows
From: Nitin Pawar <nitinpawar432@gmail.com>
To: user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=047d7b3a824c9d5c8c04dab403e5

--047d7b3a824c9d5c8c04dab403e5
Content-Type: text/plain; charset=ISO-8859-1

Reject the entire file even if a single record is invalid? There has to be
a really serious reason to take this approach.
If there is not: to check that a file contains only valid lines, you are
opening and parsing the file in any case. Why not parse it and separate the
incorrect lines at the same time, as suggested in previous mails?
That way you get a count of the invalid records as well, and you do not
lose the valid records because of a small number of invalid records in a file.
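
A minimal sketch of the parse-and-separate idea, in plain Python rather than
as a MapReduce job, just to show the logic. The validity rule (three non-empty
tab-separated fields), the file names, and the function names are all
assumptions made up for illustration; your real check will differ.

```python
from collections import Counter


def is_valid(line):
    # Hypothetical rule: exactly three tab-separated, non-empty fields.
    fields = line.rstrip("\n").split("\t")
    return len(fields) == 3 and all(fields)


def split_records(records):
    """Split (file_name, line) pairs into valid and invalid rows.

    Returns (valid, invalid, bad_counts), where bad_counts maps each
    file name to the number of invalid rows seen in that file.
    """
    valid, invalid = [], []
    bad_counts = Counter()
    for file_name, line in records:
        if is_valid(line):
            valid.append((file_name, line))
        else:
            invalid.append((file_name, line))
            bad_counts[file_name] += 1
    return valid, invalid, bad_counts


# Made-up sample input: two files, one of them with a broken row.
records = [
    ("events-001.log", "1\tlogin\talice"),
    ("events-001.log", "2\tlogout\talice"),
    ("events-002.log", "3\tlogin\tbob"),
    ("events-002.log", "broken row"),
]
valid, invalid, bad_counts = split_records(records)
print(len(valid), len(invalid), dict(bad_counts))
```

In a real job the same split would probably live in the mapper, with the
invalid rows going to a side output (MultipleOutputs, as you mention) and the
per-file count kept in a counter. If Operations still wants to reject whole
files, the per-file count of bad rows tells you afterwards which files those
are, in a single pass over the data.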
On Apr 19, 2013 3:23 PM, "Matthias Scherer" <matthias.scherer@1und1.de>
wrote:

> I have to add that we have 1-2 Billion of Events per day, split to some
> thousands of files. So pre-reading each file in the InputFormat should be
> avoided.
>
> And yes, we could use MultipleOutputs and write bad files to process each
> input file. But we (our Operations team) think that there is more / better
> control if we reject whole files containing bad records.
>
> Regards
>
> Matthias
>

--047d7b3a824c9d5c8c04dab403e5
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

<p dir=3D"ltr">Reject the entire file even if a single record is invalid? T=
here has to be a really serious reason to take this approach.<br>
If there is not: to check that a file contains only valid lines, you are op=
ening and parsing the file in any case. Why not parse it and separate the i=
ncorrect lines at the same time, as suggested in previous mails?<br>
That way you get a count of the invalid records as well, and you do not los=
e the valid records because of a small number of invalid records in a file.=
</p>
<div class=3D"gmail_quote">On Apr 19, 2013 3:23 PM, &quot;Matthias Scherer&=
quot; &lt;<a href=3D"mailto:matthias.scherer@1und1.de">matthias.scherer@1un=
d1.de</a>&gt; wrote:<br type=3D"attribution"><blockquote class=3D"gmail_quo=
te" style=3D"margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"=
>
<div lang=3D"DE" link=3D"blue" vlink=3D"purple"><div><div><div><p class=3D"=
MsoNormal"><span lang=3D"EN-US" style=3D"font-size:11.0pt;font-family:&quot=
;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">I have to add that we =
have 1-2 Billion of Events per day, split to some thousands of files. So pr=
e-reading each file in the InputFormat should be avoided.<u></u><u></u></sp=
an></p>
<p class=3D"MsoNormal"><span lang=3D"EN-US" style=3D"font-size:11.0pt;font-=
family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d"><u></u>=A0=
<u></u></span></p><p class=3D"MsoNormal"><span lang=3D"EN-US" style=3D"font=
-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#=
1f497d">And yes, we could use MultipleOutputs and write bad files to proces=
s each input file. But we (our Operations team) think that there is more / =
better control if we reject whole files containing bad records.<u></u><u></=
u></span></p>
<p class=3D"MsoNormal"><span lang=3D"EN-US" style=3D"font-size:11.0pt;font-=
family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d"><u></u>=A0=
<u></u></span></p><p class=3D"MsoNormal"><span lang=3D"EN-US" style=3D"font=
-size:11.0pt;font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#=
1f497d">Regards<u></u><u></u></span></p>
<p class=3D"MsoNormal"><span lang=3D"EN-US" style=3D"font-size:11.0pt;font-=
family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:#1f497d">Matthias<u=
></u><u></u></span></p></div></div></div></div></blockquote></div>

--047d7b3a824c9d5c8c04dab403e5--