From: Steve Lewis <lordjoe2000@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 18 Apr 2013 14:50:21 -0700
Subject: Re: How to process only input files containing 100% valid rows

With files that small it is much better to write a custom input format which checks the entire file and only passes records from good files. If you need Hadoop, you are probably processing a large number of these files, and an input format could easily read the entire file and handle it if it is as short as a few thousand lines.

On Thu, Apr 18, 2013 at 12:34 PM, Matthias Scherer <matthias.scherer@1und1.de> wrote:

> Hi all,
>
> In my mapreduce job, I would like to process only whole input files
> containing only valid rows. If one map task processing an input split of a
> file detects an invalid row, the whole file should be "marked" as invalid
> and not processed at all. This input file will then be cleansed by another
> process, and taken again as input to the next run of my mapreduce job.
>
> My first idea was to set a counter in the mapper after detecting an
> invalid line, with the name of the file as the counter name (derived from
> the input split).
> Then additionally put the input filename into the map output
> value (which is already a MapWritable, so adding the filename is no
> problem). In the reducer I could then filter out any rows belonging to the
> counters written in the mapper.
>
> Each job has some thousand input files, so in the worst case there could
> be as many counters written to mark invalid input files. Is this a feasible
> approach? Does the framework guarantee that all counters written in the
> mappers are synchronized (visible) in the reducers? And could this number
> of counters lead to an OOME in the jobtracker?
>
> Are there better approaches? I could also process the files using a
> non-splittable input format. Is there a way to reject the already emitted
> rows of the map task processing an input split?
>
> Thanks,
> Matthias

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
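The all-or-nothing gating Steve describes can be sketched in plain Java. In a real job this logic would live inside the RecordReader of a FileInputFormat whose isSplitable() returns false, so a single map task sees the whole file; the Hadoop wiring is omitted here, and the class and method names (WholeFileGate, readIfAllValid) are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

public class WholeFileGate {

    // Read every line of the file; return all of them only if each one
    // passes the validator, otherwise return an empty list so the file
    // contributes no records at all (it can be cleansed and re-run later).
    public static List<String> readIfAllValid(Path file, Predicate<String> isValid)
            throws IOException {
        List<String> lines = Files.readAllLines(file);
        for (String line : lines) {
            if (!isValid.test(line)) {
                // one bad row invalidates the whole file
                return Collections.emptyList();
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path good = Files.createTempFile("good", ".csv");
        Files.write(good, List.of("a,1", "b,2"));
        Path bad = Files.createTempFile("bad", ".csv");
        Files.write(bad, List.of("a,1", "not-a-valid-row"));

        // placeholder validator: each row must look like "name,number"
        Predicate<String> isValid = line -> line.matches("\\w+,\\d+");

        System.out.println(readIfAllValid(good, isValid).size()); // 2
        System.out.println(readIfAllValid(bad, isValid).size());  // 0
    }
}
```

Buffering the whole file is only reasonable at the sizes mentioned in the thread (a few thousand lines per file); for larger files a first validation pass followed by a second emitting pass over the stream would avoid holding everything in memory.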