From: Steve Lewis <lordjoe2000@gmail.com>
To: user@hadoop.apache.org
Date: Thu, 18 Apr 2013 14:50:21 -0700
Subject: Re: How to process only input files containing 100% valid rows

With files that small it is much better to write a custom input format which checks the entire file and only passes records from good files. If you need Hadoop, you are probably processing a large number of these files, and an input format could easily read the entire file and handle it if it is as short as a few thousand lines.

On Thu, Apr 18, 2013 at 12:34 PM, Matthias Scherer <matthias.scherer@1und1.de> wrote:

> Hi all,
>
> In my mapreduce job, I would like to process only whole input files
> containing only valid rows. If one map task processing an input split of a
> file detects an invalid row, the whole file should be "marked" as invalid
> and not processed at all. This input file will then be cleansed by another
> process, and taken again as input to the next run of my mapreduce job.
>
> My first idea was to set a counter in the mapper after detecting an
> invalid line, with the name of the file as the counter name (derived from
> the input split).
> Then additionally put the input filename into the map output
> value (which is already a MapWritable, so adding the filename is no
> problem). In the reducer I could then filter out any rows belonging to the
> counters written in the mapper.
>
> Each job has some thousand input files, so in the worst case there could
> be as many counters written to mark invalid input files. Is this a feasible
> approach? Does the framework guarantee that all counters written in the
> mappers are synchronized (visible) in the reducers? And could this number
> of counters lead to an OOME in the jobtracker?
>
> Are there better approaches? I could also process the files using a
> non-splittable input format. Is there a way to reject the already emitted
> rows of the map task processing an input split?
>
> Thanks,
> Matthias

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
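The all-or-nothing gating Steve describes can be sketched in plain Java. In a real job this logic would live inside the RecordReader of a FileInputFormat whose isSplitable() returns false, so a single map task sees the whole file; the Hadoop wiring is omitted here, and the class and method names (WholeFileGate, readIfAllValid) are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

public class WholeFileGate {

    // Read every line of the file; return all of them only if each one
    // passes the validator, otherwise return an empty list so the file
    // contributes no records at all (it can be cleansed and re-run later).
    public static List<String> readIfAllValid(Path file, Predicate<String> isValid)
            throws IOException {
        List<String> lines = Files.readAllLines(file);
        for (String line : lines) {
            if (!isValid.test(line)) {
                // one bad row invalidates the whole file
                return Collections.emptyList();
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path good = Files.createTempFile("good", ".csv");
        Files.write(good, List.of("a,1", "b,2"));
        Path bad = Files.createTempFile("bad", ".csv");
        Files.write(bad, List.of("a,1", "not-a-valid-row"));

        // placeholder validator: each row must look like "name,number"
        Predicate<String> isValid = line -> line.matches("\\w+,\\d+");

        System.out.println(readIfAllValid(good, isValid).size()); // 2
        System.out.println(readIfAllValid(bad, isValid).size());  // 0
    }
}
```

Buffering the whole file is only reasonable at the sizes mentioned in the thread (a few thousand lines per file); for larger files a first validation pass followed by a second emitting pass over the stream would avoid holding everything in memory.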