From: Matthias Scherer
To: "user@hadoop.apache.org"
Date: Thu, 18 Apr 2013 21:34:54 +0200
Subject: How to process only input files containing 100% valid rows

Hi all,

In my MapReduce job, I would like to process only input files that consist entirely of valid rows. If a map task processing an input split of a file detects an invalid row, the whole file should be "marked" as invalid and not processed at all. This input file will then be cleansed by another process and taken again as input to the next run of my MapReduce job.

My first idea was to set a counter in the mapper after detecting an invalid line, using the name of the file (derived from the input split) as the counter name. In addition, I would add the input file name to the map output value (which is already a MapWritable, so adding the file name is no problem). In the reducer I could then filter out any rows belonging to files for which a counter was written in the mapper.

Each job has several thousand input files, so in the worst case there could be as many counters written to mark invalid input files. Is this a feasible approach? Does the framework guarantee that all counters written in the mappers are synchronized (visible) in the reducers? And could this number of counters lead to an OOME in the JobTracker?

Are there better approaches? I could also process the files using a non-splittable input format. Is there a way to reject the rows already emitted by the map task processing an input split?

Thanks,
Matthias

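To make the filtering idea concrete, here is a minimal sketch in plain Java that deliberately does not use the Hadoop API. It only illustrates the logic: one step collects the names of files containing at least one invalid row (the role the counters would play), and a second step keeps only rows whose source file was never flagged. The class name `ValidFileFilter`, the method names, and the "exactly three comma-separated fields" validity rule are all hypothetical, and the sketch does not settle whether counters written in mappers are visible in reducers; it simply hands the flagged set to the second step explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the job logic; no Hadoop classes are used.
final class ValidFileFilter {

    // Assumed validity rule for illustration only: exactly three comma-separated fields.
    static boolean isValidRow(String row) {
        return row.split(",", -1).length == 3;
    }

    // "Mapper" role: flag every file that contains at least one invalid row.
    static Set<String> findInvalidFiles(Map<String, List<String>> rowsByFile) {
        Set<String> invalidFiles = new HashSet<>();
        for (Map.Entry<String, List<String>> entry : rowsByFile.entrySet()) {
            for (String row : entry.getValue()) {
                if (!isValidRow(row)) {
                    invalidFiles.add(entry.getKey());
                    break; // one bad row is enough to reject the whole file
                }
            }
        }
        return invalidFiles;
    }

    // "Reducer" role: drop all rows that came from a flagged file.
    static List<String> keepRowsFromValidFiles(Map<String, List<String>> rowsByFile,
                                               Set<String> invalidFiles) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : rowsByFile.entrySet()) {
            if (!invalidFiles.contains(entry.getKey())) {
                kept.addAll(entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> input = new LinkedHashMap<>();
        input.put("a.csv", Arrays.asList("1,2,3", "4,5,6"));
        input.put("b.csv", Arrays.asList("1,2,3", "broken"));

        Set<String> invalidFiles = findInvalidFiles(input);
        System.out.println("Rejected files: " + invalidFiles);
        System.out.println("Rows kept: " + keepRowsFromValidFiles(input, invalidFiles));
    }
}
```

In a real job the set of flagged files would have to reach the filtering step by some mechanism the framework actually guarantees, which is exactly the open question above.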