From: Nitin Pawar
To: user@hadoop.apache.org
Date: Fri, 19 Apr 2013 15:46:36 +0530
Subject: Re: AW: How to process only input files containing 100% valid rows

Reject the entire file even if a single record is invalid? There has to be a really serious reason to take this approach.

If not, then in any case you are already opening and parsing the files to check that every line is valid. Why not parse them and separate out the incorrect lines, as suggested in previous mails? That way you also get a count of the invalid records, and you do not lose the valid records because of a small number of invalid records in a file.

On Apr 19, 2013 3:23 PM, "Matthias Scherer" wrote:

> I have to add that we have 1-2 billion events per day, split across some
> thousands of files. So pre-reading each file in the InputFormat should be
> avoided.
>
> And yes, we could use MultipleOutputs and write bad files to process each
> input file. But we (our Operations team) think that there is more / better
> control if we reject whole files containing bad records.
>
> Regards
> Matthias

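The single-pass idea in the reply (parse each file once, set the invalid lines aside, and count them per file) can be sketched roughly as below. This is plain Python, not Hadoop code, and it is only an illustration: `is_valid`, `split_records`, the file names, and the rule of three tab-separated fields are all hypothetical, since the thread does not describe the actual record format.

```python
from collections import Counter


def is_valid(line: str) -> bool:
    """Hypothetical rule: a valid record has exactly three non-empty, tab-separated fields."""
    fields = line.split("\t")
    return len(fields) == 3 and all(fields)


def split_records(files):
    """Parse every file once, separating valid from invalid lines.

    `files` maps a file name to an iterable of lines. Returns the valid
    records, the invalid records (both tagged with their file name), and
    a per-file count of invalid records.
    """
    valid, invalid = [], []
    invalid_counts = Counter()
    for name, lines in files.items():
        for line in lines:
            record = line.rstrip("\n")
            if is_valid(record):
                valid.append((name, record))
            else:
                invalid.append((name, record))
                invalid_counts[name] += 1
    return valid, invalid, invalid_counts


if __name__ == "__main__":
    # Made-up sample input standing in for two event files.
    sample = {
        "events-0001": ["a\tb\tc\n", "d\te\tf\n"],
        "events-0002": ["g\th\ti\n", "broken line\n", "j\tk\tl\n"],
    }
    valid, invalid, invalid_counts = split_records(sample)
    # Files with no invalid lines; a later step could restrict itself to
    # these if whole-file rejection is still required.
    clean_files = [name for name in sample if invalid_counts[name] == 0]
    print(len(valid), len(invalid), dict(invalid_counts), clean_files)
```

In a MapReduce job the same split would presumably be done in the mapper, with the invalid lines routed to a side output through MultipleOutputs (as mentioned in the thread) and the per-file counts kept in job counters. The per-file counts also leave room for the Operations team's preference: files with a non-zero count can still be rejected as a whole afterwards, without a separate pre-read.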