hadoop-hdfs-user mailing list archives

From Nitin Pawar <nitinpawar...@gmail.com>
Subject Re: AW: How to process only input files containing 100% valid rows
Date Fri, 19 Apr 2013 10:16:36 GMT
Reject the entire file even if a single record is invalid? There has to be
a real, serious reason to take that approach.
If not: in any case, to check that a file contains only valid lines you are
already opening the files and parsing them. Why not then parse and separate
the incorrect lines, as suggested in the previous mails?
That way you get a count of invalid records, and you don't lose the valid
records because of a small number of invalid records in a file.
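
For illustration, here is a minimal sketch of that approach (not from this
thread; the tab-separated validity check, the expected field count and the
"invalid" named output are assumptions): a mapper that forwards valid lines,
writes invalid ones to a side output via MultipleOutputs, and counts them
with a counter.

// Sketch of "parse + separate incorrect lines": valid records go to the
// normal output, invalid records go to a named side output and are counted.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ValidatingMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  // Assumed field count for a valid record; adjust to the real event format.
  private static final int EXPECTED_FIELDS = 10;

  private MultipleOutputs<Text, NullWritable> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (isValid(value.toString())) {
      // Valid records flow to the normal job output.
      context.write(value, NullWritable.get());
    } else {
      // Invalid records go to a separate "invalid" output and are counted,
      // so operations can inspect them without losing the valid records.
      mos.write("invalid", value, NullWritable.get());
      context.getCounter("Quality", "INVALID_RECORDS").increment(1);
    }
  }

  // Placeholder validation; the real rule depends on the event format.
  private boolean isValid(String line) {
    return line.split("\t").length == EXPECTED_FIELDS;
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close();
  }
}

The driver would also have to register the side output, e.g.
MultipleOutputs.addNamedOutput(job, "invalid", TextOutputFormat.class,
Text.class, NullWritable.class), and the invalid-record counter shows up in
the job counters for the operations team to monitor.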
On Apr 19, 2013 3:23 PM, "Matthias Scherer" <matthias.scherer@1und1.de>
wrote:

> I have to add that we have 1-2 billion events per day, split into some
> thousands of files. So pre-reading each file in the InputFormat should be
> avoided.
>
> And yes, we could use MultipleOutputs and write bad records to a separate
> file for each input file. But we (our Operations team) think that there is
> more / better control if we reject whole files containing bad records.
>
> Regards
>
> Matthias
>
