hadoop-mapreduce-user mailing list archives

From Wellington Chevreuil <wellington.chevre...@gmail.com>
Subject Re: How to process only input files containing 100% valid rows
Date Fri, 19 Apr 2013 10:35:19 GMT
How about using a combiner to mark all rows from a dirty file as dirty, for
instance by putting a "dirty" flag as part of the key? Then in the reducer you
can simply ignore these rows and/or output the bad file's name.

It will still have to pass through the whole file, but at least it avoids the
case where you could end up with too many counters...
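The filtering logic behind this suggestion, mark every row of a file as dirty if any row in that file is invalid, then drop all rows from dirty files, can be sketched in plain Java, independently of the Hadoop API. This is only an illustrative simulation of the map/reduce grouping, not a runnable MapReduce job; the validation rule (`isValid`), the class name, and the in-memory `Map` standing in for the shuffle are all assumptions for the example:

```java
import java.util.*;

public class DirtyFileFilter {

    // Hypothetical validation rule: a row is valid if it has exactly
    // three comma-separated fields. Replace with the real check.
    static boolean isValid(String row) {
        return row.split(",", -1).length == 3;
    }

    // Simulates the combiner + reducer idea: first pass marks a whole
    // file dirty if any of its rows fails validation (the "dirty" flag
    // on the key); second pass keeps only rows from clean files.
    static Map<String, List<String>> keepCleanFiles(
            Map<String, List<String>> fileToRows) {
        Set<String> dirtyFiles = new HashSet<>();
        for (Map.Entry<String, List<String>> e : fileToRows.entrySet()) {
            for (String row : e.getValue()) {
                if (!isValid(row)) {
                    dirtyFiles.add(e.getKey()); // mark the whole file dirty
                    break;
                }
            }
        }
        Map<String, List<String>> clean = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : fileToRows.entrySet()) {
            if (!dirtyFiles.contains(e.getKey())) {
                clean.put(e.getKey(), e.getValue());
            }
        }
        return clean;
    }

    public static void main(String[] args) {
        Map<String, List<String>> input = new TreeMap<>();
        input.put("a.csv", Arrays.asList("1,2,3", "4,5,6")); // all rows valid
        input.put("b.csv", Arrays.asList("1,2,3", "bad"));   // one bad row
        System.out.println(keepCleanFiles(input).keySet());  // prints [a.csv]
    }
}
```

In a real job the "dirty" flag would travel inside a composite key (e.g. filename plus a boolean), so the reducer sees all rows of one file together and can reject the file in one place.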


2013/4/19 Matthias Scherer <matthias.scherer@1und1.de>

> I have to add that we have 1-2 billion events per day, split across some
> thousands of files. So pre-reading each file in the InputFormat should be
> avoided.
>
> And yes, we could use MultipleOutputs and write out the bad records while
> processing each input file. But we (our Operations team) think that there is
> more / better control if we reject whole files containing bad records.
>
> Regards
> Matthias
