hadoop-mapreduce-user mailing list archives

From MARCOS MEDRADO RUBINELLI <marc...@buscapecompany.com>
Subject Re: AW: How to process only input files containing 100% valid rows
Date Fri, 19 Apr 2013 11:21:09 GMT

As far as I know, there are no guarantees on when counters will be updated during the job.
One thing you can do is to write a metadata file along with your parsed events listing what
files have errors and should be ignored in the next step of your ETL workflow.
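The metadata-file idea can be sketched in plain Java (names like `BadFileRegistry` and `bad-files.txt` are my own, hypothetical choices, not part of any Hadoop API): the job appends the names of rejected input files to a sidecar file, and the next ETL step consults it before processing.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

// Sketch of a sidecar metadata file: one bad input filename per line.
// The next step of the ETL workflow reads it and skips those files entirely.
public class BadFileRegistry {
    static final Path METADATA = Paths.get("bad-files.txt");

    // Record an input file that contained at least one broken record.
    static void markBad(String inputFile) throws IOException {
        Files.write(METADATA, List.of(inputFile),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Called by the next ETL step to decide whether to skip a file.
    static boolean isBad(String inputFile) throws IOException {
        return Files.exists(METADATA)
            && Files.readAllLines(METADATA).contains(inputFile);
    }

    public static void main(String[] args) throws IOException {
        Files.deleteIfExists(METADATA);
        markBad("events-001.log");
        System.out.println(isBad("events-001.log")); // true
        System.out.println(isBad("events-002.log")); // false
    }
}
```

In a real job the registry would live on HDFS rather than the local filesystem, but the contract is the same: the writer lists rejected files, the reader filters its input splits against that list.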

If you really don't want to have "dirty" records mixed in, you can accomplish it using secondary
sort. In a nutshell:
- create a composite key using filename and an enum BROKEN = 0, CLEAN = 1
- create a sorting comparator that ensures BROKEN comes before CLEAN
- create a grouping comparator and a partitioner on filename only, to ensure both BROKEN and
CLEAN are processed by the same reducer
- in the mapper, if you find a broken line, send it with a BROKEN key
- in the reducer, if you get a BROKEN key, write that filename somewhere so you know you will
have to scrub and re-submit it, and ignore both BROKEN and CLEAN records
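The key ordering behind the steps above can be sketched in plain Java (using `Comparable` instead of Hadoop's `WritableComparable`, and a hypothetical `partition` helper standing in for a custom `Partitioner`): the composite key sorts by filename first, then BROKEN before CLEAN, while partitioning looks at the filename only.

```java
import java.util.*;

// Sketch of the composite key: (filename, flag), where BROKEN (0)
// sorts before CLEAN (1) so a reducer sees the poison flag first.
class FileKey implements Comparable<FileKey> {
    static final int BROKEN = 0, CLEAN = 1;
    final String filename;
    final int flag;

    FileKey(String filename, int flag) {
        this.filename = filename;
        this.flag = flag;
    }

    // Sorting comparator: order by filename, then BROKEN before CLEAN.
    public int compareTo(FileKey o) {
        int c = filename.compareTo(o.filename);
        return c != 0 ? c : Integer.compare(flag, o.flag);
    }

    // Grouping/partitioning on filename only, so BROKEN and CLEAN keys
    // for the same file land in the same reduce group.
    static int partition(FileKey k, int numReducers) {
        return (k.filename.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}

public class SecondarySortSketch {
    public static void main(String[] args) {
        List<FileKey> keys = new ArrayList<>(List.of(
            new FileKey("events-001.log", FileKey.CLEAN),
            new FileKey("events-001.log", FileKey.BROKEN),
            new FileKey("events-002.log", FileKey.CLEAN)));
        Collections.sort(keys);
        // After sorting, the BROKEN key for events-001.log comes first,
        // so a reducer can note the bad file and skip all its records.
        System.out.println(keys.get(0).filename + ":" + keys.get(0).flag);
        // prints events-001.log:0
    }
}
```

In an actual MapReduce job the same three pieces would be a `WritableComparable` key class, a `WritableComparator` for grouping, and a `Partitioner` subclass wired in via the job configuration.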


On 19-04-2013 06:39, Matthias Scherer wrote:
I have to add that we have 1-2 billion events per day, split across a few thousand files.
So pre-reading each file in the InputFormat should be avoided.

And yes, we could use MultipleOutputs and write the bad records of each input file to a separate
file. But we (our Operations team) think that there is more / better control if we reject whole
files containing bad records.

