hadoop-user mailing list archives

From Matthias Scherer <matthias.sche...@1und1.de>
Subject How to process only input files containing 100% valid rows
Date Thu, 18 Apr 2013 19:34:54 GMT
Hi all,

In my MapReduce job, I would like to process only input files in which every row is valid.
If a map task processing an input split of a file detects an invalid row, the whole file
should be "marked" as invalid and not processed at all. That file will then be cleansed by
a separate process and taken again as input to the next run of my MapReduce job.

My first idea was to increment a counter in the mapper after detecting an invalid line,
using the name of the file (derived from the input split) as the counter name. Additionally,
I would put the input filename into the map output value (which is already a MapWritable,
so adding the filename is no problem). In the reducer I could then filter out any rows
belonging to the counters written in the mappers.
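
Roughly, the mapper side would look like this (a sketch only; ValidatingMapper, the
isValid() check, the "InvalidFiles" counter group and the Text output key are placeholders,
and I'm assuming the new org.apache.hadoop.mapreduce API):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class ValidatingMapper
        extends Mapper<LongWritable, Text, Text, MapWritable> {

      private static final Text FILENAME_KEY = new Text("filename");

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Derive the name of the file this split belongs to.
        String fileName =
            ((FileSplit) context.getInputSplit()).getPath().getName();

        if (!isValid(value)) {
          // One counter per invalid file, named after the file itself.
          context.getCounter("InvalidFiles", fileName).increment(1);
          return;
        }

        MapWritable out = new MapWritable();
        // ... populate out with the parsed row ...
        out.put(FILENAME_KEY, new Text(fileName));
        context.write(new Text(fileName), out);
      }

      // Placeholder validation; the real check depends on the data format.
      private boolean isValid(Text line) {
        return !line.toString().isEmpty();
      }
    }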

Each job has several thousand input files, so in the worst case there could be as many
counters written to mark invalid input files. Is this a feasible approach? Does the framework
guarantee that all counters written in the mappers are synchronized (i.e., visible) in the
reducers? And could this number of counters lead to an OOME in the JobTracker?

Are there better approaches? I could also process the files using a non-splittable input
format. Is there a way to reject the rows already emitted by a map task processing an input
split?
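
For the non-splittable variant I had something like this in mind (again just a sketch,
assuming the new-API TextInputFormat; WholeFileTextInputFormat is my own name):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class WholeFileTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        // Never split: each map task then sees one complete file, so a
        // single invalid row could cause the task to flag the whole file.
        return false;
      }
    }

It would be wired in via job.setInputFormatClass(WholeFileTextInputFormat.class), at the
cost of losing parallelism within large files.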

Thanks,
Matthias

