hadoop-mapreduce-user mailing list archives

From Mingxi Wu <Mingxi...@turn.com>
Subject RE: how to implement error thresholds in a map-reduce job ?
Date Tue, 15 Nov 2011 21:49:19 GMT

JJ,

Two passes are necessary. The first pass just counts how many lines are malformed. It does
no work on the data; it only reads it. After this pass, record the file's status ("good" or
"bad") in a status file.

Before starting the second pass, check the status file. If the input file is marked as good,
go ahead; otherwise, halt.
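
The gating step between the two passes could be sketched roughly as follows. This is only an illustration: it uses a local temp file as a stand-in for the HDFS status file, and the class and method names (FileStatusGate, recordStatus, isMarkedGood) are hypothetical, not part of Hadoop.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: records and checks the "good"/"bad" status of an input file.
// A local file stands in for the HDFS status file mentioned above.
class FileStatusGate {

    // Called after the first (counting) pass to record the verdict.
    static void recordStatus(Path statusFile, boolean good) throws IOException {
        String status = good ? "good" : "bad";
        Files.write(statusFile, status.getBytes(StandardCharsets.UTF_8));
    }

    // Called before the second pass; a missing status file is treated as "not good".
    static boolean isMarkedGood(Path statusFile) throws IOException {
        if (!Files.exists(statusFile)) {
            return false;
        }
        String status = new String(Files.readAllBytes(statusFile), StandardCharsets.UTF_8).trim();
        return status.equals("good");
    }

    public static void main(String[] args) throws IOException {
        Path statusFile = Files.createTempFile("input-status", ".txt");
        recordStatus(statusFile, true); // verdict of the first pass
        if (isMarkedGood(statusFile)) {
            System.out.println("second pass may start");
        } else {
            System.out.println("halt: input file is marked as bad");
        }
        Files.delete(statusFile);
    }
}
```

Treating a missing status file as "not good" is a design choice in this sketch, so that the second pass never runs on input that was not checked.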

By dynamic counter, I mean a counter group whose member names are determined at run time.
For example, below I define a counter group named fileSanity with two members:
inputFileName:NORMAL and inputFileName:MALFORMED.

public enum DataSanityType {
       NORMAL, //a good data line
       MALFORMED //a bad data line
}

In your mapper, add:

if (parsedOk)  // the line parsed successfully
    reporter.incrCounter("fileSanity", inputFileName + ":" + DataSanityType.NORMAL, 1);
else
    reporter.incrCounter("fileSanity", inputFileName + ":" + DataSanityType.MALFORMED, 1);
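
To make the naming scheme concrete, here is a small stand-alone sketch that mimics the counter group with a plain Map instead of a Hadoop Reporter. Everything in it is an assumption for illustration: the class name DynamicCounterSketch, the count helper, and the rule that a well-formed line has exactly three comma-separated fields.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Stand-alone illustration of counter names built at run time ("<file>:<type>").
// A Map plays the role of the "fileSanity" counter group; no Hadoop classes are used.
class DynamicCounterSketch {

    enum DataSanityType { NORMAL, MALFORMED }

    // Hypothetical parse rule: a good line has exactly three comma-separated fields.
    static final int EXPECTED_FIELDS = 3;

    static boolean parses(String line) {
        return line.split(",", -1).length == EXPECTED_FIELDS;
    }

    // Counts good and bad lines under names derived from the input file name.
    static Map<String, Long> count(String inputFileName, Iterable<String> lines) {
        Map<String, Long> fileSanity = new HashMap<>();
        for (String line : lines) {
            DataSanityType type = parses(line) ? DataSanityType.NORMAL : DataSanityType.MALFORMED;
            fileSanity.merge(inputFileName + ":" + type, 1L, Long::sum);
        }
        return fileSanity;
    }

    public static void main(String[] args) {
        Map<String, Long> counters = count("part-0001.csv", Arrays.asList("a,b,c", "broken", "1,2,3"));
        System.out.println(counters);
    }
}
```

In a real mapper the merge call would be replaced by the reporter.incrCounter call with the same group and member name.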

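In your reducer's close() function:

public void close() {
    // myReporter is the Reporter saved from an earlier reduce() call
    long normalCnt = myReporter.getCounter("fileSanity",
        inputFileName + ":" + DataSanityType.NORMAL).getValue();
    long malformedCnt = myReporter.getCounter("fileSanity",
        inputFileName + ":" + DataSanityType.MALFORMED).getValue();
    long totalCnt = normalCnt + malformedCnt;

    // Compare the malformed fraction (not an integer quotient) with the threshold
    if (totalCnt > 0 && (double) malformedCnt / totalCnt > threshold) {
        // mark the file as bad in an HDFS status file (a file you create)
    }
}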
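
The threshold comparison itself can be isolated into a small function, sketched below without any Hadoop dependency. The names ErrorThreshold and exceedsThreshold are hypothetical. One caveat, stated as my understanding rather than a certainty: a counter read inside a task reflects only that task's increments, so the job-wide totals would normally be read from the driver after the job finishes and then passed to a check like this one.

```java
// Hypothetical helper: decides whether a file should be marked bad,
// given the two counter totals and a threshold expressed as a fraction.
class ErrorThreshold {

    // Returns true when the malformed fraction is strictly above the threshold.
    // An empty input (no lines at all) is treated as not exceeding it.
    static boolean exceedsThreshold(long normalCnt, long malformedCnt, double threshold) {
        long totalCnt = normalCnt + malformedCnt;
        if (totalCnt == 0) {
            return false;
        }
        // Cast before dividing: long / long would truncate the fraction to 0.
        return (double) malformedCnt / totalCnt > threshold;
    }

    public static void main(String[] args) {
        // 10 malformed lines out of 100, with a 5% threshold.
        boolean bad = exceedsThreshold(90, 10, 0.05);
        System.out.println(bad ? "mark file as bad" : "mark file as good");
    }
}
```

For an absolute limit such as "error records = 10", the same shape works with a plain comparison of malformedCnt against the limit instead of a fraction.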

Hope this helps.

Mingxi




From: Mapred Learn [mailto:mapred.learn@gmail.com]
Sent: Tuesday, November 15, 2011 11:10 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: how to implement error thresholds in a map-reduce job ?

Hi Mingxi,
By dynamic counter, do you mean a custom counter, or is it a different kind of counter?

Also, I cannot do two passes, as I get to know about errors in a record only when I parse the line.
Thanks,
-JJ
On Mon, Nov 14, 2011 at 3:38 PM, Mingxi Wu <Mingxi.Wu@turn.com> wrote:
You can do two passes of the data.
The first map-reduce pass is sanity checking the data.
The second map-reduce pass does the real work, assuming the first pass accepts the file.

You can utilize the dynamic counter and define an enum type for error record categories.
In the mapper, you parse each line, and use the result to update the counter.

-Mingxi

From: Mapred Learn [mailto:mapred.learn@gmail.com]
Sent: Monday, November 14, 2011 3:06 PM
To: mapreduce-user@hadoop.apache.org
Subject: how to implement error thresholds in a map-reduce job ?

Hi,

I have a use case where I want to pass a threshold value to a map-reduce job, e.g. error
records = 10.

I want the map-reduce job to fail if the total count of error records in the job, i.e. across
all mappers, reaches that threshold.

How can I implement this, considering that each mapper processes only part of the input
data?

Thanks,
-JJ

