pig-dev mailing list archives

From Dmitriy Ryaboy <dvrya...@gmail.com>
Subject Re: Exception Handling in Pig Scripts
Date Tue, 18 Jan 2011 23:24:23 GMT
We should think more about the interface.
For example, the "Tuple input" argument -- is that the tuple that was passed
to the UDF, or the whole tuple that was being processed? I can see wanting
both.
Also, the Handler should probably have init and finish methods, in case some
accumulation is happening or state needs to get set up...

Not sure about "splitting" into a table. Maybe more like

A = FOREACH FOO GENERATE Bar(*) ON_ERROR [use] MyHandler SPLIT [into]
A_ERRORS;

"use" and "into" are optional syntactic sugar.

This allows us to do any combination of:
- die
- put original record into a table
- process the error using a custom handler (which can increment counters,
write to dbs, send tweets... definitely send tweets...)

D
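
[Editor's note: pulling the points above together, a handler with lifecycle
methods and internal state might be sketched like this. All names here are
hypothetical -- this is a sketch of the proposal, not an existing Pig
interface.]

```java
import java.io.IOException;

// Hypothetical handler interface with lifecycle methods, per the discussion above.
interface ErrorHandler {
    void init();    // set up any state before processing starts
    void handle(IOException ioe, String udfName, Object input) throws IOException;
    void finish();  // flush counters, close connections, etc.
}

// Example: tolerate up to maxErrors failures, then kill the job by rethrowing.
class ThresholdHandler implements ErrorHandler {
    private final int maxErrors;
    private int errors;

    ThresholdHandler(int maxErrors) { this.maxErrors = maxErrors; }

    public void init() { errors = 0; }

    public void handle(IOException ioe, String udfName, Object input) throws IOException {
        if (++errors > maxErrors) {
            throw new IOException(udfName + " failed " + errors + " times", ioe);
        }
        // below the threshold: count it, log it, send a tweet... and move on
    }

    public void finish() { System.out.println("errors seen: " + errors); }
}
```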

On Tue, Jan 18, 2011 at 10:27 AM, Julien Le Dem <ledemj@yahoo-inc.com> wrote:

> That would be nice.
> Also letting the error handler output the result to a relation would be
> useful.
> (To let the script output application error metrics)
> For example it could (optionally) use the keyword INTO just like the SPLIT
> operator.
>
> FOO = LOAD ...;
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;
>
> ErrorHandler would look a little more like EvalFunc:
>
> public interface ErrorHandler<T> {
>
>  public T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
> IOException;
>
> public Schema outputSchema(Schema input);
>
> }
>
> There could be a built-in handler to output the skipped record (input:
> tuple, funcname:chararray, errorMessage:chararray)
>
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;
>
> Julien
>
> On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <dvryaboy@gmail.com> wrote:
>
> I was thinking about this...
>
> We add an optional ON_ERROR clause to operators, which allows a user to
> specify error handling. The error handler would be a UDF that would
> implement an interface along these lines:
>
> public interface ErrorHandler {
>
>  public void handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
> IOException;
>
> }
>
> I think it makes sense not to make this a static method, so that users
> can keep required state -- for example, the handler could throw its own
> IOException if it's been invoked too many times.
>
> D
>
>
> > On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan <sms@yahoo-inc.com>
> > wrote:
>
> > Thanks for the clarification Ashutosh.
> >
> > Implementing this in the user realm is tricky, as Dmitriy states.
> > Sensitivity to error thresholds will require support from the system.
> > We can probably provide a taxonomy of records (good, bad, incomplete,
> > etc.) to let users classify each record. The system can then track
> > counts of each record type to facilitate the computation of thresholds.
> > The last part is to allow users to specify thresholds and appropriate
> > actions (interrupt, exit, continue, etc.). A possible mechanism to
> > realize this is the ErrorHandlingUDF described by Dmitriy.
> >
> > Santhosh
> >
> > -----Original Message-----
> > From: Ashutosh Chauhan [mailto:hashutosh@apache.org]
> > Sent: Friday, January 14, 2011 7:35 PM
> > To: user@pig.apache.org
> > Subject: Re: Exception Handling in Pig Scripts
> >
> > Santhosh,
> >
> > The way you are proposing, it will kill the Pig script. I think what
> > the user wants is to ignore a few "bad records" and to process the rest
> > and get results. The problem here is how to let the user tell Pig the
> > definition of a "bad record" and how to let them specify the threshold
> > of bad records at which Pig should fail the script.
> >
> > Ashutosh
> >
> > On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <sms@yahoo-inc.com>
> > wrote:
> > > Sorry about the late response.
> > >
> > > Hadoop n00b is proposing a language extension for error handling,
> > > similar to the mechanisms in other well-known languages like C++,
> > > Java, etc.
> > >
> > > For now, can't the error semantics be handled by the UDF? For
> > > exceptional scenarios you could throw an ExecException with the right
> > > details. The physical operator that handles the execution of UDFs
> > > traps it for you and propagates the error back to the client. You can
> > > take a look at any of the built-in UDFs to see how Pig handles it
> > > internally.
> > >
> > > Santhosh
> > >
> > > -----Original Message-----
> > > From: Dmitriy Ryaboy [mailto:dvryaboy@gmail.com]
> > > Sent: Tuesday, January 11, 2011 10:41 AM
> > > To: user@pig.apache.org
> > > Subject: Re: Exception Handling in Pig Scripts
> > >
> > > Right now error handling is controlled by the UDFs themselves, and
> > > there is no way to direct it externally.
> > > You can make an ErrorHandlingUDF that would take a UDF spec, invoke
> > > it, trap errors, and then do the specified error-handling behavior...
> > > that's a bit ugly, though.
> > >
> > > There is a problem with trapping general exceptions, of course, in
> > > that if they happen 0.000001% of the time you can probably just
> > > ignore them, but if they happen in half your dataset, you want the
> > > job to tell you something is wrong. So this stuff gets non-trivial.
> > > If anyone wants to propose a design to solve this general problem, I
> > > think that would be a welcome addition.
> > >
> > > D
> > >
> > > On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <new2hive@gmail.com>
> > > wrote:
> > >
> > >> Thanks, I sometimes get a date like 0001-01-01. This would be a valid
> > >> date format, but when I try to get the seconds between this and
> > >> another date, say 2011-01-01, I get an error that the value is too
> > >> large to fit into an int and the process stops. Do we have something
> > >> like ifError(x - y, null, x - y)? Or would I have to implement this
> > >> as a UDF?
> > >>
> > >> Thanks
> > >>
> > >> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <dvryaboy@gmail.com>
> > >> wrote:
> > >>
> > >> > Create a UDF that verifies the format, and go through a filtering
> > >> > step first.
> > >> > If you would like to save the malformed records so you can look
> > >> > at them later, you can use the SPLIT operator to route the good
> > >> > records to your regular workflow, and the bad records someplace on
> > >> > HDFS.
> > >> >
> > >> > -D
> > >> >
> > >> > On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <new2hive@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Hello,
> > >> > >
> > >> > > I have a Pig script that uses Piggybank to calculate date
> > >> > > differences. Sometimes, when I get a weird date or wrong format
> > >> > > in the input, the script throws an error and aborts.
> > >> > >
> > >> > > Is there a way I could trap these errors and move on without
> > >> > > stopping the execution?
> > >> > >
> > >> > > Thanks
> > >> > >
> > >> > > PS: I'm using CDH2 with Pig 0.5
> > >> > >
> > >> >
> > >>
> > >
> >
>
>
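
[Editor's note: the wrapper idea Dmitriy mentions upthread -- an
ErrorHandlingUDF that invokes an inner UDF, traps its errors, and fails only
past some bad-record threshold -- could be sketched roughly as below. The
names and the `Func` interface are hypothetical stand-ins, not the real Pig
EvalFunc API.]

```java
import java.io.IOException;

// Hypothetical stand-in for an evaluation function: maps one record to a value.
interface Func {
    Object exec(Object input) throws IOException;
}

// Wrapper that traps errors from the inner function and skips the record
// (returning null), failing only once the bad-record fraction gets too high.
class ErrorHandlingFunc implements Func {
    private final Func inner;
    private final double maxBadFraction;
    private final long minRecords;  // don't judge the fraction on tiny samples
    private long total, bad;

    ErrorHandlingFunc(Func inner, double maxBadFraction, long minRecords) {
        this.inner = inner;
        this.maxBadFraction = maxBadFraction;
        this.minRecords = minRecords;
    }

    public Object exec(Object input) throws IOException {
        total++;
        try {
            return inner.exec(input);
        } catch (IOException e) {
            bad++;
            if (total >= minRecords && (double) bad / total > maxBadFraction) {
                throw new IOException("bad-record fraction exceeded: " + bad + "/" + total, e);
            }
            return null;  // skip this record and move on
        }
    }
}

class WrapperDemo {
    public static void main(String[] args) throws IOException {
        // Inner "UDF": parse an int, throwing on malformed input.
        Func parse = in -> {
            try {
                return Integer.parseInt((String) in);
            } catch (NumberFormatException e) {
                throw new IOException("malformed: " + in, e);
            }
        };
        Func safe = new ErrorHandlingFunc(parse, 0.5, 10);
        for (String rec : new String[] {"1", "oops", "3"}) {
            System.out.println(rec + " -> " + safe.exec(rec));
        }
    }
}
```

This keeps the user's script running over "weird" records while still
surfacing systemic failures, which is the trade-off discussed above.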
