drill-dev mailing list archives

From Jason Altekruse <altekruseja...@gmail.com>
Subject Re: Identifying the source of problematic records
Date Fri, 04 Sep 2015 01:40:49 GMT
I was thinking we would just put a catch around the calls to evaluate the
generated code and re-evaluate each individual expression with the
interpreter to find out which one caused the exception.

Thinking about it a little more, the call to the generated code actually
happens inside the loop in the ProjectTemplate/FilterTemplate classes
today. That is where the index into the current batch is known, but the
list of expressions is not available at that level. We might have to add
an interface to the Template that exposes the last index it tried to
evaluate. The RecordBatch, which has access to the expressions needed to
materialize the interpreter, could then use that index to re-evaluate the
correct row.
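
To make the idea concrete, here is a rough, self-contained sketch of the
catch-and-re-evaluate flow. None of it is Drill code: Expr, Template,
lastIndex() and project() are hypothetical stand-ins for the generated
template, the proposed "last index" interface and the RecordBatch, and plain
lambdas play the role of the interpreter.

```java
import java.util.List;
import java.util.function.Function;

public class ProblemRecordSketch {

    /** Hypothetical expression: its source text plus an interpreted evaluator for one row. */
    static final class Expr {
        final String text;
        final Function<String[], Object> eval;

        Expr(String text, Function<String[], Object> eval) {
            this.text = text;
            this.eval = eval;
        }
    }

    /** Hypothetical stand-in for a generated template: loops over the batch and remembers the row index. */
    static final class Template {
        private final List<Expr> compiled;
        private int lastIndex = -1;

        Template(List<Expr> compiled) {
            this.compiled = compiled;
        }

        void projectRecords(String[][] batch) {
            for (int i = 0; i < batch.length; i++) {
                lastIndex = i; // recorded before evaluating, so it survives an exception
                for (Expr e : compiled) {
                    e.eval.apply(batch[i]);
                }
            }
        }

        /** The proposed accessor: last row index the loop attempted. */
        int lastIndex() {
            return lastIndex;
        }
    }

    /** Plays the RecordBatch role: fast path first, per-expression re-evaluation only on failure. */
    static String project(List<Expr> exprs, String[][] batch) {
        Template template = new Template(exprs);
        try {
            template.projectRecords(batch);
            return "ok";
        } catch (RuntimeException fast) {
            int row = template.lastIndex();
            for (Expr e : exprs) {
                try {
                    e.eval.apply(batch[row]);
                } catch (RuntimeException slow) {
                    return "row " + row + ": " + e.text + " failed: " + slow.getMessage();
                }
            }
            throw fast; // could not reproduce the failure expression by expression
        }
    }

    public static void main(String[] args) {
        List<Expr> exprs = List.of(
            new Expr("cast(columns[0] as int)", r -> Integer.parseInt(r[0])),
            new Expr("cast(columns[1] as bigint)", r -> Long.parseLong(r[1])));
        String[][] batch = { {"10", "2001"}, {"11", "http://www.cnn.com"} };
        System.out.println(project(exprs, batch));
    }
}
```

With the two rows from Aman's t4.csv, the fallback path reports row 1 and the
bigint cast as the failing expression. The slow per-expression pass only runs
after an exception, so the normal path pays nothing beyond recording the index.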

On Thu, Sep 3, 2015 at 6:31 PM, Jacques Nadeau <jacques@dremio.com> wrote:

> Interesting idea.  The question I have is how would this work when you have
> a combination of generated code related to expressions and code not related
> to expressions.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Thu, Sep 3, 2015 at 11:31 AM, Jason Altekruse <altekrusejason@gmail.com> wrote:
>
> > @Jacques,
> >
> > On your point a) about expressing failures and the compilation model, I
> > had previously thought about using the interpreter to figure out which
> > expression failed against the current row, once we have caught an
> > exception out of some part of the complete code-generated expression
> > evaluation. Do you think this would address your concern? Do you think
> > anything more than the problematic input data and the expression that
> > failed would be produced by the functions in this new standardized error
> > format?
> >
> > - Jason
> >
> > On Wed, Sep 2, 2015 at 8:43 PM, Jacques Nadeau <jacques@dremio.com> wrote:
> >
> > > I'd like to propose a few things to solve this:
> > >
> > > a) Functions should be able to express failures in a standardized way.
> > > I'm thinking a new type of injectable and/or a certain type of exception
> > > (although the latter is more dangerous and possibly requires a rewrite,
> > > given the compilation model).
> > > b) Users (session/system level) should be able to set a setting where
> > > function errors are handled a certain way. Options could include query
> > > failure, ignore + inform as warning/notice, and save records for later
> > > analysis (maybe in v2).
> > > c) Readers that have a notorious problem (e.g. Text) should support
> > > projection/expression pushdown so that they can create these kinds of
> > > errors and provide additional context as part of that.
> > > d) We should also implement dot drill files so that users can prescribe
> > > this projection/data validation process by default for files/directories
> > > (which would provide the same behavior as c above).
> > > e) We should get more serious about providing useful virtual fields.
> > > This should include filename (similar to directory name).
> > >
> > > Once a record leaves an operator, I don't think we should carry any
> > > additional provenance with it. It would be too heavy weight as a default
> > > behavior.
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <asinha@maprtech.com> wrote:
> > >
> > > > Drill can point out the filename and location of corrupted records in
> > > > a file but we don't have a good mechanism to deal with the following
> > > > scenario:
> > > >
> > > > Consider a text file with 2 records:
> > > > $ cat t4.csv
> > > > 10,2001
> > > > 11,http://www.cnn.com
> > > >
> > > > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> > > >
> > > > 0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as bigint) from dfs.`/Users/asinha/data/t4.csv`;
> > > >
> > > > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> > > >
> > > > Fragment 0:0
> > > >
> > > > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
> > > >
> > > >   (java.lang.NumberFormatException) http://www.cnn.com
> > > >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> > > >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> > > >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> > > >     org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> > > >     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> > > >
> > > > The problem is that the user has no clue about the original source of
> > > > this error. This is a pain point, especially when dealing with
> > > > thousands of files.
> > > >
> > > > 1.  We can start by providing the column index where the problem
> > > >      occurred.
> > > > 2.  Can a scan batch keep track of the file it originated from? Since
> > > >      the Project in the above query is pushed right above the scan, it
> > > >      could get the filename from the record batch (assuming we can
> > > >      store this piece of information). This won't be possible for
> > > >      other Projects elsewhere in the plan.
> > > > 3.  What about the location within the file? Unless the projection is
> > > >      pushed into the scan itself, I don't see a good way to provide
> > > >      this information.
> > > >
> > > > A related topic is how to tell Drill to ignore such records when
> > > > doing a query or a CTAS?
> > > > That could be a separate discussion.
> > > >
> > > > Thoughts?
> > > > Aman