drill-dev mailing list archives

From Jacques Nadeau <jacq...@dremio.com>
Subject Re: Identifying the source of problematic records
Date Fri, 04 Sep 2015 01:31:40 GMT
Interesting idea.  The question I have is how would this work when you have
a combination of generated code related to expressions and code not related
to expressions.
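
For concreteness, here is a rough sketch of the interpreter-fallback idea quoted below. It is not Drill code: `FallbackDiagnosis`, `NamedExpr`, `fastPath`, and `diagnose` are hypothetical names, and the "generated code" is simulated by a plain loop. The last branch is the case the question above is about, where the failure comes from code that is not tied to any single expression.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch only; none of these types exist in Drill.
final class FallbackDiagnosis {

    // One expression plus its source text, so a failure can be reported by name.
    static final class NamedExpr {
        final String text;
        final Function<String[], Object> eval;

        NamedExpr(String text, Function<String[], Object> eval) {
            this.text = text;
            this.eval = eval;
        }
    }

    // Stand-in for the fused, code-generated projector: it evaluates every
    // expression in one pass, so an exception does not say which one failed.
    static Object[] fastPath(List<NamedExpr> exprs, String[] row) {
        Object[] out = new Object[exprs.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = exprs.get(i).eval.apply(row);
        }
        return out;
    }

    // Returns null on success. On failure, re-runs each expression on its own
    // (the "interpreter" pass) to name the expression and the offending row.
    static String diagnose(List<NamedExpr> exprs, String[] row) {
        try {
            fastPath(exprs, row);
            return null;
        } catch (RuntimeException fast) {
            for (NamedExpr e : exprs) {
                try {
                    e.eval.apply(row);
                } catch (RuntimeException slow) {
                    return "expression [" + e.text + "] failed on row ["
                            + String.join(",", row) + "]: " + slow;
                }
            }
            // The failure came from code that is not tied to one expression.
            return "failure not reproducible per expression: " + fast;
        }
    }
}
```

The second pass only runs after a failure, so the normal path pays nothing extra; the open question is what to report when that pass finds no culprit.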

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Sep 3, 2015 at 11:31 AM, Jason Altekruse <altekrusejason@gmail.com>
wrote:

> @Jacques,
>
> On your point a) about expressing failures and the compilation model, I had
> thought about previously using the interpreter to figure out which
> expression against the current row failed, once we have caught an exception
> out of some part of the complete code-generated expression evaluation. Do
> you think this would possibly address your concern? Do you think anything
> more than the problematic input data and the expression that failed would
> be produced by the functions in this new standardized error format?
>
> - Jason
>
> On Wed, Sep 2, 2015 at 8:43 PM, Jacques Nadeau <jacques@dremio.com> wrote:
>
> > I'd like to propose a few things to solve this:
> >
> > a) Functions should be able to express failures in a standardized way. I'm
> > thinking a new type of injectable and/or a certain type of exception
> > (although more dangerous/possibly requires rewrite given compilation
> > model).
> > b) Users (session/system level) should be able to set a setting where
> > function errors are handled a certain way. Options could include query
> > failure, ignore + inform as warning/notice, and save records for later
> > analysis (maybe in v2).
> > c) Readers that have a notorious problem (e.g. Text) should support
> > projection/expression pushdown so that they can create these kinds of
> > errors and provide additional context as part of that.
> > d) We should also implement dot drill files so that users can prescribe
> > this projection/data validation process by default for files/directories
> > (which would provide the same behavior as c above).
> > e) We should get more serious about providing useful virtual fields. This
> > should include filename (similar to directory name).
> >
> > Once a record leaves an operator, I don't think we should carry any
> > additional provenance with it. It would be too heavy weight as a default
> > behavior.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Tue, Sep 1, 2015 at 9:08 AM, Aman Sinha <asinha@maprtech.com> wrote:
> >
> > > Drill can point out the filename and location of corrupted records in a
> > > file but we don't have a good mechanism to deal with the following
> > > scenario:
> > >
> > > Consider a text file with 2 records:
> > > $ cat t4.csv
> > > 10,2001
> > > 11,http://www.cnn.com
> > >
> > > 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> > >
> > > 0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as bigint)
> > > from dfs.`/Users/asinha/data/t4.csv`;
> > >
> > > Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> > >
> > > Fragment 0:0
> > >
> > > [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
> > >
> > >   (java.lang.NumberFormatException) http://www.cnn.com
> > >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
> > >     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
> > >     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
> > >     org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
> > >     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> > >
> > > The problem is that the user does not have a clue about the original
> > > source of this error.  This is a pain point especially when dealing with
> > > thousands of files.
> > >
> > > 1.  We can start by providing the column index where the problem occurred.
> > > 2.  Can a scan batch keep track of the file it originated from? Since the
> > >     Project in the above query is pushed right above the scan, it could get
> > >     the filename from the record batch (assuming we can store this piece of
> > >     information).  This won't be possible for other Projects elsewhere in
> > >     the plan.
> > > 3.  What about the location within the file? Unless the projection is
> > >     pushed into the scan itself, I don't see a good way to provide this
> > >     information.
> > >
> > > A related topic is how to tell Drill to ignore such records when doing a
> > > query or a CTAS.
> > > That could be a separate discussion.
> > >
> > > Thoughts ?
> > > Aman
> > >
> >
>
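
The error-handling setting proposed in (b), combined with the column index and filename context asked for in the original message, might be sketched roughly as follows. This is an illustration under stated assumptions, not Drill's implementation: `CastErrorPolicy`, `Mode`, `Result`, and `castColumn` are hypothetical names, the reader is reduced to an in-memory list of rows, and the "save records for later analysis" option is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; these names are not part of Drill.
final class CastErrorPolicy {

    // Two of the handling options from (b): fail the query, or skip and warn.
    enum Mode { FAIL, WARN }

    static final class Result {
        final List<Long> values = new ArrayList<>();
        final List<String> warnings = new ArrayList<>();
    }

    // Casts one column to bigint at the point where the file name and record
    // position are still known, i.e. as if the projection ran inside the reader.
    static Result castColumn(String file, List<String[]> rows, int col, Mode mode) {
        Result result = new Result();
        for (int i = 0; i < rows.size(); i++) {
            String raw = rows.get(i)[col];
            try {
                result.values.add(Long.parseLong(raw));
            } catch (NumberFormatException e) {
                String context = file + ": record " + (i + 1) + ", column " + col
                        + ": cannot cast '" + raw + "' to bigint";
                if (mode == Mode.FAIL) {
                    throw new IllegalStateException(context, e);
                }
                // WARN: drop the record and keep a notice for the user.
                result.warnings.add(context);
            }
        }
        return result;
    }
}
```

With the two records from `t4.csv` and `Mode.WARN`, the second record is skipped and reported with its file, record number, and column; with `Mode.FAIL` the same context travels with the exception instead of a bare `NumberFormatException`.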
