drill-issues mailing list archives

From "Mehant Baid (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3764) Support the ability to identify and/or skip records when a function evaluation fails
Date Fri, 09 Oct 2015 07:09:26 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950005#comment-14950005 ]

Mehant Baid commented on DRILL-3764:
------------------------------------

I previously worked with [~jnadeau] on a framework for handling errors in function evaluation:
annotations for errors in the function template, plus the necessary additions to the runtime
code generation. Here is the branch:
https://github.com/mehant/drill/commit/3e81a776d1c1bb0ce7f64d8c5a905c87d71e42e0 (this is old
and most likely won't rebase cleanly; I can work on rebasing it if deemed useful). The basic idea
was to provide a way to specify different types of errors within the UDF and, in case of an
error, to use null for that row.
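
To illustrate the "use null for that row" behaviour, here is a minimal standalone sketch in plain Java. It does not use Drill's UDF interfaces or the annotations from the branch above; the class and method names (NullOnErrorCast, castToBigintOrNull, project) are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not Drill's function template or code generator.
class NullOnErrorCast {

    // Tries to cast a text value to a 64-bit integer.
    // On a parse failure it returns null instead of throwing,
    // so one bad value does not abort the whole projection.
    static Long castToBigintOrNull(String value) {
        if (value == null) {
            return null;
        }
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            return null; // error in function evaluation: emit null for this row
        }
    }

    // Applies the null-on-error cast to every value of a column.
    static List<Long> project(List<String> column) {
        List<Long> out = new ArrayList<>(column.size());
        for (String value : column) {
            out.add(castToBigintOrNull(value));
        }
        return out;
    }
}
```

With the second column of t4.csv from the issue description ("2001" and "http://www.cnn.com"), project returns a list holding 2001 followed by null, rather than failing with a NumberFormatException.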

> Support the ability to identify and/or skip records when a function evaluation fails
> ------------------------------------------------------------------------------------
>
>                 Key: DRILL-3764
>                 URL: https://issues.apache.org/jira/browse/DRILL-3764
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Functions - Drill
>    Affects Versions: 1.1.0
>            Reporter: Aman Sinha
>             Fix For: Future
>
>
> Drill can point out the filename and location of corrupted records in a file but it does
> not have a good mechanism to deal with the following scenario: 
> Consider a text file with 2 records:
> {code}
> $ cat t4.csv
> 10,2001
> 11,http://www.cnn.com
> {code}
> {code}
> 0: jdbc:drill:zk=local> alter session set `exec.errors.verbose` = true;
> 0: jdbc:drill:zk=local> select cast(columns[0] as int), cast(columns[1] as bigint) from dfs.`t4.csv`;
> Error: SYSTEM ERROR: NumberFormatException: http://www.cnn.com
> Fragment 0:0
> [Error Id: 72aad22c-a345-4100-9a57-dcd8436105f7 on 10.250.56.140:31010]
>   (java.lang.NumberFormatException) http://www.cnn.com
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.nfeL():91
>     org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.varCharToLong():62
>     org.apache.drill.exec.test.generated.ProjectorGen1.doEval():62
>     org.apache.drill.exec.test.generated.ProjectorGen1.projectRecords():62
>     org.apache.drill.exec.physical.impl.project.ProjectRecordBatch.doWork():172
> {code}
> The problem is that the user has no context for where the error occurred: neither the
> file name nor the record number. This becomes a pain point especially when CTAS is being
> used to convert data from (say) text format to Parquet format. The CTAS may be accessing
> thousands of files, and a single failed cast (or other function) aborts the query.
> It would substantially improve the user experience if we provided: 
> 1) the filename and record number where this failure occurred
> 2) the ability to skip such records depending on a session option
> 3) the ability to write such records to a staging table for future ingestion
> Please see discussion on dev list: 
> http://mail-archives.apache.org/mod_mbox/drill-dev/201509.mbox/%3cCAFyDVvLuPLgTNZ56S6=J=9Vb=aBs=pDw7NRHKkdUPbdxGFAdcg@mail.gmail.com%3e
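
The three capabilities requested in the description above (error context, optional skipping, and a staging area for rejected records) can be sketched outside of Drill as follows. This is plain Java under stated assumptions, not Drill code: the skipBadRecords flag stands in for a hypothetical session option, the staging list stands in for a staging table, and all class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: none of these names exist in Drill.
class SkipBadRecordsSketch {

    // A rejected record together with the context the issue asks for.
    static final class Rejected {
        final String fileName;
        final int recordNumber; // 1-based position of the record within the file
        final String rawLine;
        final String reason;

        Rejected(String fileName, int recordNumber, String rawLine, String reason) {
            this.fileName = fileName;
            this.recordNumber = recordNumber;
            this.rawLine = rawLine;
            this.reason = reason;
        }
    }

    // Casts both columns of each two-column CSV line to long.
    // If skipBadRecords is false, the first failure aborts with an error that
    // names the file and record number. If it is true, the failing record is
    // added to the staging list and processing continues.
    static List<long[]> convert(String fileName, List<String> lines,
                                boolean skipBadRecords, List<Rejected> staging) {
        List<long[]> out = new ArrayList<>();
        int recordNumber = 0;
        for (String line : lines) {
            recordNumber++;
            String[] cols = line.split(",", -1);
            try {
                if (cols.length < 2) {
                    throw new NumberFormatException("expected 2 columns, found " + cols.length);
                }
                out.add(new long[] { Long.parseLong(cols[0]), Long.parseLong(cols[1]) });
            } catch (NumberFormatException e) {
                if (!skipBadRecords) {
                    throw new IllegalStateException("Cast failed in " + fileName
                            + " at record " + recordNumber + ": " + e.getMessage(), e);
                }
                staging.add(new Rejected(fileName, recordNumber, line, e.getMessage()));
            }
        }
        return out;
    }
}
```

Run against the two records of t4.csv with skipBadRecords set to true, convert returns the first record and places the second in staging with record number 2. With the flag set to false, it fails with a message that names t4.csv and record 2.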



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
