drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stefán Baxter (JIRA) <j...@apache.org>
Subject [jira] [Created] (DRILL-3561) The return type of UDF seems ambigious and varchar results can force arithmetic comparison
Date Mon, 27 Jul 2015 06:49:04 GMT
Stefán Baxter created DRILL-3561:
------------------------------------

             Summary: The return type of UDF seems ambigious and varchar results can force
arithmetic  comparison
                 Key: DRILL-3561
                 URL: https://issues.apache.org/jira/browse/DRILL-3561
             Project: Apache Drill
          Issue Type: Bug
          Components: SQL Parser
            Reporter: Stefán Baxter
            Assignee: Aman Sinha


Please see information in the following user group email where dir0, containing "2011-2012"
is being compared to a varchar/null result of a UDF and Drill tries to convert dir0 to a number
and fails.

======================================================
I'm still working on our evaluation and now focusing on directory based queries and mixing
directory and parquet based partitions.
This is also a continued trip down the UDF rabbit hole :) (some pun intended)

I continue to come across things that surprise me and I would like to share them with both
the developers, that might want to address some of them, and other newcomers that might benefit
from them.

The UDF code referenced here van be found at (https://github.com/acmeguy/asdrill) and the
documents and the directory structure used in these examples are included in the tiny attachment.

I will try to keep this as brief as possible.

What you need to know is:
there are 33 records in the 19 files in a mixed directory structure 
- see zip for details

each record contains a date that is valid within that directory structure

the dirInRange function is a UDF that takes a date range and directory information to determine
if a directory contains target data for the provided date range
- see github for details

the dirInRange function should be able to accept all null values or missing parameters for
everything other than the time range starts parameter

the dirInRange function returns a string with a number that represents the number of parameters
used (function variant)
- has no other purpose/function at this point
- will return the value of the last dirN paramater that is not null (dir0, dir1 or dir2)
Observations

1. The UDF function (dirInRange) is not called if dir0, dir1 or dir2 are missing (missing
is not the same as null here)
select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analytics/processed/test/events`
as t order by occurred_at;
- return 33 records

select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analytics/processed/test/events`
as t where dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-'))
> '0' order by occurred_at;
- returns 33 record (Coalesce handles the missing values are replaces them with '-')

select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analytics/processed/test/events`
as t where dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as timestamp),dir0,dir1,dir2)
> '0' order by occurred_at;
- returns 13 records (only those in the deepest directories where dir0, dir1, dir2 are all
set

select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay, dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11'
as timestamp),dir0,dir1,dir2) inRange from dfs.tmp.`/analytics/processed/test/events` as t
order by occurred_at;
- return 33 records but 20 of the records will have inRange set to null (the UDF never returns
null so it's being ignored completely)

Lesson: It's not enough to use Nullable*Holder in UDF and have all permutations covered
- Drill will not call the function and fails silently, evaluating the outcome of the function
to null, if any of the dirN parameters are not initialized
2. System.out.print out is the way to get information from within the Drillbit
It would be good to know which Drillbit instance, if many, is responsible for the println
- I don't know how to get the parent drillbit injected into the UDF
3. If directories have numeric names then Drill starts to insist they are all numeric (in
the where condition) event though dirInRange always returns a varchar.
select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analytics/processed/test/events`
as t where dir0 = dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-'))
order by occurred_at;
"2011,2012" is the name of the directory (same happens with directories ("Q1" and "W1" etc.)
java.lang.RuntimeException: java.sql.SQLException: SYSTEM ERROR: NumberFormatException: 2011,2012
Fragment 0:0
[Error Id: 0c3e1370-ccc5-4288-b6c9-ea0ef4884f1e on localhost:31010]

This seems to fail on the other side where Drill thinks that the outcome of the dirInRange
function is numeric and that the "=" expression is a numerical one.

this runs though: select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analytics/processed/test/events`
as t where dir0 = trim(dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as timestamp),COALESCE(dir2,'-'),COALESCE(dir2,'-'),COALESCE(dir2,'-')))
order by occurred_at;
- here the trim() function takes care of making sure that the UDF returns a string (I think)
even though the return type of the UDF is always a varcharholder.
4. Directories do not fail and all files and all their records are evaluated
select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analytics/processed/test/events`
as t where dir0 = trim(dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as timestamp),COALESCE(dir2,'-'),COALESCE(dir2,'-'),COALESCE(dir2,'-')))
order by occurred_at;
- correctly returns no records (return value of dirInRange never matches dir0)

This still evaluates all the records in all the files in all the directories
- no partition pruning is available for selecting from directories
- massive performance/optimization gain to be had
5. File name is not available 
It might be good to have dynamic filename variable available (just as the directory variables)

Point number 4 bugs the living daylights out of me and I hope to share something constructive
on that soon.

There are 3+ Jira items in there somewhere and I will create what I believe to be correct
tasks but please feel free to correct them for me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message