Return-Path: X-Original-To: apmail-drill-dev-archive@www.apache.org Delivered-To: apmail-drill-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E373F18A04 for ; Mon, 27 Jul 2015 06:49:04 +0000 (UTC) Received: (qmail 1098 invoked by uid 500); 27 Jul 2015 06:49:04 -0000 Delivered-To: apmail-drill-dev-archive@drill.apache.org Received: (qmail 1040 invoked by uid 500); 27 Jul 2015 06:49:04 -0000 Mailing-List: contact dev-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.apache.org Delivered-To: mailing list dev@drill.apache.org Received: (qmail 994 invoked by uid 99); 27 Jul 2015 06:49:04 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 27 Jul 2015 06:49:04 +0000 Date: Mon, 27 Jul 2015 06:49:04 +0000 (UTC) From: =?utf-8?Q?Stef=C3=A1n_Baxter_=28JIRA=29?= To: dev@drill.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Created] (DRILL-3561) The return type of UDF seems ambigious and varchar results can force arithmetic comparison MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 Stef=C3=A1n Baxter created DRILL-3561: ------------------------------------ Summary: The return type of UDF seems ambigious and varchar re= sults can force arithmetic comparison Key: DRILL-3561 URL: https://issues.apache.org/jira/browse/DRILL-3561 Project: Apache Drill Issue Type: Bug Components: SQL Parser Reporter: Stef=C3=A1n Baxter Assignee: Aman Sinha Please see information in the following user group email where dir0, contai= ning "2011-2012" is being compared to a varchar/null result of a UDF and Dr= ill tries to convert dir0 to a number and fails. =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D I'm still working on our evaluation and now focusing on directory based que= ries and mixing directory and parquet based partitions. This is also a continued trip down the UDF rabbit hole :) (some pun intende= d) I continue to come across things that surprise me and I would like to share= them with both the developers, that might want to address some of them, an= d other newcomers that might benefit from them. The UDF code referenced here van be found at (https://github.com/acmeguy/as= drill) and the documents and the directory structure used in these examples= are included in the tiny attachment. I will try to keep this as brief as possible. What you need to know is: there are 33 records in the 19 files in a mixed directory structure=20 - see zip for details each record contains a date that is valid within that directory structure the dirInRange function is a UDF that takes a date range and directory info= rmation to determine if a directory contains target data for the provided d= ate range - see github for details the dirInRange function should be able to accept all null values or missing= parameters for everything other than the time range starts parameter the dirInRange function returns a string with a number that represents the = number of parameters used (function variant) - has no other purpose/function at this point - will return the value of the last dirN paramater that is not null (dir0, = dir1 or dir2) Observations 1. The UDF function (dirInRange) is not called if dir0, dir1 or dir2 are mi= ssing (missing is not the same as null here) select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analy= tics/processed/test/events` as t order by occurred_at; - return 33 records select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analy= tics/processed/test/events` as t where dirInRange(cast('2015-04-10' as time= stamp),cast('2015-07-11' as timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'= ),COALESCE(dir2,'-')) > '0' order by occurred_at; - returns 33 record (Coalesce handles the missing values are replaces them = with '-') select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analy= tics/processed/test/events` as t where dirInRange(cast('2015-04-10' as time= stamp),cast('2015-07-11' as timestamp),dir0,dir1,dir2) > '0' order by occur= red_at; - returns 13 records (only those in the deepest directories where dir0, dir= 1, dir2 are all set select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay, dirInRange(cast('20= 15-04-10' as timestamp),cast('2015-07-11' as timestamp),dir0,dir1,dir2) inR= ange from dfs.tmp.`/analytics/processed/test/events` as t order by occurred= _at; - return 33 records but 20 of the records will have inRange set to null (th= e UDF never returns null so it's being ignored completely) Lesson: It's not enough to use Nullable*Holder in UDF and have all permutat= ions covered - Drill will not call the function and fails silently, evaluating the outco= me of the function to null, if any of the dirN parameters are not initializ= ed 2. System.out.print out is the way to get information from within the Drill= bit It would be good to know which Drillbit instance, if many, is responsible f= or the println - I don't know how to get the parent drillbit injected into the UDF 3. If directories have numeric names then Drill starts to insist they are a= ll numeric (in the where condition) event though dirInRange always returns = a varchar. select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analy= tics/processed/test/events` as t where dir0 =3D dirInRange(cast('2015-04-10= ' as timestamp),cast('2015-07-11' as timestamp),COALESCE(dir0,'-'),COALESCE= (dir1,'-'),COALESCE(dir2,'-')) order by occurred_at; "2011,2012" is the name of the directory (same happens with directories ("Q= 1" and "W1" etc.) java.lang.RuntimeException: java.sql.SQLException: SYSTEM ERROR: NumberForm= atException: 2011,2012 Fragment 0:0 [Error Id: 0c3e1370-ccc5-4288-b6c9-ea0ef4884f1e on localhost:31010] This seems to fail on the other side where Drill thinks that the outcome of= the dirInRange function is numeric and that the "=3D" expression is a nume= rical one. this runs though: select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay fr= om dfs.tmp.`/analytics/processed/test/events` as t where dir0 =3D trim(dirI= nRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as timestamp),COAL= ESCE(dir2,'-'),COALESCE(dir2,'-'),COALESCE(dir2,'-'))) order by occurred_at= ; - here the trim() function takes care of making sure that the UDF returns a= string (I think) even though the return type of the UDF is always a varcha= rholder. 4. Directories do not fail and all files and all their records are evaluate= d select occurred_at, dir0 dYear, dir1 dMonth, dir2 dDay from dfs.tmp.`/analy= tics/processed/test/events` as t where dir0 =3D trim(dirInRange(cast('2015-= 04-10' as timestamp),cast('2015-07-11' as timestamp),COALESCE(dir2,'-'),COA= LESCE(dir2,'-'),COALESCE(dir2,'-'))) order by occurred_at; - correctly returns no records (return value of dirInRange never matches di= r0) This still evaluates all the records in all the files in all the directorie= s - no partition pruning is available for selecting from directories - massive performance/optimization gain to be had 5. File name is not available=20 It might be good to have dynamic filename variable available (just as the d= irectory variables) Point number 4 bugs the living daylights out of me and I hope to share some= thing constructive on that soon. There are 3+ Jira items in there somewhere and I will create what I believe= to be correct tasks but please feel free to correct them for me. -- This message was sent by Atlassian JIRA (v6.3.4#6332)