Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 8D451200B68 for ; Fri, 5 Aug 2016 04:44:22 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 8BB25160AAE; Fri, 5 Aug 2016 02:44:22 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id AC169160AAB for ; Fri, 5 Aug 2016 04:44:21 +0200 (CEST) Received: (qmail 40315 invoked by uid 500); 5 Aug 2016 02:44:20 -0000 Mailing-List: contact issues-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.apache.org Delivered-To: mailing list issues@drill.apache.org Received: (qmail 40305 invoked by uid 99); 5 Aug 2016 02:44:20 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 05 Aug 2016 02:44:20 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id A0F8E2C0D5D for ; Fri, 5 Aug 2016 02:44:20 +0000 (UTC) Date: Fri, 5 Aug 2016 02:44:20 +0000 (UTC) From: "ASF GitHub Bot (JIRA)" To: issues@drill.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (DRILL-4704) select statement behavior is inconsistent for decimal values in parquet MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Fri, 05 Aug 2016 02:44:22 -0000 [ https://issues.apache.org/jira/browse/DRILL-4704?page=3Dcom.atlassian= .jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=3D1540= 8789#comment-15408789 ]=20 ASF GitHub Bot commented on DRILL-4704: --------------------------------------- Github user daveoshinsky commented on the issue: https://github.com/apache/drill/pull/517 =20 Thanks for looking at this, Paul. =C2=A0If there are problems with comp= aring an input decimal (from the "where clause", for example) against the d= ata, it is because of the algorithms for comparing decimal values of differ= ent precisions, not because the proper precision was computed for the input= decimal (e.g., based on the actual numerical value, as my PR-517 fix is do= ing when no precision is provided initially). =C2=A0I was going to comment = about the idea of using a fixed value of precision=3D10, but you beat me to= it. =C2=A0It does not suffice for larger integers, like those that require= a "long" instead of an "int" to represent. =C2=A0So, I suggest that we tak= e a "divide and conquer" approach with this problem. =C2=A0First, compute a= proper precision for the value casted from int or BigInt. =C2=A0Next, make= sure the comparison with other decimal values works for any other decimal = value precision (a separate issue). Decimal support in Drill is disabled by default for good reasons. =C2= =A0The (decimal) design itself is fundamentally flawed, and it has all sort= s of issues related to code maintainability as well as runtime performance.= =C2=A0I have been experimenting in a Drill "play build node" with having a= single, one-size-fits-all variable width vector class that represents any = decimal value efficiently (I call it "VarDecimal"). =C2=A0It's difficult to= even get this to compile (with all of the FreeMarker use, and the numerous= conversions between all of the decimal and non-decimal types), but eventua= lly I hope to experiment with this to deal with the DRILL-4184 problem I en= countered a number of months ago (for which I have a pull request PR-372 wi= th a short-term, but not a clean, fix). As food for thought, while looking at this experimental "VarDecimal" th= ing, I noticed that there are (numerous) casts/conversions between all of t= he pairwise combinations of different decimal precisions (DECIMAL9, DECIMAL= 18, DECIMAL28, DECIMAL38, both dense and sparse, from what I recall), and b= etween all of these and all other types that have numeric interpretation. = =C2=A0Eventually replacing all of these different DECIMAL* fixed width repr= esentations with a single VarDecimal variable width representation (one-siz= e-fits-all) would be much more efficient (memory-wise) at runtime (for typi= cal scenarios where most actual numeric values don't require the full preci= sion to represent), would greatly simplify parts of the Drill code, and wou= ld fix DRILL-4184 cleanly.=20 Until some major re-work can be done for Drill decimal support, we will= probably have to settle for small, incremental improvements. =C2=A0I would= say that PR-517 represents one such improvement. =20 On Thursday, August 4, 2016 9:31 PM, Paul Rogers wrote: =20 =20 The plot thickens. I tried the fix of setting the precision to a const= ant of 10. This uncovered a larger issue. The template in question generate= s cast functions for (INTEGER, BIGINT) x (DECIMAL9, DECIMAL18, DECIMAL28) a= nd perhaps others. The constant of 10 does not, of course, work for BIGINT = (long) values.The trick is that precision=3D10 won't work for DECIMAL9 eith= er. Dave's solution has a similar problem. Dave sets the precision to whate= ver is right for the input value, which seems great. But, that value could = be too large for the output DECIMAL type.What we need is to set the precisi= on to the min( max int precision, max decimal precision ). Or, if we use Da= ve's proposed solution, max( input arg precision, max decimal precision ).I= n either case, the code must handle overflow. Passing a Long.MAX_VALUE or e= ven Integer.MAX_VALUE to CastBigIntDecimal9( ) should cause an overflow err= or or data truncation. I'll research how that worked previously to see if w= e've uncovered a new issue, or if a solution already exists.=E2=80=94 You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread. = =20 =20 =20 > select statement behavior is inconsistent for decimal values in parquet > ----------------------------------------------------------------------- > > Key: DRILL-4704 > URL: https://issues.apache.org/jira/browse/DRILL-4704 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill > Affects Versions: 1.6.0 > Environment: Windows 7 Pro, Java 1.8.0_91 > Reporter: Dave Oshinsky > Fix For: Future > > > A select statement that searches a parquet file for a decimal value match= ing a specific value behaves inconsistently. The query expressed most simp= ly finds nothing: > 0: jdbc:drill:zk=3Dlocal> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` w= here employee_id =3D 100; > +--------------+-------------+------------+--------+---------------+-----= ------+ > | EMPLOYEE_ID | FIRST_NAME | LAST_NAME | EMAIL | PHONE_NUMBER | HIRE= _DATE | > +--------------+-------------+------------+--------+---------------+-----= ------+ > +--------------+-------------+------------+--------+---------------+-----= ------+ > No rows selected (0.348 seconds) > The query can be modified to find the matching row in a few ways, such as= the following (using between instead of '=3D', changing 100 to 100.0, or c= asting as decimal: > 0: jdbc:drill:zk=3Dlocal> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` w= here employee_id between 100 and 100; > +--------------+-------------+------------+--------+---------------+-----= ------+ > | EMPLOYEE_ID | FIRST_NAME | LAST_NAME | EMAIL | PHONE_NUMBER | = HIR | > +--------------+-------------+------------+--------+---------------+-----= ------+ > | 100 | Steven | King | SKING | 515.123.4567 | 2003= -06-1 | > +--------------+-------------+------------+--------+---------------+-----= ------+ > 1 row selected (0.226 seconds) > 0: jdbc:drill:zk=3Dlocal> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` w= here employee_id =3D 100.0; > +--------------+-------------+------------+--------+---------------+-----= ------+ > | EMPLOYEE_ID | FIRST_NAME | LAST_NAME | EMAIL | PHONE_NUMBER | = HIR | > +--------------+-------------+------------+--------+---------------+-----= ------+ > | 100 | Steven | King | SKING | 515.123.4567 | 2003= -06-1 | > +--------------+-------------+------------+--------+---------------+-----= ------+ > 1 row selected (0.259 seconds) > 0: jdbc:drill:zk=3Dlocal> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` w= here cast(employee_id AS DECIMAL) =3D 100; > +--------------+-------------+------------+--------+---------------+-----= ------+ > | EMPLOYEE_ID | FIRST_NAME | LAST_NAME | EMAIL | PHONE_NUMBER | = HIR | > +--------------+-------------+------------+--------+---------------+-----= ------+ > | 100 | Steven | King | SKING | 515.123.4567 | 2003= -06-1 | > +--------------+-------------+------------+--------+---------------+-----= ------+ > 1 row selected (0.232 seconds) > 0: jdbc:drill:zk=3Dlocal> > The schema of the parquet data that is being searched is as follows: > $ java -jar parquet-tools*1.jar meta c:/archiveHR/HR.EMPLOYEES/1.parquet > file: file:/c:/archiveHR/HR.EMPLOYEES/1.parquet > creator: parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae= 1339f28fd3f7fcf) > ..... > file schema: HR.EMPLOYEES > -------------------------------------------------------------------------= ------- > EMPLOYEE_ID: REQUIRED FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:0 > FIRST_NAME: OPTIONAL BINARY O:UTF8 R:0 D:1 > LAST_NAME: REQUIRED BINARY O:UTF8 R:0 D:0 > EMAIL: REQUIRED BINARY O:UTF8 R:0 D:0 > PHONE_NUMBER: OPTIONAL BINARY O:UTF8 R:0 D:1 > HIRE_DATE: REQUIRED BINARY O:UTF8 R:0 D:0 > JOB_ID: REQUIRED BINARY O:UTF8 R:0 D:0 > SALARY: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > COMMISSION_PCT: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > MANAGER_ID: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > DEPARTMENT_ID: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1 > row group 1: RC:107 TS:9943 OFFSET:4 > -------------------------------------------------------------------------= ------- > EMPLOYEE_ID: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:4 SZ:360/355/0.99 V= C:107 ENC:PLAIN,BIT_PACKED > FIRST_NAME: BINARY SNAPPY DO:0 FPO:364 SZ:902/1058/1.17 VC:107 ENC:P= LAIN_DICTIONARY,RLE,BIT_PACKED > LAST_NAME: BINARY SNAPPY DO:0 FPO:1266 SZ:913/1111/1.22 VC:107 ENC:= PLAIN,BIT_PACKED > EMAIL: BINARY SNAPPY DO:0 FPO:2179 SZ:977/1184/1.21 VC:107 ENC:= PLAIN,BIT_PACKED > PHONE_NUMBER: BINARY SNAPPY DO:0 FPO:3156 SZ:750/1987/2.65 VC:107 ENC:= PLAIN,RLE,BIT_PACKED > HIRE_DATE: BINARY SNAPPY DO:0 FPO:3906 SZ:874/2636/3.02 VC:107 ENC:= PLAIN_DICTIONARY,BIT_PACKED > JOB_ID: BINARY SNAPPY DO:0 FPO:4780 SZ:254/302/1.19 VC:107 ENC:P= LAIN_DICTIONARY,BIT_PACKED > SALARY: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5034 SZ:419/580/1.3= 8 VC:107 ENC:PLAIN,RLE,BIT_PACKED > COMMISSION_PCT: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5453 SZ:97/113/1.16= VC:107 ENC:PLAIN,RLE,BIT_PACKED > MANAGER_ID: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5550 SZ:168/363/2.1= 6 VC:107 ENC:PLAIN,RLE,BIT_PACKED > DEPARTMENT_ID: FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5718 SZ:94/254/2.70= VC:107 ENC:PLAIN,RLE,BIT_PACKED -- This message was sent by Atlassian JIRA (v6.3.4#6332)