drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4704) select statement behavior is inconsistent for decimal values in parquet
Date Tue, 12 Jul 2016 18:27:20 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15373401#comment-15373401
] 

ASF GitHub Bot commented on DRILL-4704:
---------------------------------------

Github user daveoshinsky commented on a diff in the pull request:

    https://github.com/apache/drill/pull/517#discussion_r70493981
  
    --- Diff: exec/java-exec/src/main/codegen/templates/Decimal/CastIntDecimal.java ---
    @@ -64,7 +64,15 @@ public void setup() {
     
         public void eval() {
             out.scale = (int) scale.value;
    -        out.precision = (int) precision.value;
    --- End diff --
    
    I tried doing it in the setup method, and it did not work. It seems the integer value
to be casted was not available at that time.  Since the number of divisions by 10 (these
are cheap integer divisions) is proportional to log10(the integer number), I don't see how
this is expensive, especially compared to the expense of reading the rows from whatever device.

    
        On Tuesday, July 12, 2016 1:56 PM, Aman Sinha <notifications@github.com> wrote:
     
    
     In exec/java-exec/src/main/codegen/templates/Decimal/CastIntDecimal.java:> @@ -64,7
+64,15 @@ public void setup() {
    >  
    >      public void eval() {
    >          out.scale = (int) scale.value;
    > -        out.precision = (int) precision.value;
    The eval() method is called for every row, so there will be a performance overhead. I
agree about the storage cost if we assume worst case precision; however the per-row evaluation
should be as lightweight as possible and likely outweighs the storage consideration. One option
is to treat the IntDecimal and BigIntDecimal separately. For IntDecimal the max int value
is 2^31 - 1 which has a max precision of 10. Similarly, there is an upper bound on the BigIntDecimal.
You could initialize the precision in the setup() method which is called once per new schema
(in the normal case the schema of that column does not change, this would avoid unnecessarily
computing the precision). —
    You are receiving this because you authored the thread.
    Reply to this email directly, view it on GitHub, or mute the thread.  
    
      


> select statement behavior is inconsistent for decimal values in parquet
> -----------------------------------------------------------------------
>
>                 Key: DRILL-4704
>                 URL: https://issues.apache.org/jira/browse/DRILL-4704
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.6.0
>         Environment: Windows 7 Pro, Java 1.8.0_91
>            Reporter: Dave Oshinsky
>             Fix For: Future
>
>
> A select statement that searches a parquet file for a decimal value matching a specific
value behaves inconsistently.  The query expressed most simply finds nothing:
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where employee_id
= 100;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  | HIRE_DATE |
> +--------------+-------------+------------+--------+---------------+-----------+
> +--------------+-------------+------------+--------+---------------+-----------+
> No rows selected (0.348 seconds)
> The query can be modified to find the matching row in a few ways, such as the following
(using between instead of '=', changing 100 to 100.0, or casting as decimal:
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where employee_id
between 100 and 100;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  |       HIR |
> +--------------+-------------+------------+--------+---------------+-----------+
> | 100          | Steven      | King       | SKING  | 515.123.4567  | 2003-06-1 |
> +--------------+-------------+------------+--------+---------------+-----------+
> 1 row selected (0.226 seconds)
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where employee_id
= 100.0;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  |       HIR |
> +--------------+-------------+------------+--------+---------------+-----------+
> | 100          | Steven      | King       | SKING  | 515.123.4567  | 2003-06-1 |
> +--------------+-------------+------------+--------+---------------+-----------+
> 1 row selected (0.259 seconds)
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where cast(employee_id
AS DECIMAL) = 100;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  |       HIR |
> +--------------+-------------+------------+--------+---------------+-----------+
> | 100          | Steven      | King       | SKING  | 515.123.4567  | 2003-06-1 |
> +--------------+-------------+------------+--------+---------------+-----------+
> 1 row selected (0.232 seconds)
> 0: jdbc:drill:zk=local>
> The schema of the parquet data that is being searched is as follows:
> $ java -jar parquet-tools*1.jar meta c:/archiveHR/HR.EMPLOYEES/1.parquet
> file:           file:/c:/archiveHR/HR.EMPLOYEES/1.parquet
> creator:        parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
> .....
> file schema:    HR.EMPLOYEES
> --------------------------------------------------------------------------------
> EMPLOYEE_ID:    REQUIRED FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:0
> FIRST_NAME:     OPTIONAL BINARY O:UTF8 R:0 D:1
> LAST_NAME:      REQUIRED BINARY O:UTF8 R:0 D:0
> EMAIL:          REQUIRED BINARY O:UTF8 R:0 D:0
> PHONE_NUMBER:   OPTIONAL BINARY O:UTF8 R:0 D:1
> HIRE_DATE:      REQUIRED BINARY O:UTF8 R:0 D:0
> JOB_ID:         REQUIRED BINARY O:UTF8 R:0 D:0
> SALARY:         OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> COMMISSION_PCT: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> MANAGER_ID:     OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> DEPARTMENT_ID:  OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:    RC:107 TS:9943 OFFSET:4
> --------------------------------------------------------------------------------
> EMPLOYEE_ID:     FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:4 SZ:360/355/0.99 VC:107 ENC:PLAIN,BIT_PACKED
> FIRST_NAME:      BINARY SNAPPY DO:0 FPO:364 SZ:902/1058/1.17 VC:107 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
> LAST_NAME:       BINARY SNAPPY DO:0 FPO:1266 SZ:913/1111/1.22 VC:107 ENC:PLAIN,BIT_PACKED
> EMAIL:           BINARY SNAPPY DO:0 FPO:2179 SZ:977/1184/1.21 VC:107 ENC:PLAIN,BIT_PACKED
> PHONE_NUMBER:    BINARY SNAPPY DO:0 FPO:3156 SZ:750/1987/2.65 VC:107 ENC:PLAIN,RLE,BIT_PACKED
> HIRE_DATE:       BINARY SNAPPY DO:0 FPO:3906 SZ:874/2636/3.02 VC:107 ENC:PLAIN_DICTIONARY,BIT_PACKED
> JOB_ID:          BINARY SNAPPY DO:0 FPO:4780 SZ:254/302/1.19 VC:107 ENC:PLAIN_DICTIONARY,BIT_PACKED
> SALARY:          FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5034 SZ:419/580/1.38 VC:107 ENC:PLAIN,RLE,BIT_PACKED
> COMMISSION_PCT:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5453 SZ:97/113/1.16 VC:107 ENC:PLAIN,RLE,BIT_PACKED
> MANAGER_ID:      FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5550 SZ:168/363/2.16 VC:107 ENC:PLAIN,RLE,BIT_PACKED
> DEPARTMENT_ID:   FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5718 SZ:94/254/2.70 VC:107 ENC:PLAIN,RLE,BIT_PACKED



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message