drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4704) select statement behavior is inconsistent for decimal values in parquet
Date Fri, 05 Aug 2016 02:44:20 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408789#comment-15408789
] 

ASF GitHub Bot commented on DRILL-4704:
---------------------------------------

Github user daveoshinsky commented on the issue:

    https://github.com/apache/drill/pull/517
  
    Thanks for looking at this, Paul.  If there are problems with comparing an input decimal
(from the "where clause", for example) against the data, it is because of the algorithms for
comparing decimal values of different precisions, not because the proper precision was computed
for the input decimal (e.g., based on the actual numerical value, as my PR-517 fix is doing
when no precision is provided initially).  I was going to comment about the idea of using
a fixed value of precision=10, but you beat me to it.  It does not suffice for larger integers,
like those that require a "long" instead of an "int" to represent.  So, I suggest that we
take a "divide and conquer" approach with this problem.  First, compute a proper precision
for the value casted from int or BigInt.  Next, make sure the comparison with other decimal
values works for any other decimal value precision (a separate issue).
    Decimal support in Drill is disabled by default for good reasons.  The (decimal) design
itself is fundamentally flawed, and it has all sorts of issues related to code maintainability
as well as runtime performance.  I have been experimenting in a Drill "play build node" with
having a single, one-size-fits-all variable width vector class that represents any decimal
value efficiently (I call it "VarDecimal").  It's difficult to even get this to compile (with
all of the FreeMarker use, and the numerous conversions between all of the decimal and non-decimal
types), but eventually I hope to experiment with this to deal with the DRILL-4184 problem
I encountered a number of months ago (for which I have a pull request PR-372 with a short-term,
but not a clean, fix).
    As food for thought, while looking at this experimental "VarDecimal" thing, I noticed
that there are (numerous) casts/conversions between all of the pairwise combinations of different
decimal precisions (DECIMAL9, DECIMAL18, DECIMAL28, DECIMAL38, both dense and sparse, from
what I recall), and between all of these and all other types that have numeric interpretation.
 Eventually replacing all of these different DECIMAL* fixed width representations with a
single VarDecimal variable width representation (one-size-fits-all) would be much more efficient
(memory-wise) at runtime (for typical scenarios where most actual numeric values don't require
the full precision to represent), would greatly simplify parts of the Drill code, and would
fix DRILL-4184 cleanly. 
    Until some major re-work can be done for Drill decimal support, we will probably have
to settle for small, incremental improvements.  I would say that PR-517 represents one such
improvement.
    
        On Thursday, August 4, 2016 9:31 PM, Paul Rogers <notifications@github.com>
wrote:
     
    
     The plot thickens. I tried the fix of setting the precision to a constant of 10. This
uncovered a larger issue. The template in question generates cast functions for (INTEGER,
BIGINT) x (DECIMAL9, DECIMAL18, DECIMAL28) and perhaps others. The constant of 10 does not,
of course, work for BIGINT (long) values.The trick is that precision=10 won't work for DECIMAL9
either. Dave's solution has a similar problem. Dave sets the precision to whatever is right
for the input value, which seems great. But, that value could be too large for the output
DECIMAL type.What we need is to set the precision to the min( max int precision, max decimal
precision ). Or, if we use Dave's proposed solution, max( input arg precision, max decimal
precision ).In either case, the code must handle overflow. Passing a Long.MAX_VALUE or even
Integer.MAX_VALUE to CastBigIntDecimal9( ) should cause an overflow error or data truncation.
I'll research how that worked previously to see if we've uncovered a new issue, or if a solution
already exists.—
    You are receiving this because you authored the thread.
    Reply to this email directly, view it on GitHub, or mute the thread.  
    
      


> select statement behavior is inconsistent for decimal values in parquet
> -----------------------------------------------------------------------
>
>                 Key: DRILL-4704
>                 URL: https://issues.apache.org/jira/browse/DRILL-4704
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.6.0
>         Environment: Windows 7 Pro, Java 1.8.0_91
>            Reporter: Dave Oshinsky
>             Fix For: Future
>
>
> A select statement that searches a parquet file for a decimal value matching a specific
value behaves inconsistently.  The query expressed most simply finds nothing:
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where employee_id
= 100;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  | HIRE_DATE |
> +--------------+-------------+------------+--------+---------------+-----------+
> +--------------+-------------+------------+--------+---------------+-----------+
> No rows selected (0.348 seconds)
> The query can be modified to find the matching row in a few ways, such as the following
(using between instead of '=', changing 100 to 100.0, or casting as decimal:
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where employee_id
between 100 and 100;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  |       HIR |
> +--------------+-------------+------------+--------+---------------+-----------+
> | 100          | Steven      | King       | SKING  | 515.123.4567  | 2003-06-1 |
> +--------------+-------------+------------+--------+---------------+-----------+
> 1 row selected (0.226 seconds)
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where employee_id
= 100.0;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  |       HIR |
> +--------------+-------------+------------+--------+---------------+-----------+
> | 100          | Steven      | King       | SKING  | 515.123.4567  | 2003-06-1 |
> +--------------+-------------+------------+--------+---------------+-----------+
> 1 row selected (0.259 seconds)
> 0: jdbc:drill:zk=local> select * from dfs.`c:/archiveHR/HR.EMPLOYEES` where cast(employee_id
AS DECIMAL) = 100;
> +--------------+-------------+------------+--------+---------------+-----------+
> | EMPLOYEE_ID  | FIRST_NAME  | LAST_NAME  | EMAIL  | PHONE_NUMBER  |       HIR |
> +--------------+-------------+------------+--------+---------------+-----------+
> | 100          | Steven      | King       | SKING  | 515.123.4567  | 2003-06-1 |
> +--------------+-------------+------------+--------+---------------+-----------+
> 1 row selected (0.232 seconds)
> 0: jdbc:drill:zk=local>
> The schema of the parquet data that is being searched is as follows:
> $ java -jar parquet-tools*1.jar meta c:/archiveHR/HR.EMPLOYEES/1.parquet
> file:           file:/c:/archiveHR/HR.EMPLOYEES/1.parquet
> creator:        parquet-mr version 1.8.1 (build 4aba4dae7bb0d4edbcf7923ae1339f28fd3f7fcf)
> .....
> file schema:    HR.EMPLOYEES
> --------------------------------------------------------------------------------
> EMPLOYEE_ID:    REQUIRED FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:0
> FIRST_NAME:     OPTIONAL BINARY O:UTF8 R:0 D:1
> LAST_NAME:      REQUIRED BINARY O:UTF8 R:0 D:0
> EMAIL:          REQUIRED BINARY O:UTF8 R:0 D:0
> PHONE_NUMBER:   OPTIONAL BINARY O:UTF8 R:0 D:1
> HIRE_DATE:      REQUIRED BINARY O:UTF8 R:0 D:0
> JOB_ID:         REQUIRED BINARY O:UTF8 R:0 D:0
> SALARY:         OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> COMMISSION_PCT: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> MANAGER_ID:     OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> DEPARTMENT_ID:  OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1:    RC:107 TS:9943 OFFSET:4
> --------------------------------------------------------------------------------
> EMPLOYEE_ID:     FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:4 SZ:360/355/0.99 VC:107 ENC:PLAIN,BIT_PACKED
> FIRST_NAME:      BINARY SNAPPY DO:0 FPO:364 SZ:902/1058/1.17 VC:107 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
> LAST_NAME:       BINARY SNAPPY DO:0 FPO:1266 SZ:913/1111/1.22 VC:107 ENC:PLAIN,BIT_PACKED
> EMAIL:           BINARY SNAPPY DO:0 FPO:2179 SZ:977/1184/1.21 VC:107 ENC:PLAIN,BIT_PACKED
> PHONE_NUMBER:    BINARY SNAPPY DO:0 FPO:3156 SZ:750/1987/2.65 VC:107 ENC:PLAIN,RLE,BIT_PACKED
> HIRE_DATE:       BINARY SNAPPY DO:0 FPO:3906 SZ:874/2636/3.02 VC:107 ENC:PLAIN_DICTIONARY,BIT_PACKED
> JOB_ID:          BINARY SNAPPY DO:0 FPO:4780 SZ:254/302/1.19 VC:107 ENC:PLAIN_DICTIONARY,BIT_PACKED
> SALARY:          FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5034 SZ:419/580/1.38 VC:107 ENC:PLAIN,RLE,BIT_PACKED
> COMMISSION_PCT:  FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5453 SZ:97/113/1.16 VC:107 ENC:PLAIN,RLE,BIT_PACKED
> MANAGER_ID:      FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5550 SZ:168/363/2.16 VC:107 ENC:PLAIN,RLE,BIT_PACKED
> DEPARTMENT_ID:   FIXED_LEN_BYTE_ARRAY SNAPPY DO:0 FPO:5718 SZ:94/254/2.70 VC:107 ENC:PLAIN,RLE,BIT_PACKED



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message