drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vitalii Diravka (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (DRILL-4763) Parquet file with DATE logical type produces wrong results for simple SELECT
Date Tue, 16 Aug 2016 15:54:20 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15422785#comment-15422785
] 

Vitalii Diravka edited comment on DRILL-4763 at 8/16/16 3:54 PM:
-----------------------------------------------------------------

The following logic is used for parquet dates in drill now (for example to calculate the first
day from unix epoch (1 January 1970)):
{code}
(Julian_day)*2 = unix_first_day
((4713_BE+1970)*365,26)*2 = 4881176
{code}
According to drill and parquet documentations should use the following logic:
{code}
unix_first_day = 0
{code}

*For example:*
Parquet file created from hive:
{code}
hive> select * from test_parquet;
OK
1970-01-05  17:51   Visakh
Time taken: 0.046 seconds, Fetched: 1 row(s)
{code}
{code}
vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar
cat /tmp/parquetFolder/test_parquet/
dt = 4
tm = 17:51
nm = Visakh
{code}
{code}
Running org.apache.drill.exec.store.parquet.columnreaders.TestDateReader#testParquetDate
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
1 row(s):
-------------------------------------------------------------------------------------------------------------------------------
| dt<DATE(OPTIONAL)>                      | tm<VARCHAR(OPTIONAL)>            
      | nm<VARCHAR(OPTIONAL)>                   |
-------------------------------------------------------------------------------------------------------------------------------
| -11395-10-22T00:00:00.000Z              | 17:51                                   | Visakh
                                 |
-------------------------------------------------------------------------------------------------------------------------------
Total record count: 1
{code}
Parquet file created from drill:
{code}
0: jdbc:drill:zk=local> select * from drill_parquet;
+---------------+-----------------+
| current_date  | unix_first_day  |
+---------------+-----------------+
| 2016-08-15    | 1970-01-01      |
+---------------+-----------------+
1 row selected (0.142 seconds)
{code}

{code}
vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar
cat /tmp/drill_parquet/
current_date = 4898204
unix_first_day = 4881176
{code}

*With fix:*
{code}
0: jdbc:drill:zk=local> create table drill_parquet_with_fix as SELECT current_date, CAST('1970-01-05'
as date) as unix_fifth_day, CAST('1970-01-01' as date) as unix_first_day FROM (VALUES(1));
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
1 row selected (0.257 seconds)
0: jdbc:drill:zk=local> select * from drill_parquet_with_fix;
+---------------+-----------------+-----------------+
| current_date  | unix_fifth_day  | unix_first_day  |
+---------------+-----------------+-----------------+
| 2016-08-15    | 1970-01-05      | 1970-01-01      |
+---------------+-----------------+-----------------+
1 row selected (0.174 seconds)
{code}
{code}
vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar
cat /tmp/drill_parquet_with_fix
current_date = 17028
unix_fifth_day = 4
unix_first_day = 0
{code}


was (Author: vitalii):
The following logic is used to calculate parquet date in drill now:
{code}
(Julian_day)*2 = unix_first_day
((4713_BE+1970)*365,26)*2 = 4881176
{code}
According to drill doc should use the following logic:
{code}
(Julian_day +_1970)*365,26 = unix_first_day
(4713_BE+1970)*365,26 = 2457615
{code}
According to parquet doc should use the following logic which is the right case:
{code}
unix_first_day = 0
{code}


*For example:*
Parquet file created from hive:
{code}
hive> select * from test_parquet;
OK
1970-01-05  17:51   Visakh
Time taken: 0.046 seconds, Fetched: 1 row(s)
{code}
{code}
vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar
cat /tmp/parquetFolder/test_parquet/
dt = 4
tm = 17:51
nm = Visakh
{code}
{code}
Running org.apache.drill.exec.store.parquet.columnreaders.TestDateReader#testParquetDate
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
1 row(s):
-------------------------------------------------------------------------------------------------------------------------------
| dt<DATE(OPTIONAL)>                      | tm<VARCHAR(OPTIONAL)>            
      | nm<VARCHAR(OPTIONAL)>                   |
-------------------------------------------------------------------------------------------------------------------------------
| -11395-10-22T00:00:00.000Z              | 17:51                                   | Visakh
                                 |
-------------------------------------------------------------------------------------------------------------------------------
Total record count: 1
{code}
Parquet file created from drill:
{code}
0: jdbc:drill:zk=local> select * from drill_parquet;
+---------------+-----------------+
| current_date  | unix_first_day  |
+---------------+-----------------+
| 2016-08-15    | 1970-01-01      |
+---------------+-----------------+
1 row selected (0.142 seconds)
{code}

{code}
vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar
cat /tmp/drill_parquet/
current_date = 4898204
unix_first_day = 4881176
{code}

*With fix:*
{code}
0: jdbc:drill:zk=local> create table drill_parquet_with_fix as SELECT current_date, CAST('1970-01-05'
as date) as unix_fifth_day, CAST('1970-01-01' as date) as unix_first_day FROM (VALUES(1));
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 1                          |
+-----------+----------------------------+
1 row selected (0.257 seconds)
0: jdbc:drill:zk=local> select * from drill_parquet_with_fix;
+---------------+-----------------+-----------------+
| current_date  | unix_fifth_day  | unix_first_day  |
+---------------+-----------------+-----------------+
| 2016-08-15    | 1970-01-05      | 1970-01-01      |
+---------------+-----------------+-----------------+
1 row selected (0.174 seconds)
{code}
{code}
vitalii@vitalii-pc:~/parquet-tools/parquet-mr/parquet-tools/target$ java -jar parquet-tools-1.6.0rc3-SNAPSHOT.jar
cat /tmp/drill_parquet_with_fix
current_date = 17028
unix_fifth_day = 4
unix_first_day = 0
{code}

> Parquet file with DATE logical type produces wrong results for simple SELECT
> ----------------------------------------------------------------------------
>
>                 Key: DRILL-4763
>                 URL: https://issues.apache.org/jira/browse/DRILL-4763
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Data Types
>    Affects Versions: 1.6.0
>            Reporter: Paul Rogers
>            Assignee: Vitalii Diravka
>         Attachments: date.parquet, int_16.parquet
>
>
> Created a simple Parquet file with the following schema:
> message test { required int32 index; required int32 value (DATE); required int32 raw;
}
> That is, a file with an int32 storage type and a DATE logical type. Then, created a number
of test values:
> 0 (which should be interpreted as 1970-01-01) and
> (int) (System.currentTimeMillis() / (24*60*60*1000) ) Which should be interpreted as
the number of days since 1970-01-01 and today.
> According to the Parquet spec (https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md),
Parquet dates are expressed as "the number of days from the Unix epoch, 1 January 1970."
> Java timestamps are expressed as "measured in milliseconds, between the current time
and midnight, January 1, 1970 UTC."
> There is ambiguity here: Parquet dates are presumably local times not absolute times,
so the math above will actually tell us the date in London right now, but that's close enough.
> Generate the local file to date.parquet. Query it with:
> SELECT * from `local`.`root`.`date.parquet`;
> The results are incorrect:
> index value raw
> 1	-11395-10-18T00:00:00.000-07:52:58	0
> Here, we have a value of 0. The displayed date is decidedly not 1970-01-01T00:00:00.
We actually have many problems:
> 1. The date is far off.
> 2. The output shows time. But, the Parquet DATE format explcitly does NOT include time,
so it makes no sense to include it.
> 3. The output attempts to show a time zone, but a time zone of -07:52:58, while close
to PST, is not right (there is no timezine that is of by 7:02 from UTC.)
> 4. The data has no time zone, Parquet DATE explicilty is a local time, so it is impossible
to know the relationship between that date an UTC.
> The correct output (in ISO format) would be: 1970-01-01
> The last line should be today's date, but instead is:
> 6	-11348-04-20T00:00:00.000-07:52:58	16986
> Expected:
> 2016-07-04
> Note that all the information to produce the right information is available to Drill:
> 1. The DATE annotation says the meaning of the signed 32-bit integer.
> 2. Given the starting point and duration in days, the conversion to Drill's own internal
date format is unambiguous.
> 3. The DATE annotation says that the date is local, so Drill should not attempt to convert
to UTC. (That is, a Java Date object can't be used, instead a Joda/Java 8 LocalDate is necessary.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message