Date: Wed, 27 Jan 2016 18:56:39 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: issues@drill.apache.org
Reply-To: dev@drill.apache.org
Subject: [jira] [Commented] (DRILL-4203) Parquet File : Date is stored wrongly

[ https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15119967#comment-15119967 ]

ASF GitHub Bot commented on DRILL-4203:
---------------------------------------

GitHub user jaltekruse opened a pull request:

    https://github.com/apache/drill/pull/341

DRILL-4203: fix dates written into parquet files to conform to parquet format spec

This branch includes an update of the version number to 1.5.0; this is required because we need a hard release to signal that
all future parquet files are not corrupted. Without this change the fixed files written by the writer would still be considered corrupt (because all other files generated by earlier commits with the version 1.5.0-SNAPSHOT are actually corrupted). This commit can be removed/amended when the changes are merged, but this patch should be immediately followed by a change of the version number, to avoid the risk of generating files with corrected date values but a version number that still tells the reader to shift the dates.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jaltekruse/incubator-drill 4203-parquet-dates-bug-squash2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/341.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #341

----
commit 3cbbe1c418ec8e802144f6cba1d88ede9de7f930
Author: Jason Altekruse
Date:   2015-12-31T16:22:04Z

    DRILL-4203: Fix date values written in parquet files created by Drill

    Drill was writing non-standard dates into parquet files for all releases
    before 1.5.0. The values have been read correctly by Drill, but
    external tools like Spark reading the files will see corrupted values for
    all dates that have been written by Drill.

    This change corrects the behavior of the Drill parquet writer to correctly
    store dates in the format given in the parquet specification.

    To maintain compatibility with old files, the parquet reader code has
    been updated to check for the old format and automatically
    shift the corrupted values into corrected ones.

    The test cases included here should ensure that all files produced by
    historical versions of Drill will continue to return the same values they
    had in previous releases.
    For compatibility with external tools, any old
    files with corrupted dates can be re-written using the CREATE TABLE AS
    command (as the writer will now only produce the specification-compliant
    values, even after reading out of older corrupt files).

    While the old behavior was a consistent shift into a range unlikely
    to be used in a modern database (over 10,000 years in the future), these are still
    valid date values. In the case where these may have been written into
    files intentionally, and we cannot be certain from the metadata if Drill
    produced the files, an option is included to turn off the auto-correction.
    Use of this option is assumed to be extremely unlikely, but it is included for
    completeness.

commit 9a3f3b8a3d599d3e8981c7b987f229809db8eec4
Author: Jason Altekruse
Date:   2016-01-27T18:20:01Z

    Fix DrillVersionInfo to make it provide a valid version number even during the unit tests.

    This is now a build-time generated class, rather than one that looks on the
    classpath for META-INF files.

    This pattern for file generation with parameters passed from the POM files
    was borrowed from parquet-mr.

commit fb4bc2271c625dd25729575fc77f117b2c1d0a72
Author: Jason Altekruse
Date:   2016-01-26T04:19:24Z

    Changing version of Drill to 1.5.0

    This isn't actually the 1.5.0 release, but the primary condition used
    to identify if corrected dates are stored in a parquet file is the Drill
    version included in the metadata. This version number is retrieved from the
    META-INF in the drill jar. This version number change is needed to
    make some of the regression tests pass; otherwise the 1.5.0-SNAPSHOT version will
    make the tests assume that the files are corrupt (as all commits before
    this one were writing corrupt dates).
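
The reader-side behaviour described in the commit messages above can be sketched roughly as follows. This is an illustration in Python, not Drill's actual Java code: the function names, the `auto_correct` flag and the version-parsing helper are all hypothetical, and the shift constant is an assumption inferred only from the single value in the quoted report (1970-01-01 stored as 4881176 where 0 is expected). Treating a file without Drill metadata as uncorrupted is also a simplification made for this sketch.

```python
import re

# Assumption: a constant offset, inferred from one data point in the report
# (1970-01-01 stored as 4881176 where the parquet spec expects 0).
ASSUMED_SHIFT_DAYS = 4881176


def parse_version(version):
    """Split '1.4.0' or '1.5.0-SNAPSHOT' into ((1, 4, 0), is_snapshot)."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(-SNAPSHOT)?", version)
    if match is None:
        raise ValueError(f"unrecognised version: {version!r}")
    major, minor, patch, snapshot = match.groups()
    return (int(major), int(minor), int(patch)), snapshot is not None


def written_with_shifted_dates(drill_version):
    """Hypothetical check: anything before the 1.5.0 release, including
    1.5.0-SNAPSHOT builds, is treated as holding shifted dates."""
    if drill_version is None:
        # No drill.version in the metadata: this sketch leaves values alone.
        return False
    numbers, is_snapshot = parse_version(drill_version)
    return numbers < (1, 5, 0) or (numbers == (1, 5, 0) and is_snapshot)


def read_date(raw_days, drill_version, auto_correct=True):
    """Return days since 1970-01-01, undoing the shift for old files.

    auto_correct=False stands in for the option that turns the
    auto-correction off.
    """
    if auto_correct and written_with_shifted_dates(drill_version):
        return raw_days - ASSUMED_SHIFT_DAYS
    return raw_days
```

With these assumptions, `read_date(4881176, "1.4.0")` gives 0, while `read_date(0, "1.5.0")` is returned unchanged. This also shows why the version bump matters: a file carrying `1.5.0-SNAPSHOT` would still be shifted.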
----

> Parquet File : Date is stored wrongly
> -------------------------------------
>
>                 Key: DRILL-4203
>                 URL: https://issues.apache.org/jira/browse/DRILL-4203
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.4.0
>            Reporter: Stéphane Trou
>            Assignee: Jason Altekruse
>            Priority: Critical
>
> Hello,
> I have some problems when I try to read parquet files produced by Drill with Spark: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +--------+-------------+
> | name   | epoch_date  |
> +--------+-------------+
> | Epoch  | 1970-01-01  |
> +--------+-------------+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet` as select columns[0] as name, cast(columns[1] as date) as epoch_date from dfs.tmp.`date_parquet.csv`;
> +-----------+----------------------------+
> | Fragment  | Number of records written  |
> +-----------+----------------------------+
> | 0_0       | 1                          |
> +-----------+----------------------------+
> {code}
> When I read the file with parquet-tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date], epoch_date should be equal to 0.
> Meta :
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file:        file:/tmp/buggy_parquet/0_0_0.parquet
> creator:     parquet-mr version 1.8.1-drill-r0 (build 6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
> extra:       drill.version = 1.4.0
> file schema: root
> --------------------------------------------------------------------------------
> name:        OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date:  OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4
> --------------------------------------------------------------------------------
> name:         BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 ENC:RLE,BIT_PACKED,PLAIN
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
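
As a quick check of the numbers in the report, the short Python sketch below interprets the stored INT32 value as days since the Unix epoch, which is how the parquet DATE type is defined. The value 4881176 is taken from the parquet-tools output quoted above; the helper name `days_to_date` is made up for illustration. The stored value happens to equal twice 2440588, the Julian Day Number of 1970-01-01. Whether that is the actual origin of the shift is not stated in this message, so the code records it only as an observation.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)
RAW_VALUE = 4881176        # epoch_date as printed by parquet-tools in the report
JDN_UNIX_EPOCH = 2440588   # Julian Day Number of 1970-01-01


def days_to_date(days):
    """Interpret an INT32 DATE value as days since the Unix epoch."""
    return EPOCH + timedelta(days=days)


# Rough distance into the future if the raw value is read per the spec.
approx_years_ahead = RAW_VALUE / 365.2425

# Observation only: the raw value equals twice the epoch's Julian Day Number.
shift_matches_twice_jdn = RAW_VALUE == 2 * JDN_UNIX_EPOCH

# Undoing that shift recovers the date that was originally written.
corrected = days_to_date(RAW_VALUE - 2 * JDN_UNIX_EPOCH)

# Python's date type stops at year 9999, so the uncorrected value cannot
# even be represented; adding the timedelta raises OverflowError.
try:
    days_to_date(RAW_VALUE)
    raw_fits_in_python_date = True
except OverflowError:
    raw_fits_in_python_date = False

print(corrected)                  # 1970-01-01
print(round(approx_years_ahead))
print(raw_fits_in_python_date)    # False
```

Read per the spec, the raw value lands more than 13,000 years after 1970, consistent with the "over 10,000 years in the future" remark in the first commit message.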