Subject: Re: isDateCorrect field in ParquetTableMetadata
From: Paul Rogers
Date: Fri, 28 Oct 2016 09:26:08 -0700
To: "dev@drill.apache.org"
Message-Id: <1466CA7E-E4E5-4D22-A655-5B82A5D58EB7@maprtech.com>

Thanks Vitalii.

The Parquet Writer solution “just works”. As soon as someone upgrades the
writer, files are labeled as having that new version. No fuzziness during a
release as in 1.9.

It is fine to also include the Drill version. But, format decisions should
be keyed off of the writer version.

By the way, do other tools happen to already do this? It would be rather
surprising if they didn’t.

- Paul

> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka wrote:
>
> I agree that it would be good if the approach of parquet date correctness
> detection will be upgraded. So I created the jira for it: DRILL-4980.
>
> But now we have two ideas:
> 1. To add checking of the drill version additionally, so later we can
> delete isDateCorrect label from parquet metadata.
> 2. To add parquet writer version to the parquet metadata and check this
> value instead of isDateCorrect and drillVersion.
>
> So which way should we prefer now?
>
> Kind regards
> Vitalii
>
> 2016-10-27 23:54 GMT+00:00 Paul Rogers:
>
>> FWIW: back on the magic flag issue…
>>
>> I noted Vitalii’s concern about “1.9” and “1.9-SNAPSHOT” being too coarse
>> grained for our needs.
>>
>> A typical solution is to include the version of the Parquet writer in
>> addition to that of Drill. Each time we change something in the writer,
>> increment the version number. If we number changes, we can easily handle
>> two changes in the same Drill release, or differentiate between the “early
>> 1.9” files with old-style dates and “late 1.9” files with correct dates.
>>
>> Since we have no version now, start it at some arbitrary point (2?).
>>
>> Now, if the Parquet file has a Drill Writer version in the header, and
>> that version is 2 or greater, the date is in the “correct” format. Anything
>> written by Drill before writer version 2, the date is wrong. The “check the
>> data to see if it is sane” approach is needed only for files where we can’t
>> tell if an older Drill wrote it.
>>
>> Do other tools label the data? Does Hive say that it wrote the file? If
>> so, we don’t need to do the sanity check if we can tell the data comes from
>> Hive (or Impala, or anything other than old Drill.)
>>
>> - Paul
>>
>>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong wrote:
>>>
>>> Vitalii -- are you still planning to open a ticket and pull request for
>>> the fix you've noted below?
>>>
>>> -- Zelaine
>>>
>>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <vitalii.diravka@gmail.com>
>>> wrote:
>>>
>>>> @Paul Rogers
>>>> It may be the undefined case when the file is generated with
>>>> drill.version = 1.9-SNAPSHOT.
>>>> It is easier to determine corrupted dates with this flag and there is no
>>>> need to wait for the end of the release to merge these changes.
>>>>
>>>> @Jinfeng NI
>>>> It looks like you are right.
>>>> With consistent mode (isDateCorrect = true) all tests are passed.
>>>> So I am going to open a jira ticket for it with next changes
>>>> https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e96187303410cac5d3
>>>> Thanks.
>>>>
>>>> Kind regards
>>>> Vitalii
>>>>
>>>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni:
>>>>
>>>>> I'm not sure if I fully understand your answers. The bottom line is
>>>>> quite simple: given a set of parquet files, the ParquetTableMeta
>>>>> instance constructed in Drill should have identical value for
>>>>> "isDateCorrect", whether it comes from parquet footer, or parquet
>>>>> metadata cache, or whether there is partition pruning or not. However,
>>>>> the code shows that this flag is not in consistent mode across
>>>>> different cases.
>>>>>
>>>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka wrote:
>>>>>> Hi Jinfeng,
>>>>>>
>>>>>> 1. If the parquet files are generated with Drill after DRILL-4203 these
>>>>>> files have "isDateCorrect = true" property.
>>>>>> Drill serializes this property from metadata now. When we set this
>>>>>> property in the first constructor we will hide the value from metadata.
>>>>>> isDateCorrect will be false only if this value equals false (no
>>>>>> case for it now) or is absent in the parquet metadata footer.
>>>>>>
>>>>>> 2. I'm not sure of the reason to change the isDateCorrect metadata
>>>>>> property when the user disables dates correction.
>>>>>> If you have some use case it would be great if you provide it.
>>>>>>
>>>>>> 3. Maybe you are right regarding when Parquet metadata is cloned.
>>>>>> Here I added the property in the same manner as Jason's new property
>>>>>> "drillVersion". So does it need a separate unit test?
>>>>>>
>>>>>> Kind regards
>>>>>> Vitalii
>>>>>>
>>>>>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni:
>>>>>>
>>>>>>> Forgot to copy the link to the code.
>>>>>>>
>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L950-L955
>>>>>>>
>>>>>>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni wrote:
>>>>>>>> @Jason, @Vitalii,
>>>>>>>>
>>>>>>>> Any thoughts on this question, since both of you worked on the fix
>>>>>>>> for DRILL-4203?
>>>>>>>>
>>>>>>>> Looking through the code, there is a third case [1], where this flag
>>>>>>>> is set to false when Parquet metadata is cloned (after partition
>>>>>>>> pruning, etc). That means, for the 2nd case where the flag is set to
>>>>>>>> true, if there is pruning happening, the new parquet metadata will see
>>>>>>>> the flag is flipped to false. This does not make sense to me.
>>>>>>>>
>>>>>>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni wrote:
>>>>>>>>> Hello All,
>>>>>>>>>
>>>>>>>>> DRILL-4203 addressed the date field issue. In the fix, it introduced
>>>>>>>>> a new field in ParquetTableMetadata_v2: isDateCorrect. I have some
>>>>>>>>> difficulty in understanding the meaning of this field.
>>>>>>>>>
>>>>>>>>> According to [1], this field is set to false when Drill gets parquet
>>>>>>>>> metadata from the parquet footer. This field is set to true in the
>>>>>>>>> code flow of [2] and [3], when Drill gets parquet metadata from the
>>>>>>>>> metadata cache.
>>>>>>>>>
>>>>>>>>> Questions I have:
>>>>>>>>> 1. If the parquet files are generated with Drill after DRILL-4203,
>>>>>>>>> Drill still thinks date field is NOT correct (isDateCorrect = false)?
>>>>>>>>> 2. Why does this field have nothing to do with the "autoCorrection"
>>>>>>>>> flag [4]? If someone turns off autoCorrection, will it have an impact
>>>>>>>>> on this "isDateCorrect" flag?
>>>>>>>>>
>>>>>>>>> Thanks in advance for any input,
>>>>>>>>>
>>>>>>>>> Jinfeng
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L932
>>>>>>>>> [2] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L936
>>>>>>>>> [3] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L187
>>>>>>>>> [4] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L354-L355
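
For reference, the writer-version rule Paul describes in this thread can be
sketched roughly as below. This is only an illustration under stated
assumptions, not Drill's actual code: the class name, the enum, and the
"writer.version" key are hypothetical. Only "drill.version" and the starting
version of 2 come from the thread. The sketch also ignores the existing
isDateCorrect flag, so it treats any Drill-written file without a writer
version as having old-style dates.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: key names and the version threshold are illustrative assumptions. */
final class DateCorrectnessSketch {

    // "drill.version" is mentioned in the thread; "writer.version" is a hypothetical key.
    static final String DRILL_VERSION_KEY = "drill.version";
    static final String WRITER_VERSION_KEY = "writer.version";

    // The thread suggests starting the writer version at an arbitrary point such as 2.
    static final int FIRST_CORRECT_WRITER_VERSION = 2;

    enum DateStatus { CORRECT, CORRUPT, NEEDS_SANITY_CHECK }

    /** Classifies a file from its footer key/value metadata (modeled here as a plain map). */
    static DateStatus classify(Map<String, String> footer) {
        String writerVersion = footer.get(WRITER_VERSION_KEY);
        if (writerVersion != null) {
            try {
                int version = Integer.parseInt(writerVersion.trim());
                // Writer version 2 or greater: dates are in the correct format.
                return version >= FIRST_CORRECT_WRITER_VERSION
                        ? DateStatus.CORRECT
                        : DateStatus.CORRUPT;
            } catch (NumberFormatException e) {
                // Unreadable version label: fall back to inspecting the data.
                return DateStatus.NEEDS_SANITY_CHECK;
            }
        }
        if (footer.containsKey(DRILL_VERSION_KEY)) {
            // Assumption: written by Drill before writer versioning existed.
            return DateStatus.CORRUPT;
        }
        // No sign of who wrote the file, so only the data itself can tell.
        return DateStatus.NEEDS_SANITY_CHECK;
    }

    public static void main(String[] args) {
        Map<String, String> footer = new HashMap<>();
        footer.put(DRILL_VERSION_KEY, "1.9-SNAPSHOT");
        System.out.println(classify(footer)); // Drill version only

        footer.put(WRITER_VERSION_KEY, "2");
        System.out.println(classify(footer)); // writer version present
    }
}
```

With a rule like this, the "early 1.9" versus "late 1.9" ambiguity no longer
depends on the Drill version string: the decision is keyed off the writer
version alone, and the sanity check is left for files with no usable label.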