From: Jinfeng Ni
Date: Fri, 28 Oct 2016 13:46:04 -0700
Subject: Re: isDateCorrect field in ParquetTableMetadata
To: dev

Thanks for the explanation, Jason.

The three different values for DateCorruptionStatus make sense to me.

isDateCorrect = true means that the values are known to be correct.
isDateCorrect = false means that the values are known to be incorrect, or unclear?

On Fri, Oct 28, 2016 at 12:59 PM, Jason Altekruse wrote:

> The isDateCorrect flag means that the values are known to be correct, and there is no need to auto-detect corruption or correct anything.
>
> META_SHOWS_CORRUPTION can be set either when we have a known old version of Drill written in the metadata, or when we have older files that might have been written by Drill, where we have checked the values in the statistics and found corrupt-looking values. Really old files without any statistics don't have information that allows us to identify them as Drill-produced, so we have to test the values during actual page reads; this is where META_UNCLEAR_TEST_VALUES is used.
>
> Jason Altekruse
> Software Engineer at Dremio
> Apache Drill Committer
>
> On Fri, Oct 28, 2016 at 12:53 PM, Jinfeng Ni wrote:
>
>> Hi Vitalii,
>>
>> DateCorruptionStatus has three possibilities: META_SHOWS_CORRUPTION, META_SHOWS_NO_CORRUPTION, META_UNCLEAR_TEST_VALUES. What value will this isDateCorrect flag have for each possibility, especially for META_UNCLEAR_TEST_VALUES? Are DateCorruptionStatus and isDateCorrect the same thing, or different?
>>
>> Thanks.
>>
>> Jinfeng
>>
>> On Fri, Oct 28, 2016 at 9:26 AM, Paul Rogers wrote:
>> > Thanks Vitalii.
>> >
>> > The Parquet Writer solution “just works”. As soon as someone upgrades the writer, files are labeled as having that new version. No fuzziness during a release as in 1.9.
>> >
>> > It is fine to also include the Drill version. But format decisions should be keyed off of the writer version.
>> >
>> > By the way, do other tools happen to already do this? It would be rather surprising if they didn't.
>> >
>> > - Paul
>> >
>> >> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka wrote:
>> >>
>> >> I agree that it would be good to upgrade the approach to Parquet date correctness detection, so I created a Jira for it: DRILL-4980.
>> >>
>> >> But now we have two ideas:
>> >> 1. Additionally check the Drill version, so that later we can delete the isDateCorrect label from the Parquet metadata.
>> >> 2. Add the Parquet writer version to the Parquet metadata and check this value instead of isDateCorrect and drillVersion.
>> >>
>> >> So which way should we prefer now?
>> >>
>> >> Kind regards
>> >> Vitalii
>> >>
>> >> 2016-10-27 23:54 GMT+00:00 Paul Rogers :
>> >>
>> >>> FWIW: back on the magic flag issue…
>> >>>
>> >>> I noted Vitalii's concern about “1.9” and “1.9-SNAPSHOT” being too coarse-grained for our needs.
>> >>>
>> >>> A typical solution is to include the version of the Parquet writer in addition to that of Drill. Each time we change something in the writer, increment the version number. If we number changes, we can easily handle two changes in the same Drill release, or differentiate between the “early 1.9” files with old-style dates and “late 1.9” files with correct dates.
>> >>>
>> >>> Since we have no version now, start it at some arbitrary point (2?).
>> >>>
>> >>> Now, if the Parquet file has a Drill Writer version in the header, and that version is 2 or greater, the date is in the “correct” format. For anything written by Drill before writer version 2, the date is wrong.
>> >>> The “check the data to see if it is sane” approach is needed only for files where we can't tell if an older Drill wrote it.
>> >>>
>> >>> Do other tools label the data? Does Hive say that it wrote the file? If so, we don't need to do the sanity check if we can tell the data comes from Hive (or Impala, or anything other than old Drill.)
>> >>>
>> >>> - Paul
>> >>>
>> >>>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong wrote:
>> >>>>
>> >>>> Vitalii -- are you still planning to open a ticket and pull request for the fix you've noted below?
>> >>>>
>> >>>> -- Zelaine
>> >>>>
>> >>>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <vitalii.diravka@gmail.com> wrote:
>> >>>>
>> >>>>> @Paul Rogers
>> >>>>> There may be an undefined case when the file is generated with drill.version = 1.9-SNAPSHOT.
>> >>>>> It is easier to determine corrupted dates with this flag, and there is no need to wait for the end of the release to merge these changes.
>> >>>>>
>> >>>>> @Jinfeng Ni
>> >>>>> It looks like you are right.
>> >>>>> With the consistent mode (isDateCorrect = true) all tests pass. So I am going to open a Jira ticket for it with the following changes:
>> >>>>> https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e96187303410cac5d3
>> >>>>> Thanks.
>> >>>>>
>> >>>>> Kind regards
>> >>>>> Vitalii
>> >>>>>
>> >>>>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni :
>> >>>>>
>> >>>>>> I'm not sure if I fully understand your answers. The bottom line is quite simple: given a set of parquet files, the ParquetTableMeta instance constructed in Drill should have an identical value for "isDateCorrect", whether it comes from the parquet footer or the parquet metadata cache, and whether there is partition pruning or not.
>> >>>>>> However, the code shows that this flag is not set consistently across different cases.
>> >>>>>>
>> >>>>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka wrote:
>> >>>>>>> Hi Jinfeng,
>> >>>>>>>
>> >>>>>>> 1. If the parquet files are generated with Drill after DRILL-4203, these files have the "isDateCorrect = true" property. Drill serializes this property from metadata now. When we set this property in the first constructor we will hide the value from metadata. isDateCorrect will be false only if this value equals false (no case for it now) or is absent in the parquet metadata footer.
>> >>>>>>>
>> >>>>>>> 2. I'm not sure of the reason to change the isDateCorrect metadata property when the user disables date correction. If you have some use case it would be great if you could provide it.
>> >>>>>>>
>> >>>>>>> 3. Maybe you are right regarding when Parquet metadata is cloned. Here I added the property in the same manner as Jason's new property "drillVersion". So does it need a separate unit test?
>> >>>>>>>
>> >>>>>>> Kind regards
>> >>>>>>> Vitalii
>> >>>>>>>
>> >>>>>>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni :
>> >>>>>>>
>> >>>>>>>> Forgot to copy the link to the code.
>> >>>>>>>>
>> >>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L950-L955
>> >>>>>>>>
>> >>>>>>>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni wrote:
>> >>>>>>>>> @Jason, @Vitalii,
>> >>>>>>>>>
>> >>>>>>>>> Any thoughts on this question, since both of you worked on the fix for DRILL-4203?
>> >>>>>>>>>
>> >>>>>>>>> Looking through the code, there is a third case [1], where this flag is set to false when Parquet metadata is cloned (after partition pruning, etc). That means, for the 2nd case where the flag is set to true, if there is pruning happening, the new parquet metadata will see the flag flipped to false. This does not make sense to me.
>> >>>>>>>>>
>> >>>>>>>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni wrote:
>> >>>>>>>>>> Hello All,
>> >>>>>>>>>>
>> >>>>>>>>>> DRILL-4203 addressed the date field issue. In the fix, it introduced a new field in ParquetTableMetadata_v2: isDateCorrect. I have some difficulty in understanding the meaning of this field.
>> >>>>>>>>>>
>> >>>>>>>>>> According to [1], this field is set to false when Drill gets parquet metadata from the parquet footer. This field is set to true in the code flow of [2] and [3], when Drill gets parquet metadata from the metadata cache.
>> >>>>>>>>>>
>> >>>>>>>>>> Questions I have:
>> >>>>>>>>>> 1. If the parquet files are generated with Drill after DRILL-4203, does Drill still think the date field is NOT correct (isDateCorrect = false)?
>> >>>>>>>>>> 2. Why does this field have nothing to do with the "autoCorrection" flag [4]? If someone turns off autoCorrection, will it have an impact on this "isDateCorrect" flag?
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks in advance for any input,
>> >>>>>>>>>>
>> >>>>>>>>>> Jinfeng
>> >>>>>>>>>>
>> >>>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L932
>> >>>>>>>>>> [2] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L936
>> >>>>>>>>>> [3] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L187
>> >>>>>>>>>> [4] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L354-L355
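
For readers following the thread, the decision rules discussed above (Paul's writer-version proposal combined with Jason's description of the three DateCorruptionStatus values) can be sketched roughly as below. This is only an illustration and not Drill's actual code: the class DateCorruptionSketch, the method detect, its parameters, and the constant FIRST_CORRECT_WRITER_VERSION are hypothetical names, the starting version 2 is the tentative "(2?)" from the thread, and the branch that maps clean-looking statistics to META_SHOWS_NO_CORRUPTION is an assumption the thread does not spell out.

```java
import java.util.OptionalInt;

// Illustrative sketch only; names and signatures are hypothetical, not Drill's API.
final class DateCorruptionSketch {

    // The three states named in the thread.
    enum DateCorruptionStatus {
        META_SHOWS_CORRUPTION,
        META_SHOWS_NO_CORRUPTION,
        META_UNCLEAR_TEST_VALUES
    }

    // Assumed first writer version that stores correct dates ("(2?)" in the thread).
    static final int FIRST_CORRECT_WRITER_VERSION = 2;

    static DateCorruptionStatus detect(OptionalInt writerVersion,
                                       boolean writtenByKnownOldDrill,
                                       boolean hasStatistics,
                                       boolean statisticsLookCorrupt) {
        // Proposed rule: a writer version in the footer decides the question outright.
        if (writerVersion.isPresent()) {
            return writerVersion.getAsInt() >= FIRST_CORRECT_WRITER_VERSION
                    ? DateCorruptionStatus.META_SHOWS_NO_CORRUPTION
                    : DateCorruptionStatus.META_SHOWS_CORRUPTION;
        }
        // A known old Drill version recorded in the metadata means corrupt dates.
        if (writtenByKnownOldDrill) {
            return DateCorruptionStatus.META_SHOWS_CORRUPTION;
        }
        // No statistics: nothing to inspect, so values are tested during page reads.
        if (!hasStatistics) {
            return DateCorruptionStatus.META_UNCLEAR_TEST_VALUES;
        }
        // Statistics exist: corrupt-looking values indicate corruption.
        // Treating clean-looking statistics as "no corruption" is an assumption.
        return statisticsLookCorrupt
                ? DateCorruptionStatus.META_SHOWS_CORRUPTION
                : DateCorruptionStatus.META_SHOWS_NO_CORRUPTION;
    }

    public static void main(String[] args) {
        // A file labeled with writer version 2, and a very old file with no metadata hints.
        System.out.println(detect(OptionalInt.of(2), false, false, false));
        System.out.println(detect(OptionalInt.empty(), false, false, false));
    }
}
```

The point of checking the writer version first is the one Paul makes: once files carry it, the value sanity check is only needed for files whose origin cannot be determined.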