drill-dev mailing list archives

From Jason Altekruse <ja...@dremio.com>
Subject Re: isDateCorrect field in ParquetTableMetadata
Date Fri, 28 Oct 2016 19:59:18 GMT
The isDateCorrect flag means that the values are known to be correct, and
there is no need to auto-detect corruption or correct anything.

META_SHOWS_CORRUPTION can be set either when we have a known old version of
Drill written in the metadata, or when we have older files that might have
been written by Drill and whose statistics we have checked and found to
contain corrupt-looking values. Really old files without any statistics
don't have information that allows us to identify them as Drill-produced,
so we have to test the values during actual page reads; this is where
META_UNCLEAR_TEST_VALUES is used.
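
The three cases above can be sketched roughly as follows. This is an illustration only, not Drill's actual code: the class name, the `detect` method, its parameters, and the year-5000 cutoff for "corrupt-looking" values are all assumptions made for the example, as is the choice to treat sane-looking statistics as "no corruption".

```java
import java.time.LocalDate;
import java.util.List;

/** Hypothetical sketch of the decision described above; not Drill's implementation. */
final class DateCorruptionSketch {

    /** The three outcomes named in this thread. */
    enum DateCorruptionStatus {
        META_SHOWS_CORRUPTION,
        META_SHOWS_NO_CORRUPTION,
        META_UNCLEAR_TEST_VALUES
    }

    // Assumed cutoff: a DATE (days since epoch) beyond year 5000 is "corrupt looking".
    static final long CORRUPT_LOOKING_THRESHOLD = LocalDate.of(5000, 1, 1).toEpochDay();

    /**
     * @param isDateCorrect        flag from the metadata; true means values are known to be correct
     * @param knownOldDrillVersion true if the metadata names a Drill version known to write bad dates
     * @param dateColumnMaxStats   max statistics of DATE columns; empty if the file has no statistics
     */
    static DateCorruptionStatus detect(boolean isDateCorrect,
                                       boolean knownOldDrillVersion,
                                       List<Long> dateColumnMaxStats) {
        if (isDateCorrect) {
            // Known-good values: nothing to detect or correct.
            return DateCorruptionStatus.META_SHOWS_NO_CORRUPTION;
        }
        if (knownOldDrillVersion) {
            return DateCorruptionStatus.META_SHOWS_CORRUPTION;
        }
        if (dateColumnMaxStats.isEmpty()) {
            // No statistics: the decision has to be deferred to page reads.
            return DateCorruptionStatus.META_UNCLEAR_TEST_VALUES;
        }
        for (long max : dateColumnMaxStats) {
            if (max > CORRUPT_LOOKING_THRESHOLD) {
                return DateCorruptionStatus.META_SHOWS_CORRUPTION;
            }
        }
        // Assumption of this sketch: sane-looking statistics mean no corruption.
        return DateCorruptionStatus.META_SHOWS_NO_CORRUPTION;
    }
}
```

With these assumptions, a file carrying isDateCorrect = true never reaches the statistics check, and only files with no statistics at all end up in META_UNCLEAR_TEST_VALUES.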

Jason Altekruse
Software Engineer at Dremio
Apache Drill Committer

On Fri, Oct 28, 2016 at 12:53 PM, Jinfeng Ni <jni@apache.org> wrote:

> Hi Vitalii,
>
> DateCorruptionStatus has three possibilities: META_SHOWS_CORRUPTION,
> META_SHOWS_NO_CORRUPTION, META_UNCLEAR_TEST_VALUES.  What value will
> this isDateCorrect flag have for each possibility, especially for
> META_UNCLEAR_TEST_VALUES? Are DateCorruptionStatus and isDateCorrect
> the same thing, or different?
>
> Thanks.
>
> Jinfeng
>
>
>
> On Fri, Oct 28, 2016 at 9:26 AM, Paul Rogers <progers@maprtech.com> wrote:
> > Thanks Vitalii.
> >
> > The Parquet Writer solution “just works”. As soon as someone upgrades
> > the writer, files are labeled as having that new version. No fuzziness
> > during a release as in 1.9.
> >
> > It is fine to also include the Drill version. But, format decisions
> > should be keyed off of the writer version.
> >
> > By the way, do other tools happen to already do this? It would be rather
> > surprising if they didn’t.
> >
> > - Paul
> >
> >> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka <vitalii.diravka@gmail.com> wrote:
> >>
> >> I agree that it would be good to improve the approach to Parquet date
> >> correctness detection, so I created a Jira for it: DRILL-4980
> >> <https://issues.apache.org/jira/browse/DRILL-4980>.
> >>
> >> But now we have two ideas:
> >> 1. Additionally check the Drill version, so that later we can
> >> delete the isDateCorrect label from the Parquet metadata.
> >> 2. Add the Parquet writer version to the Parquet metadata and check this
> >> value instead of isDateCorrect and drillVersion.
> >>
> >> So which way should we prefer now?
> >>
> >> Kind regards
> >> Vitalii
> >>
> >> 2016-10-27 23:54 GMT+00:00 Paul Rogers <progers@maprtech.com>:
> >>
> >>> FWIW: back on the magic flag issue…
> >>>
> >>> I noted Vitalii’s concern about “1.9” and “1.9-SNAPSHOT” being too
> >>> coarse-grained for our needs.
> >>>
> >>> A typical solution is to include the version of the Parquet writer in
> >>> addition to that of Drill. Each time we change something in the writer,
> >>> increment the version number. If we number changes, we can easily handle
> >>> two changes in the same Drill release, or differentiate between the
> >>> “early 1.9” files with old-style dates and “late 1.9” files with correct
> >>> dates.
> >>>
> >>> Since we have no version now, start it at some arbitrary point (2?).
> >>>
> >>> Now, if the Parquet file has a Drill Writer version in the header, and
> >>> that version is 2 or greater, the date is in the “correct” format. For
> >>> anything written by Drill before writer version 2, the date is wrong. The
> >>> “check the data to see if it is sane” approach is needed only for files
> >>> where we can’t tell if an older Drill wrote it.
> >>>
> >>> Do other tools label the data? Does Hive say that it wrote the file? If
> >>> so, we don’t need to do the sanity check if we can tell the data comes
> >>> from Hive (or Impala, or anything other than old Drill.)
> >>>
> >>> - Paul
> >>>
> >>>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong <zfong@maprtech.com> wrote:
> >>>>
> >>>> Vitalii -- are you still planning to open a ticket and pull request for
> >>>> the fix you've noted below?
> >>>>
> >>>> -- Zelaine
> >>>>
> >>>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka
> >>>> <vitalii.diravka@gmail.com> wrote:
> >>>>
> >>>>> @Paul Rogers
> >>>>> It may be an undefined case when the file is generated with
> >>>>> drill.version = 1.9-SNAPSHOT.
> >>>>> It is easier to detect corrupted dates with this flag, and there is no
> >>>>> need to wait for the end of the release to merge these changes.
> >>>>>
> >>>>> @Jinfeng Ni
> >>>>> It looks like you are right.
> >>>>> With consistent mode (isDateCorrect = true) all tests pass. So I am
> >>>>> going to open a Jira ticket for it with the following changes:
> >>>>> https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e96187303410cac5d3
> >>>>> Thanks.
> >>>>>
> >>>>> Kind regards
> >>>>> Vitalii
> >>>>>
> >>>>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni <jni@apache.org>:
> >>>>>
> >>>>>> I'm not sure if I fully understand your answers. The bottom line is
> >>>>>> quite simple: given a set of parquet files, the ParquetTableMetadata
> >>>>>> instance constructed in Drill should have an identical value for
> >>>>>> "isDateCorrect", whether it comes from the parquet footer or the
> >>>>>> parquet metadata cache, and whether there is partition pruning or not.
> >>>>>> However, the code shows that this flag is not set consistently across
> >>>>>> different cases.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
> >>>>>> <vitalii.diravka@gmail.com> wrote:
> >>>>>>> Hi Jinfeng,
> >>>>>>>
> >>>>>>> 1. If the parquet files are generated with Drill after DRILL-4203,
> >>>>>>> these files have the "isDateCorrect = true" property.
> >>>>>>> Drill serializes this property from metadata now. When we set this
> >>>>>>> property in the first constructor we will hide the value from metadata.
> >>>>>>> isDateCorrect will be false only if this value is false (there is no
> >>>>>>> such case now) or absent from the parquet metadata footer.
> >>>>>>>
> >>>>>>>
> >>>>>>> 2. I'm not sure of the reason to change the isDateCorrect metadata
> >>>>>>> property when the user disables date correction.
> >>>>>>> If you have a use case, it would be great if you could provide it.
> >>>>>>>
> >>>>>>> 3. Maybe you are right regarding when Parquet metadata is cloned.
> >>>>>>> Here I added the property in the same manner as Jason's new property
> >>>>>>> "drillVersion". Does it need a separate unit test?
> >>>>>>>
> >>>>>>>
> >>>>>>> Kind regards
> >>>>>>> Vitalii
> >>>>>>>
> >>>>>>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni <jni@apache.org>:
> >>>>>>>
> >>>>>>>> Forgot to copy the link to the code.
> >>>>>>>>
> >>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L950-L955
> >>>>>>>>
> >>>>>>>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni <jni@apache.org> wrote:
> >>>>>>>>> @Jason, @Vitalii,
> >>>>>>>>>
> >>>>>>>>> Any thoughts on this question, since you both worked on the fix for
> >>>>>>>>> DRILL-4203?
> >>>>>>>>>
> >>>>>>>>> Looking through the code, there is a third case [1], where this flag
> >>>>>>>>> is set to false when Parquet metadata is cloned (after partition
> >>>>>>>>> pruning, etc).  That means, for the 2nd case where the flag is set to
> >>>>>>>>> true, if there is pruning happening, the new parquet metadata will
> >>>>>>>>> see the flag flipped to false. This does not make sense to me.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni <jni@apache.org> wrote:
> >>>>>>>>>> Hello All,
> >>>>>>>>>>
> >>>>>>>>>> DRILL-4203 addressed the date field issue.  In the fix, it introduced
> >>>>>>>>>> a new field in ParquetTableMetadata_v2: isDateCorrect.  I have some
> >>>>>>>>>> difficulty in understanding the meaning of this field.
> >>>>>>>>>>
> >>>>>>>>>> According to [1], this field is set to false when Drill gets parquet
> >>>>>>>>>> metadata from the parquet footer.  This field is set to true in the
> >>>>>>>>>> code flow of [2] and [3], when Drill gets parquet metadata from the
> >>>>>>>>>> metadata cache.
> >>>>>>>>>>
> >>>>>>>>>> Questions I have:
> >>>>>>>>>> 1.  If the parquet files are generated with Drill after DRILL-4203,
> >>>>>>>>>> does Drill still think the date field is NOT correct (isDateCorrect =
> >>>>>>>>>> false)?
> >>>>>>>>>> 2.  Why does this field have nothing to do with the "autoCorrection"
> >>>>>>>>>> flag [4]?  If someone turns off autoCorrection, will it have an
> >>>>>>>>>> impact on this "isDateCorrect" flag?
> >>>>>>>>>>
> >>>>>>>>>> Thanks in advance for any input,
> >>>>>>>>>>
> >>>>>>>>>> Jinfeng
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L932
> >>>>>>>>>> [2] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L936
> >>>>>>>>>> [3] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L187
> >>>>>>>>>> [4] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L354-L355
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>
> >>>
> >
>
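
The writer-version idea raised earlier in the thread (dates are correct from writer version 2 onward, wrong for anything older written by Drill, and value-checked only when the origin is unknown) could look something like the sketch below. Everything here is hypothetical: the class, the `Origin` and `DateFormat` enums, and the `classify` method are invented for illustration, and "2" is simply the arbitrary starting version suggested in the discussion.

```java
import java.util.OptionalInt;

/** Hypothetical sketch of keying date handling off a Parquet writer version. */
final class WriterVersionSketch {

    /** Who wrote the file, as far as the metadata lets us tell. */
    enum Origin { DRILL, OTHER_TOOL, UNKNOWN }

    /** What the reader should assume about DATE values. */
    enum DateFormat { CORRECT, CORRUPT, CHECK_VALUES }

    // First writer version that writes correct dates (arbitrary start proposed in the thread).
    static final int FIRST_CORRECT_WRITER_VERSION = 2;

    /**
     * @param origin        producer of the file, if it can be identified
     * @param writerVersion Drill writer version from the file header; empty if absent
     */
    static DateFormat classify(Origin origin, OptionalInt writerVersion) {
        if (writerVersion.isPresent()
                && writerVersion.getAsInt() >= FIRST_CORRECT_WRITER_VERSION) {
            return DateFormat.CORRECT;
        }
        if (origin == Origin.DRILL) {
            // Written by Drill before writer version 2: dates are wrong.
            return DateFormat.CORRUPT;
        }
        if (origin == Origin.OTHER_TOOL) {
            // E.g. Hive or Impala: no sanity check needed.
            return DateFormat.CORRECT;
        }
        // Cannot tell whether an older Drill wrote the file: inspect the data.
        return DateFormat.CHECK_VALUES;
    }
}
```

In this arrangement the sanity check on the data is the fallback only for files of unknown origin, which is the property the proposal is after.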
