drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jinfeng Ni <...@apache.org>
Subject Re: isDateCorrect field in ParquetTableMetadata
Date Fri, 28 Oct 2016 20:46:04 GMT
Thanks for the explanation, Jason.

The three different values for DateCorruptionStatus make sense to me.

The isDataCorrect flag = true, means that the values are known to be correct.
The isDataCorrect flag = false, means that the values are know to be
incorrect, or unclear?


On Fri, Oct 28, 2016 at 12:59 PM, Jason Altekruse <jason@dremio.com> wrote:
> The isDataCorrect flag means that the values are known to be correct, and
> there is no need to auto-detect corruption or correct anything.
>
> META_SHOWS_CORRUPTION can be set either when we have a known old version of
> Drill written in the metadata, or we have older files that might have been
> written by Drill that we have checked the values in the statistics and
> found corrupt looking values. Really old files without any statistics don't
> have information that allows us to identify them as Drill-produced, so we
> have to test the values during actual page reads, this is where
> META_UNCLEAR_TEST_VALUES is used.
>
> Jason Altekruse
> Software Engineer at Dremio
> Apache Drill Committer
>
> On Fri, Oct 28, 2016 at 12:53 PM, Jinfeng Ni <jni@apache.org> wrote:
>
>> Hi Vitalli,
>>
>> DateCorruptionStatus has three possibilities: META_SHOWS_CORRUPTION,
>> META_SHOWS_NO_CORRUPTION, META_UNCLEAR_TEST_VALUES.  What value will
>> this isDateCorrect flag have for each possiblity, especially for
>> META_UNCLEAR_TEST_VALUES? Are DateCorruptionStatus and isDateCorrect
>> same things, or different?
>>
>> Thanks.
>>
>> Jinfeng
>>
>>
>>
>> On Fri, Oct 28, 2016 at 9:26 AM, Paul Rogers <progers@maprtech.com> wrote:
>> > Thanks Vitalii.
>> >
>> > The Parquet Writer solution “just works”. As soon as someone upgrades
>> the writer, files are labeled as having that new version. No fuzziness
>> during a release as in 1.9.
>> >
>> > It is fine to also include the Drill version. But, format decisions
>> should be keyed off of the writer version.
>> >
>> > By the way, do other tools happen to already do this? It would be rather
>> surprising if they didn’t.
>> >
>> > - Paul
>> >
>> >> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka <vitalii.diravka@gmail.com>
>> wrote:
>> >>
>> >> I agree that it would be good if the approach of parquet date
>> correctness
>> >> detection will be upgraded. So I created the jira for it DRILL-4980
>> >> <https://issues.apache.org/jira/browse/DRILL-4980>.
>> >>
>> >> But now we have two ideas:
>> >> 1. To add checking of the drill version additionally, so later we can
>> >> delete isDateCorrect label from parquet metadata.
>> >> 2. To add parquet writer version to the parquet metadata and check this
>> >> value instead of isDateCorrect and drillVersion.
>> >>
>> >> So which way, we should prefer now?
>> >>
>> >> Kind regards
>> >> Vitalii
>> >>
>> >> 2016-10-27 23:54 GMT+00:00 Paul Rogers <progers@maprtech.com>:
>> >>
>> >>> FWIW: back on the magic flag issue…
>> >>>
>> >>> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being
too
>> course
>> >>> grained for our needs.
>> >>>
>> >>> A typical solution is include the version of the Parquet writer in
>> >>> addition to that of Drill. Each time we change something in the writer,
>> >>> increment the version number. If we number changes, we can easily
>> handle
>> >>> two changes in the same Drill release, or differentiate between the
>> “early
>> >>> 1.9” files with old-style dates and “late 1.9” files with correct
>> dates.
>> >>>
>> >>> Since we have no version now, start it at some arbitrary point (2?).
>> >>>
>> >>> Now, if the Parquet file has a Drill Writer version in the header, and
>> >>> that version is 2 or greater, the date is in the “correct” format.
>> Anything
>> >>> written by Drill before writer version 2, the date is wrong. The
>> “check the
>> >>> data to see if it is sane” approach is needed only for files were
we
>> can’t
>> >>> tell if an older Drill wrote it.
>> >>>
>> >>> Do other tools label the data? Does Hive say that it wrote the file?
If
>> >>> so, we don’t need to do the sanity check if we can tell the data comes
>> from
>> >>> Hive (or Impala, or anything other than old Drill.)
>> >>>
>> >>> - Paul
>> >>>
>> >>>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong <zfong@maprtech.com>
wrote:
>> >>>>
>> >>>> Vitalii -- are you still planning to open a ticket and pull request
>> for
>> >>> the
>> >>>> fix you've noted below?
>> >>>>
>> >>>> -- Zelaine
>> >>>>
>> >>>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
>> >>> vitalii.diravka@gmail.com>
>> >>>> wrote:
>> >>>>
>> >>>>> @Paul Rogers
>> >>>>> It may be the undefined case when the file is generated with
>> >>> drill.version
>> >>>>> = 1.9-SNAPSHOT.
>> >>>>> It is more easy to determine corrupted date with this flag and
there
>> is
>> >>> no
>> >>>>> need to wait the end of release to merge these changes.
>> >>>>>
>> >>>>> @Jinfeng NI
>> >>>>> It looks like you are right.
>> >>>>> With consistent mode (isDateCorrect = true) all tests are passed.
So
>> I
>> >>> am
>> >>>>> going to open a jira ticket for it with next changes
>> >>>>> https://github.com/vdiravka/drill/commit/
>> ff8d5c7d601915f760d1b0e9618730
>> >>>>> 3410cac5d3
>> >>>>> Thanks.
>> >>>>>
>> >>>>> Kind regards
>> >>>>> Vitalii
>> >>>>>
>> >>>>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni <jni@apache.org>:
>> >>>>>
>> >>>>>> I'm not sure if I fully understand your answers. The bottom
line is
>> >>>>>> quite simple: given a set of parquet files, the ParquetTableMeta
>> >>>>>> instance constructed in Drill should have identical value
for
>> >>>>>> "isDateCorrect", whether it comes from parquet footer, or
parquet
>> >>>>>> metadata cache, or whether there is partition pruning or
not.
>> However,
>> >>>>>> the code shows that this flag is not in consistent mode
across
>> >>>>>> different cases.
>> >>>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
>> >>>>>> <vitalii.diravka@gmail.com> wrote:
>> >>>>>>> Hi Jinfeng,
>> >>>>>>>
>> >>>>>>> 1.If the parquet files are generated with Drill after
Drill-4203
>> these
>> >>>>>>> files have "isDateCorrect = true" property.
>> >>>>>>> Drill serializes this property from metadata now. When
we set this
>> >>>>>> property
>> >>>>>>> in the first constructor we will hide the value from
metadata.
>> >>>>>>> IsDateCorrect will be false only if this value equals
to the false
>> (no
>> >>>>>> case
>> >>>>>>> for it now) or absent in parquet metadata footer.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> 2. I'm not sure the reason to change isDateCorrect metadata
>> property
>> >>>>> when
>> >>>>>>> the user disable dates correction.
>> >>>>>>> If you have some use case it would be great if you provide
it.
>> >>>>>>>
>> >>>>>>> 3. Maybe you are right regarding to when Parquet metadata
is
>> cloned.
>> >>>>>>> Here I added the property in the same manner as Jason's
new
>> property
>> >>>>>>> "drillVersion. So need it a separate unit test?
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Kind regards
>> >>>>>>> Vitalii
>> >>>>>>>
>> >>>>>>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni <jni@apache.org>:
>> >>>>>>>
>> >>>>>>>> Forgot to copy the link to the code.
>> >>>>>>>>
>> >>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-
>> >>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>> >>>>>>>> Metadata.java#L950-L955
>> >>>>>>>>
>> >>>>>>>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni <jni@apache.org>
>> wrote:
>> >>>>>>>>> @Jason, @Vitalli,
>> >>>>>>>>>
>> >>>>>>>>> Any thoughts on this question, since both you
worked on fix of
>> >>>>>>>> DRILL-4203?
>> >>>>>>>>>
>> >>>>>>>>> Looking through the code, there is a third case
[1], where this
>> flag
>> >>>>>>>>> is set to false when Parquet metadata is cloned
(after partition
>> >>>>>>>>> pruning, etc).  That means, for the 2nd case
where the flag is
>> set
>> >>>>> to
>> >>>>>>>>> true, if there is pruning happening, the new
parquet metadata
>> will
>> >>>>> see
>> >>>>>>>>> the flag is flipped to false. This does not
make sense to me.
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni
<jni@apache.org>
>> wrote:
>> >>>>>>>>>> Hello All,
>> >>>>>>>>>>
>> >>>>>>>>>> DRILL-4203 addressed the date field issue.
 In the fix, it
>> >>>>> introduced
>> >>>>>>>>>> a new field in ParquetTableMetadata_v2 :
isDateCorrect.  I have
>> >>>>> some
>> >>>>>>>>>> difficulty in understanding the meaning
of this field.
>> >>>>>>>>>>
>> >>>>>>>>>> According to [1], this field is set to false,
when Drill gets
>> >>>>> parquet
>> >>>>>>>>>> metadata from parquet footer.  This field
is  set to true in
>> code
>> >>>>>> flow
>> >>>>>>>>>> of [2] and [3], when Drill gets parquet
metadata from meta data
>> >>>>>> cache.
>> >>>>>>>>>>
>> >>>>>>>>>> Questions I have:
>> >>>>>>>>>> 1.  If the parquet files are generated with
Drill after
>> DRILL-4203,
>> >>>>>>>>>> Drill still thinks date field is NOT correct
(isDateCorrect =
>> >>>>> false)?
>> >>>>>>>>>> 2.  Why does this filed have nothing to
do with "autoCorrection"
>> >>>>> flag
>> >>>>>>>>>> [4]?  If someone turns off autoCorrection,
will it have impact
>> on
>> >>>>>> this
>> >>>>>>>>>> "isDateCorrect" flag ?
>> >>>>>>>>>>
>> >>>>>>>>>> Thanks in advance for any input,
>> >>>>>>>>>>
>> >>>>>>>>>> Jinfeng
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-
>> >>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>> >>>>>> Metadata.java#L932
>> >>>>>>>>>> [2] https://github.com/apache/drill/blob/master/exec/java-
>> >>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>> >>>>>> Metadata.java#L936
>> >>>>>>>>>> [3] https://github.com/apache/drill/blob/master/exec/java-
>> >>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>> >>>>>> Metadata.java#L187
>> >>>>>>>>>> [4] https://github.com/apache/drill/blob/master/exec/java-
>> >>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>> >>>>>>>> Metadata.java#L354-L355
>> >>>>>>>>
>> >>>>>>
>> >>>>>
>> >>>
>> >>>
>> >
>>

Mime
View raw message