drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Paul Rogers <prog...@maprtech.com>
Subject Re: isDateCorrect field in ParquetTableMetadata
Date Fri, 28 Oct 2016 16:26:08 GMT
Thanks Vitalii.

The Parquet Writer solution “just works”. As soon as someone upgrades the writer, files
are labeled as having that new version. No fuzziness during a release as in 1.9.

It is fine to also include the Drill version. But, format decisions should be keyed off of
the writer version.

By the way, do other tools happen to already do this? It would be rather surprising if they
didn’t.

- Paul

> On Oct 28, 2016, at 8:30 AM, Vitalii Diravka <vitalii.diravka@gmail.com> wrote:
> 
> I agree that it would be good if the approach of parquet date correctness
> detection will be upgraded. So I created the jira for it DRILL-4980
> <https://issues.apache.org/jira/browse/DRILL-4980>.
> 
> But now we have two ideas:
> 1. To add checking of the drill version additionally, so later we can
> delete isDateCorrect label from parquet metadata.
> 2. To add parquet writer version to the parquet metadata and check this
> value instead of isDateCorrect and drillVersion.
> 
> So which way, we should prefer now?
> 
> Kind regards
> Vitalii
> 
> 2016-10-27 23:54 GMT+00:00 Paul Rogers <progers@maprtech.com>:
> 
>> FWIW: back on the magic flag issue…
>> 
>> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too course
>> grained for our needs.
>> 
>> A typical solution is include the version of the Parquet writer in
>> addition to that of Drill. Each time we change something in the writer,
>> increment the version number. If we number changes, we can easily handle
>> two changes in the same Drill release, or differentiate between the “early
>> 1.9” files with old-style dates and “late 1.9” files with correct dates.
>> 
>> Since we have no version now, start it at some arbitrary point (2?).
>> 
>> Now, if the Parquet file has a Drill Writer version in the header, and
>> that version is 2 or greater, the date is in the “correct” format. Anything
>> written by Drill before writer version 2, the date is wrong. The “check the
>> data to see if it is sane” approach is needed only for files were we can’t
>> tell if an older Drill wrote it.
>> 
>> Do other tools label the data? Does Hive say that it wrote the file? If
>> so, we don’t need to do the sanity check if we can tell the data comes from
>> Hive (or Impala, or anything other than old Drill.)
>> 
>> - Paul
>> 
>>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong <zfong@maprtech.com> wrote:
>>> 
>>> Vitalii -- are you still planning to open a ticket and pull request for
>> the
>>> fix you've noted below?
>>> 
>>> -- Zelaine
>>> 
>>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
>> vitalii.diravka@gmail.com>
>>> wrote:
>>> 
>>>> @Paul Rogers
>>>> It may be the undefined case when the file is generated with
>> drill.version
>>>> = 1.9-SNAPSHOT.
>>>> It is more easy to determine corrupted date with this flag and there is
>> no
>>>> need to wait the end of release to merge these changes.
>>>> 
>>>> @Jinfeng NI
>>>> It looks like you are right.
>>>> With consistent mode (isDateCorrect = true) all tests are passed. So I
>> am
>>>> going to open a jira ticket for it with next changes
>>>> https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e9618730
>>>> 3410cac5d3
>>>> Thanks.
>>>> 
>>>> Kind regards
>>>> Vitalii
>>>> 
>>>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni <jni@apache.org>:
>>>> 
>>>>> I'm not sure if I fully understand your answers. The bottom line is
>>>>> quite simple: given a set of parquet files, the ParquetTableMeta
>>>>> instance constructed in Drill should have identical value for
>>>>> "isDateCorrect", whether it comes from parquet footer, or parquet
>>>>> metadata cache, or whether there is partition pruning or not. However,
>>>>> the code shows that this flag is not in consistent mode across
>>>>> different cases.
>>>>> 
>>>>> 
>>>>> 
>>>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
>>>>> <vitalii.diravka@gmail.com> wrote:
>>>>>> Hi Jinfeng,
>>>>>> 
>>>>>> 1.If the parquet files are generated with Drill after Drill-4203
these
>>>>>> files have "isDateCorrect = true" property.
>>>>>> Drill serializes this property from metadata now. When we set this
>>>>> property
>>>>>> in the first constructor we will hide the value from metadata.
>>>>>> IsDateCorrect will be false only if this value equals to the false
(no
>>>>> case
>>>>>> for it now) or absent in parquet metadata footer.
>>>>>> 
>>>>>> 
>>>>>> 2. I'm not sure the reason to change isDateCorrect metadata property
>>>> when
>>>>>> the user disable dates correction.
>>>>>> If you have some use case it would be great if you provide it.
>>>>>> 
>>>>>> 3. Maybe you are right regarding to when Parquet metadata is cloned.
>>>>>> Here I added the property in the same manner as Jason's new property
>>>>>> "drillVersion. So need it a separate unit test?
>>>>>> 
>>>>>> 
>>>>>> Kind regards
>>>>>> Vitalii
>>>>>> 
>>>>>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni <jni@apache.org>:
>>>>>> 
>>>>>>> Forgot to copy the link to the code.
>>>>>>> 
>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>>>> Metadata.java#L950-L955
>>>>>>> 
>>>>>>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni <jni@apache.org>
wrote:
>>>>>>>> @Jason, @Vitalli,
>>>>>>>> 
>>>>>>>> Any thoughts on this question, since both you worked on fix
of
>>>>>>> DRILL-4203?
>>>>>>>> 
>>>>>>>> Looking through the code, there is a third case [1], where
this flag
>>>>>>>> is set to false when Parquet metadata is cloned (after partition
>>>>>>>> pruning, etc).  That means, for the 2nd case where the flag
is set
>>>> to
>>>>>>>> true, if there is pruning happening, the new parquet metadata
will
>>>> see
>>>>>>>> the flag is flipped to false. This does not make sense to
me.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni <jni@apache.org>
wrote:
>>>>>>>>> Hello All,
>>>>>>>>> 
>>>>>>>>> DRILL-4203 addressed the date field issue.  In the fix,
it
>>>> introduced
>>>>>>>>> a new field in ParquetTableMetadata_v2 : isDateCorrect.
 I have
>>>> some
>>>>>>>>> difficulty in understanding the meaning of this field.
>>>>>>>>> 
>>>>>>>>> According to [1], this field is set to false, when Drill
gets
>>>> parquet
>>>>>>>>> metadata from parquet footer.  This field is  set to
true in code
>>>>> flow
>>>>>>>>> of [2] and [3], when Drill gets parquet metadata from
meta data
>>>>> cache.
>>>>>>>>> 
>>>>>>>>> Questions I have:
>>>>>>>>> 1.  If the parquet files are generated with Drill after
DRILL-4203,
>>>>>>>>> Drill still thinks date field is NOT correct (isDateCorrect
=
>>>> false)?
>>>>>>>>> 2.  Why does this filed have nothing to do with "autoCorrection"
>>>> flag
>>>>>>>>> [4]?  If someone turns off autoCorrection, will it have
impact on
>>>>> this
>>>>>>>>> "isDateCorrect" flag ?
>>>>>>>>> 
>>>>>>>>> Thanks in advance for any input,
>>>>>>>>> 
>>>>>>>>> Jinfeng
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>> Metadata.java#L932
>>>>>>>>> [2] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>> Metadata.java#L936
>>>>>>>>> [3] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>> Metadata.java#L187
>>>>>>>>> [4] https://github.com/apache/drill/blob/master/exec/java-
>>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
>>>>>>> Metadata.java#L354-L355
>>>>>>> 
>>>>> 
>>>> 
>> 
>> 


Mime
View raw message