drill-dev mailing list archives

From Paul Rogers <prog...@maprtech.com>
Subject Re: isDateCorrect field in ParquetTableMetadata
Date Thu, 27 Oct 2016 23:54:25 GMT
FWIW: back on the magic flag issue…

I noted Vitalii’s concern about “1.9” and “1.9-SNAPSHOT” being too coarse-grained
for our needs.

A typical solution is to include the version of the Parquet writer in addition to that of Drill.
Each time we change something in the writer, increment the version number. If we number changes,
we can easily handle two changes in the same Drill release, or differentiate between the “early
1.9” files with old-style dates and “late 1.9” files with correct dates.

Since we have no version now, start it at some arbitrary point (2?).

Now, if the Parquet file has a Drill Writer version in the header, and that version is 2 or
greater, the date is in the “correct” format. For anything written by Drill before writer
version 2, the date is wrong. The “check the data to see if it is sane” approach is needed
only for files where we can’t tell if an older Drill wrote it.

Do other tools label the data? Does Hive say that it wrote the file? If so, we don’t need
to do the sanity check if we can tell the data comes from Hive (or Impala, or anything other
than old Drill.)
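
The rule above could be sketched roughly as follows. This is only an illustration of the proposal, not Drill code: the class, enum, and parameter names are hypothetical, and the threshold of 2 is the arbitrary starting version suggested above. It assumes the reader has already worked out from the file metadata which tool wrote the file (if that is labeled at all) and the Drill writer version (if present).

```java
import java.util.OptionalInt;

// Hypothetical sketch of the writer-version rule; none of these names exist in Drill.
final class DateFormatCheck {

    // Who wrote the file, as far as the file metadata lets us tell.
    enum Origin { DRILL, OTHER_TOOL, UNKNOWN }

    // What the reader should do with DATE values in the file.
    enum DateStatus { CORRECT, CORRUPT, NEEDS_SANITY_CHECK }

    // Assumed first writer version that emits correct dates.
    static final int FIRST_CORRECT_WRITER_VERSION = 2;

    static DateStatus classify(Origin origin, OptionalInt writerVersion) {
        // A writer version in the header settles the question on its own.
        if (writerVersion.isPresent()) {
            return writerVersion.getAsInt() >= FIRST_CORRECT_WRITER_VERSION
                    ? DateStatus.CORRECT
                    : DateStatus.CORRUPT;
        }
        switch (origin) {
            case DRILL:
                // Drill file with no writer version: written before versioning began.
                return DateStatus.CORRUPT;
            case OTHER_TOOL:
                // Labeled by Hive, Impala, etc.: never had the old Drill date problem.
                return DateStatus.CORRECT;
            default:
                // Cannot tell who wrote it: fall back to inspecting the data.
                return DateStatus.NEEDS_SANITY_CHECK;
        }
    }
}
```

With this shape, classify(Origin.DRILL, OptionalInt.empty()) gives CORRUPT, and only Origin.UNKNOWN with no writer version falls through to the data sanity check.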

- Paul
> On Oct 27, 2016, at 4:03 PM, Zelaine Fong <zfong@maprtech.com> wrote:
> Vitalii -- are you still planning to open a ticket and pull request for the
> fix you've noted below?
> -- Zelaine
> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <vitalii.diravka@gmail.com>
> wrote:
>> @Paul Rogers
>> It may be an undefined case when the file is generated with drill.version
>> = 1.9-SNAPSHOT.
>> It is easier to detect corrupted dates with this flag, and there is no
>> need to wait for the end of the release to merge these changes.
>> @Jinfeng NI
>> It looks like you are right.
>> With consistent mode (isDateCorrect = true) all tests pass. So I am
>> going to open a JIRA ticket for it with the following changes:
>> https://github.com/vdiravka/drill/commit/ff8d5c7d601915f760d1b0e96187303410cac5d3
>> Thanks.
>> Kind regards
>> Vitalii
>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni <jni@apache.org>:
>>> I'm not sure I fully understand your answers. The bottom line is
>>> quite simple: given a set of parquet files, the ParquetTableMeta
>>> instance constructed in Drill should have an identical value for
>>> "isDateCorrect", whether it comes from the parquet footer or the parquet
>>> metadata cache, and whether or not there is partition pruning. However,
>>> the code shows that this flag is not set consistently across
>>> these cases.
>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
>>> <vitalii.diravka@gmail.com> wrote:
>>>> Hi Jinfeng,
>>>> 1. If the parquet files are generated with Drill after DRILL-4203, these
>>>> files have the "isDateCorrect = true" property.
>>>> Drill now serializes this property from the metadata. If we set this property
>>>> in the first constructor, we will hide the value from the metadata.
>>>> isDateCorrect will be false only if this value is false (there is no case
>>>> for that now) or is absent from the parquet metadata footer.
>>>> 2. I'm not sure why the isDateCorrect metadata property should change when
>>>> the user disables date correction.
>>>> If you have a use case, it would be great if you could provide it.
>>>> 3. Maybe you are right regarding the case when Parquet metadata is cloned.
>>>> Here I added the property in the same manner as Jason's new property
>>>> "drillVersion". So does it need a separate unit test?
>>>> Kind regards
>>>> Vitalii
>>>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni <jni@apache.org>:
>>>>> Forgot to copy the link to the code.
>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L950-L955
>>>>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni <jni@apache.org> wrote:
>>>>>> @Jason, @Vitalii,
>>>>>> Any thoughts on this question, since you both worked on the fix for
>>>>>> DRILL-4203?
>>>>>> Looking through the code, there is a third case [1], where this flag
>>>>>> is set to false when Parquet metadata is cloned (after partition
>>>>>> pruning, etc.). That means that, for the 2nd case where the flag is set to
>>>>>> true, if pruning happens, the new parquet metadata will see
>>>>>> the flag flipped to false. This does not make sense to me.
>>>>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni <jni@apache.org> wrote:
>>>>>>> Hello All,
>>>>>>> DRILL-4203 addressed the date field issue. The fix introduced
>>>>>>> a new field in ParquetTableMetadata_v2: isDateCorrect. I have some
>>>>>>> difficulty understanding the meaning of this field.
>>>>>>> According to [1], this field is set to false when Drill gets parquet
>>>>>>> metadata from the parquet footer. This field is set to true in the flow
>>>>>>> of [2] and [3], when Drill gets parquet metadata from the metadata cache.
>>>>>>> Questions I have:
>>>>>>> 1. If the parquet files are generated with Drill after DRILL-4203,
>>>>>>> does Drill still think the date field is NOT correct (isDateCorrect = false)?
>>>>>>> 2. Why does this field have nothing to do with the "autoCorrection" flag
>>>>>>> [4]? If someone turns off autoCorrection, will it have an impact on this
>>>>>>> "isDateCorrect" flag?
>>>>>>> Thanks in advance for any input,
>>>>>>> Jinfeng
>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L932
>>>>>>> [2] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L936
>>>>>>> [3] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L187
>>>>>>> [4] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/Metadata.java#L354-L355
