drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Vitalii Diravka <vitalii.dira...@gmail.com>
Subject Re: isDateCorrect field in ParquetTableMetadata
Date Fri, 28 Oct 2016 20:03:44 GMT
I explored metadata of parquet files generated from different tools:

* Impala:*
creator:         impala version 2.2.0-cdh5.4.5 (build
4a81c1d04c39961ef14ff6131d543dd96ef60e6e)

*Hive:*
creator:       parquet-mr version 1.6.0

*Pig:*
creator:     parquet-mr version 1.5.1-SNAPSHOT
extra:       pig.schema = recipe: chararray,ingredients: {(name:
chararray)},inventor: (name: chararray,age: int)

*Spark:*
creator:          parquet-mr
extra:            parquet.proto.descriptor = name:
"ArrayWithNestedGroupAndArray" field { name: "primitive" number: 1 label:
LABEL_OPTIONAL type: TYPE_INT32 } field { name: "myComplex" number: 2
label: LABEL_REPEATED type: TYPE_MESSAGE type_name:
".TestProtobuf.MyComplex" }
extra:            parquet.proto.class =
parquet.proto.test.TestProtobuf$ArrayWithNestedGroupAndArray


*Drill (now):*creator:     parquet-mr version 1.8.1-drill-r0 (build
6b605a4ea05b66e1a6bf843353abcb4834a4ced8)
extra:       drill.version = 1.9.0-SNAPSHOT
extra:       is.date.correct = true

So we can replace second extra with "parquet-writer.version = 2.0.0".
Thoughts?


Kind regards
Vitalii

2016-10-28 16:26 GMT+00:00 Paul Rogers <progers@maprtech.com>:

> Thanks Vitalii.
>
> The Parquet Writer solution “just works”. As soon as someone upgrades the
> writer, files are labeled as having that new version. No fuzziness during a
> release as in 1.9.
>
> It is fine to also include the Drill version. But, format decisions should
> be keyed off of the writer version.
>
> By the way, do other tools happen to already do this? It would be rather
> surprising if they didn’t.
>
> - Paul
>
> > On Oct 28, 2016, at 8:30 AM, Vitalii Diravka <vitalii.diravka@gmail.com>
> wrote:
> >
> > I agree that it would be good if the approach of parquet date correctness
> > detection will be upgraded. So I created the jira for it DRILL-4980
> > <https://issues.apache.org/jira/browse/DRILL-4980>.
> >
> > But now we have two ideas:
> > 1. To add checking of the drill version additionally, so later we can
> > delete isDateCorrect label from parquet metadata.
> > 2. To add parquet writer version to the parquet metadata and check this
> > value instead of isDateCorrect and drillVersion.
> >
> > So which way, we should prefer now?
> >
> > Kind regards
> > Vitalii
> >
> > 2016-10-27 23:54 GMT+00:00 Paul Rogers <progers@maprtech.com>:
> >
> >> FWIW: back on the magic flag issue…
> >>
> >> I noted Vitali’s concern about “1.9” and “1.9-SNAPSHOT” being too
course
> >> grained for our needs.
> >>
> >> A typical solution is include the version of the Parquet writer in
> >> addition to that of Drill. Each time we change something in the writer,
> >> increment the version number. If we number changes, we can easily handle
> >> two changes in the same Drill release, or differentiate between the
> “early
> >> 1.9” files with old-style dates and “late 1.9” files with correct dates.
> >>
> >> Since we have no version now, start it at some arbitrary point (2?).
> >>
> >> Now, if the Parquet file has a Drill Writer version in the header, and
> >> that version is 2 or greater, the date is in the “correct” format.
> Anything
> >> written by Drill before writer version 2, the date is wrong. The “check
> the
> >> data to see if it is sane” approach is needed only for files were we
> can’t
> >> tell if an older Drill wrote it.
> >>
> >> Do other tools label the data? Does Hive say that it wrote the file? If
> >> so, we don’t need to do the sanity check if we can tell the data comes
> from
> >> Hive (or Impala, or anything other than old Drill.)
> >>
> >> - Paul
> >>
> >>> On Oct 27, 2016, at 4:03 PM, Zelaine Fong <zfong@maprtech.com> wrote:
> >>>
> >>> Vitalii -- are you still planning to open a ticket and pull request for
> >> the
> >>> fix you've noted below?
> >>>
> >>> -- Zelaine
> >>>
> >>> On Wed, Oct 26, 2016 at 8:28 AM, Vitalii Diravka <
> >> vitalii.diravka@gmail.com>
> >>> wrote:
> >>>
> >>>> @Paul Rogers
> >>>> It may be the undefined case when the file is generated with
> >> drill.version
> >>>> = 1.9-SNAPSHOT.
> >>>> It is more easy to determine corrupted date with this flag and there
> is
> >> no
> >>>> need to wait the end of release to merge these changes.
> >>>>
> >>>> @Jinfeng NI
> >>>> It looks like you are right.
> >>>> With consistent mode (isDateCorrect = true) all tests are passed. So
I
> >> am
> >>>> going to open a jira ticket for it with next changes
> >>>> https://github.com/vdiravka/drill/commit/
> ff8d5c7d601915f760d1b0e9618730
> >>>> 3410cac5d3
> >>>> Thanks.
> >>>>
> >>>> Kind regards
> >>>> Vitalii
> >>>>
> >>>> 2016-10-25 18:36 GMT+00:00 Jinfeng Ni <jni@apache.org>:
> >>>>
> >>>>> I'm not sure if I fully understand your answers. The bottom line
is
> >>>>> quite simple: given a set of parquet files, the ParquetTableMeta
> >>>>> instance constructed in Drill should have identical value for
> >>>>> "isDateCorrect", whether it comes from parquet footer, or parquet
> >>>>> metadata cache, or whether there is partition pruning or not.
> However,
> >>>>> the code shows that this flag is not in consistent mode across
> >>>>> different cases.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Tue, Oct 25, 2016 at 11:24 AM, Vitalii Diravka
> >>>>> <vitalii.diravka@gmail.com> wrote:
> >>>>>> Hi Jinfeng,
> >>>>>>
> >>>>>> 1.If the parquet files are generated with Drill after Drill-4203
> these
> >>>>>> files have "isDateCorrect = true" property.
> >>>>>> Drill serializes this property from metadata now. When we set
this
> >>>>> property
> >>>>>> in the first constructor we will hide the value from metadata.
> >>>>>> IsDateCorrect will be false only if this value equals to the
false
> (no
> >>>>> case
> >>>>>> for it now) or absent in parquet metadata footer.
> >>>>>>
> >>>>>>
> >>>>>> 2. I'm not sure the reason to change isDateCorrect metadata
property
> >>>> when
> >>>>>> the user disable dates correction.
> >>>>>> If you have some use case it would be great if you provide it.
> >>>>>>
> >>>>>> 3. Maybe you are right regarding to when Parquet metadata is
cloned.
> >>>>>> Here I added the property in the same manner as Jason's new
property
> >>>>>> "drillVersion. So need it a separate unit test?
> >>>>>>
> >>>>>>
> >>>>>> Kind regards
> >>>>>> Vitalii
> >>>>>>
> >>>>>> 2016-10-25 16:23 GMT+00:00 Jinfeng Ni <jni@apache.org>:
> >>>>>>
> >>>>>>> Forgot to copy the link to the code.
> >>>>>>>
> >>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-
> >>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
> >>>>>>> Metadata.java#L950-L955
> >>>>>>>
> >>>>>>> On Tue, Oct 25, 2016 at 9:16 AM, Jinfeng Ni <jni@apache.org>
> wrote:
> >>>>>>>> @Jason, @Vitalli,
> >>>>>>>>
> >>>>>>>> Any thoughts on this question, since both you worked
on fix of
> >>>>>>> DRILL-4203?
> >>>>>>>>
> >>>>>>>> Looking through the code, there is a third case [1],
where this
> flag
> >>>>>>>> is set to false when Parquet metadata is cloned (after
partition
> >>>>>>>> pruning, etc).  That means, for the 2nd case where the
flag is set
> >>>> to
> >>>>>>>> true, if there is pruning happening, the new parquet
metadata will
> >>>> see
> >>>>>>>> the flag is flipped to false. This does not make sense
to me.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Mon, Oct 24, 2016 at 3:10 PM, Jinfeng Ni <jni@apache.org>
> wrote:
> >>>>>>>>> Hello All,
> >>>>>>>>>
> >>>>>>>>> DRILL-4203 addressed the date field issue.  In the
fix, it
> >>>> introduced
> >>>>>>>>> a new field in ParquetTableMetadata_v2 : isDateCorrect.
 I have
> >>>> some
> >>>>>>>>> difficulty in understanding the meaning of this
field.
> >>>>>>>>>
> >>>>>>>>> According to [1], this field is set to false, when
Drill gets
> >>>> parquet
> >>>>>>>>> metadata from parquet footer.  This field is  set
to true in code
> >>>>> flow
> >>>>>>>>> of [2] and [3], when Drill gets parquet metadata
from meta data
> >>>>> cache.
> >>>>>>>>>
> >>>>>>>>> Questions I have:
> >>>>>>>>> 1.  If the parquet files are generated with Drill
after
> DRILL-4203,
> >>>>>>>>> Drill still thinks date field is NOT correct (isDateCorrect
=
> >>>> false)?
> >>>>>>>>> 2.  Why does this filed have nothing to do with
"autoCorrection"
> >>>> flag
> >>>>>>>>> [4]?  If someone turns off autoCorrection, will
it have impact on
> >>>>> this
> >>>>>>>>> "isDateCorrect" flag ?
> >>>>>>>>>
> >>>>>>>>> Thanks in advance for any input,
> >>>>>>>>>
> >>>>>>>>> Jinfeng
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> [1] https://github.com/apache/drill/blob/master/exec/java-
> >>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
> >>>>> Metadata.java#L932
> >>>>>>>>> [2] https://github.com/apache/drill/blob/master/exec/java-
> >>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
> >>>>> Metadata.java#L936
> >>>>>>>>> [3] https://github.com/apache/drill/blob/master/exec/java-
> >>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
> >>>>> Metadata.java#L187
> >>>>>>>>> [4] https://github.com/apache/drill/blob/master/exec/java-
> >>>>>>> exec/src/main/java/org/apache/drill/exec/store/parquet/
> >>>>>>> Metadata.java#L354-L355
> >>>>>>>
> >>>>>
> >>>>
> >>
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message