drill-dev mailing list archives

From "John Omernik (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5471) Provide better documentation around Parquet, Options and Integration with Arrow
Date Thu, 04 May 2017 11:04:04 GMT
John Omernik created DRILL-5471:

             Summary: Provide better documentation around Parquet, Options and Integration with Arrow
                 Key: DRILL-5471
                 URL: https://issues.apache.org/jira/browse/DRILL-5471
             Project: Apache Drill
          Issue Type: Improvement
          Components: Documentation
    Affects Versions: 1.10.0
            Reporter: John Omernik

Apache Drill makes heavy use of the Apache Parquet file format.  This is a great thing.  In
addition, with the advent of Apache Arrow and JIRAs like https://issues.apache.org/jira/browse/DRILL-4455,
understanding how Drill integrates with the projects that are important to it (Parquet/Arrow) is
important, yet that integration is very opaque to end users.

What do I mean by this? The Arrow JIRA is interesting: it looks like there is a benefit
to getting Drill and Arrow on the same path. Yet asking the community "Is there interest in this?"
is a very difficult proposition. I would love to chime in on this topic, but I don't understand
what is happening well enough to make an informed comment.  This is true of Arrow, and it's true
of Parquet.

For Parquet, there are two readers included in Apache Drill. There are a number of options
for encoding in the writer, and there are settings for row group sizes, compression, etc.  How do
these all play out?  For end users who are trying to read Parquet files created with
older versions of Parquet, or with the versions of Parquet used by Spark, Impala, Hive, etc., how can
we give them things to try in order to get better performance or troubleshoot errors
in queries?

Yes, there are lots of JIRAs and/or code comments around these projects. However, we need better
documentation of where we are now with some of these critical projects (Calcite as well?).  Are we
using releases of those projects? Have we written Drill's own version (like a Parquet reader)?
Are we on forks of other projects?  Do we have project goals? For example, do we believe it would
be a good project goal to work toward using a standardized Parquet writer, but still use our own reader?
What about the Arrow integration?  What benefits would an end user see?

For some of these major components, describing what the current challenges are, what the
potential future states could be, and what those future states could bring the end user could
help generate user interest, or even lead users to contribute to moving the future state forward.
In addition, a page or pages on roadmaps, features, tweaks, etc. on the documentation website
could link to relevant JIRAs and provide a way to track progress.
