drill-dev mailing list archives

From Aman Sinha <amansi...@apache.org>
Subject Re: Drill 2.0 (design) hackathon
Date Wed, 06 Sep 2017 05:47:46 GMT
Here is the Eventbrite event for registration:

https://www.eventbrite.com/e/drill-developer-day-sept-2017-registration-7478463285

Please register so we can plan for food and drinks appropriately.

The event page also links to a Google doc with the preliminary agenda and a
'Topics' tab that has a volunteer sign-up column.  Please add your name to the
area(s) that interest you.

Thanks, and I look forward to seeing you all!

-Aman

On Wed, Aug 30, 2017 at 9:44 AM, Paul Rogers <progers@mapr.com> wrote:

> A partial list of Drill’s public APIs:
>
> IMHO, highest priority for Drill 2.0.
>
>
>   *   JDBC/ODBC drivers
>   *   Client (for JDBC/ODBC) + ODBC & JDBC
>   *   Client (for full Drill async, columnar)
>   *   Storage plugin
>   *   Format plugin
>   *   System/session options
>   *   Queueing (e.g. ZK-based queues)
>   *   REST API
>   *   Resource Planning (e.g. max query memory per node)
>   *   Metadata access, storage (e.g. file system locations vs. a metastore)
>   *   Metadata files formats (Parquet, views, etc.)
>
> Lower priority for future releases:
>
>
>   *   Query Planning (e.g. Calcite rules)
>   *   Config options
>   *   SQL syntax, especially Drill extensions
>   *   UDF
>   *   Management (e.g. JMX, REST API calls, etc.)
>   *   Drill File System (HDFS)
>   *   Web UI
>   *   Shell scripts
>
> There are certainly more. Please suggest those that are missing. I’ve
> taken a rough cut at which APIs need forward/backward compatibility first,
> in part based on those that are the “most public” and most likely to
> change. Others are important, but we can’t do them all at once.
>
> Thanks,
>
> - Paul
>
> On Aug 29, 2017, at 6:00 PM, Aman Sinha <amansinha@apache.org> wrote:
>
> Hi Paul,
> It certainly makes sense to have the API compatibility discussions during this
> hackathon.  The 2.0 release may be a good checkpoint to introduce breaking
> changes necessitating changes to the ODBC/JDBC drivers and other external
> applications. As part of this exercise (not during the hackathon but as a
> follow-up action), we also should clearly identify the "public" interfaces.
>
>
> I will add this to the agenda.
>
> thanks,
> -Aman
>
> On Tue, Aug 29, 2017 at 2:08 PM, Paul Rogers <progers@mapr.com> wrote:
>
> Thanks Aman for organizing the Hackathon!
>
> The list included many good ideas for Drill 2.0. Some of those require
> changes to Drill’s “public” interfaces (file format, client protocol, SQL
> behavior, etc.)
>
> At present, Drill has no good mechanism to handle backward/forward
> compatibility at the API level. Protobuf versioning certainly helps, but
> can’t completely solve semantic changes (where a field changes meaning, or
> a non-Protobuf data chunk changes format.) As just one concrete example,
> changing to Arrow will break pre-Arrow ODBC/JDBC drivers because class
> names and data formats will change.
>
> Perhaps we can prioritize, for the proposed 2.0 release, a one-time set of
> breaking changes that introduce a versioning mechanism into our public
> APIs. Once these are in place, we can evolve the APIs in the future by
> following the newly-created versioning protocol.
>
> Without such a mechanism, we cannot support old & new clients in the same
> cluster. Nor can we support rolling upgrades. Of course, another solution
> is to get it right the second time, then freeze all APIs and agree to never
> again change them. Not sure we have sufficient access to a crystal ball to
> predict everything we’d ever need in our APIs, however...
>
> Thanks,
>
> - Paul
>
> On Aug 24, 2017, at 8:39 AM, Aman Sinha <amansinha@apache.org> wrote:
>
> Drill Developers,
>
> In order to kick-start the Drill 2.0 release discussions, I would like to
> propose a Drill 2.0 (design) hackathon (a.k.a. Drill Developer Day ™ :) ).
>
> As I mentioned in the hangout on Tuesday,  MapR has offered to host it on
> Sept 18th at their offices at 350 Holger Way, San Jose.   Hope that works
> for most of you!
>
> The goal is to get the community together for a day-long technical
> discussion on key topics in preparation for a Drill 2.0 release as well as
> potential improvements in upcoming 1.xx releases.  Depending on the
> interest areas, we could form groups and have a volunteer lead each group.
>
> Based on prior discussions on the dev list, hangouts and existing JIRAs,
> there is already a substantial set of topics and I have summarized a few of
> them below.   What other topics do folks want to talk about?   Feel free to
> respond to this thread and I will create a google doc to consolidate.
> Understandably, the list would be long but we will use the hackathon to get
> a sense of a reasonable feature set for 1.xx and 2.0 releases.
>
>
> 1. Metadata management.
>
> 1a: Defining an abstraction layer for various types of metadata: views,
> schema, statistics, security
>
> 1b: Underlying storage for metadata: what are the options and their
> trade-offs?
>
>     - Hive metastore
>
>     - Parquet metadata cache (parquet specific)
>
>     - An embedded DBMS
>
>     - A distributed key-value store
>
>     - Others...
>
>
>
> 2. Drill integration with Apache Arrow
>
> 2a: Evaluate the choices and tradeoffs
>
>
>
> 3. Resource management
>
> 3a: Memory limits per query
>
> 3b: Spilling
>
> 3c: Resource management with Drill on YARN/Mesos/Kubernetes
>
> 3d: Local vs. global resource management
>
> 3e: Aligning with admission control/queueing
>
>
>
> 4. TPC-DS coverage and related planner/operator enhancements
>
> 4a: Additional set operations: INTERSECT, EXCEPT
>
> 4b: GROUPING SETS, ROLLUP, CUBE support
>
> 4c: Handling inequality joins and Cartesian joins of non-scalar inputs
> (via Nested Loop Join)
>
> 4d: Remaining gaps in correlated subquery support
>
> 4e: Statistics: Number of Distinct Values, Histograms
>
>
>
> 5. Schema handling
>
> 5a: Creation, management of schema
>
> 5b: Handling schema changes in certain common cases
>
> 5c: Schema-awareness
>
> 5d: Others TBD
>
>
>
> 6. Concurrency
>
> 6a: What are the bottlenecks to achieving higher concurrency?
>
> 6b: Ideas to address these, e.g. async execution?
>
>
>
> 7. Storage plugin and REST API related enhancements
>
>   <Topics TBD>
>
>
>
> 8. Performance improvements
>
> 8a: Filter pushdown
>
> 8b: Vectorized Parquet reader
>
> 8c: Code-gen improvements
>
> 8d: Others TBD
>
>
>
>
