drill-dev mailing list archives

From Jinfeng Ni <...@apache.org>
Subject Re: Thinking about Drill 2.0
Date Wed, 07 Jun 2017 20:30:45 GMT
Agreed with the two items Parth listed: schema-free support and improving
Drill's execution architecture.

Schema-free (or schema-on-read) operation is one main feature that
differentiates Drill from other similar projects. There still seems to be a
long list of improvements needed to fully support this feature. I feel Arrow
integration would probably help; in that sense, we need to spend effort
investigating and integrating Drill with the Arrow library. (The UnionVector
on the Drill side seems not fully complete compared to what Arrow offers.)

The second item is also critical, as Drill uses more CPU/threads/memory than
necessary in many cases.

Regarding APIs and interfaces, we probably need to put them into two
categories: one for applications (API), the other for the server (SPI). I
would assume the storage plugin and UDF interfaces fall into the second
category. When we discuss compatibility, we may have different requirements
for the different categories.

Getting off the Calcite/Parquet forks is important, but I feel it may not
have to be a prerequisite for 2.0.



On Mon, Jun 5, 2017 at 1:53 PM, Parth Chandra <parthc@apache.org> wrote:

> Adding to my list of things to consider for Drill 2.0,  I would think that
> getting Drill off our forks of Calcite and Parquet should also be a goal,
> though a tactical one.
>
>
>
> On Mon, Jun 5, 2017 at 1:51 PM, Parth Chandra <parthc@apache.org> wrote:
>
> > Nice suggestion Paul, to start a discussion on 2.0 (it's about time). I
> > would like to make this a broader discussion than just APIs, though APIs
> > are a good place to start. In particular, we usually get the opportunity
> > to break backward compatibility only for a major release, and that is
> > the time we have to finalize the APIs.
> >
> > In the broader discussion I feel we also need to consider some other
> > aspects -
> >   1) Formalize Drill's support for schema free operations.
> >   2) Drill's execution engine architecture and its 'optimistic' use of
> > resources.
> >
> > Re the APIs:
> >   One more public API is the UDFs. This and the storage plugin APIs
> > together are tied at the hip with vectors and memory management. I'm not
> > sure if we can cleanly separate the underlying representation of vectors
> > from the interfaces to these APIs, but I agree we need to clarify this
> > part. For instance, some of the performance benefits in the Parquet scan
> > come from vectorizing writes to the vector, especially for null or
> > repeated values. We could provide interfaces that offer the same benefit,
> > without which the scans would have to be vector-internals aware. The same
> > goes for UDFs. Assuming that a 2.0 goal would be to provide vectorized
> > interfaces for users to write table (or aggregate) UDFs, one now needs a
> > standardized data set representation. If you choose this data set
> > representation to be columnar (for better vectorization), will you end up
> > with ValueVector/Arrow based RecordBatches? I included Arrow in this
> > since the project is formalizing exactly this requirement.
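To make the vectorized-UDF idea above concrete, here is a minimal sketch of what a batch-at-a-time UDF interface could look like. All names here (ColumnReader, ColumnWriter, VectorizedUdf) are hypothetical, not actual Drill or Arrow APIs, and plain arrays stand in for value vectors:

```java
// Hypothetical sketch only: a vectorized, batch-at-a-time UDF interface.
// Names are illustrative, not real Drill/Arrow APIs.
public class VectorizedUdfSketch {

    // Read-only view of one column within a batch.
    interface ColumnReader {
        int rowCount();
        boolean isNull(int row);
        long getLong(int row);
    }

    // Append-only writer for the UDF's output column.
    interface ColumnWriter {
        void writeLong(long value);
        void writeNull();
    }

    // The UDF sees the whole batch, so the loop over rows lives inside the
    // UDF and can stay tight and JIT-friendly, instead of one call per row.
    interface VectorizedUdf {
        void eval(ColumnReader in, ColumnWriter out);
    }

    // Minimal array-backed column standing in for a value vector.
    static class LongColumn implements ColumnReader, ColumnWriter {
        final long[] values;
        final boolean[] nulls;
        int size;
        LongColumn(int capacity) {
            values = new long[capacity];
            nulls = new boolean[capacity];
        }
        static LongColumn of(long... vals) {
            LongColumn c = new LongColumn(vals.length);
            for (long v : vals) c.writeLong(v);
            return c;
        }
        public int rowCount() { return size; }
        public boolean isNull(int row) { return nulls[row]; }
        public long getLong(int row) { return values[row]; }
        public void writeLong(long v) { values[size++] = v; }
        public void writeNull() { nulls[size++] = true; }
    }

    // Example UDF: double each value, passing nulls through.
    static final VectorizedUdf DOUBLE_IT = (in, out) -> {
        for (int i = 0; i < in.rowCount(); i++) {
            if (in.isNull(i)) out.writeNull();
            else out.writeLong(in.getLong(i) * 2);
        }
    };

    public static void main(String[] args) {
        LongColumn in = LongColumn.of(1, 2, 3);
        LongColumn out = new LongColumn(in.rowCount());
        DOUBLE_IT.eval(in, out);
        for (int i = 0; i < out.rowCount(); i++) {
            System.out.println(out.getLong(i)); // 2, 4, 6
        }
    }
}
```

The point of the sketch is that the UDF never touches vector internals: whatever backs ColumnReader/ColumnWriter (ValueVector, Arrow, or something else) can change without breaking UDF authors.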
> >
> > For the client APIs, I believe that ODBC and JDBC drivers initially were
> > written using record-based APIs provided by vendors, but to get better
> > performance started to move to working with raw streams coming over the
> > wire (e.g. TDS with Sybase/MS-SQLServer [1]). So what Drill does is in
> > fact similar to that approach. The client APIs are really thin layers on
> > top of the vector data stream and provide row-based, read-only access to
> > the vectors.
> >
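The "thin layer" idea above can be sketched as a row-based, read-only cursor over columnar batch data. This is illustrative only: the class and method names are invented, not the actual Drill client API, and plain arrays stand in for value vectors:

```java
// Sketch: a forward-only, read-only row cursor over columnar batch data,
// the kind of thin layer a client API can put on top of a vector stream.
// Names are hypothetical, not the real Drill client API.
public class RowCursorSketch {

    // One batch of columnar data, as it might arrive off the wire.
    static class Batch {
        final long[] ids;
        final String[] names;
        Batch(long[] ids, String[] names) {
            this.ids = ids;
            this.names = names;
        }
    }

    // Row view: each getter just indexes into the current row of a column,
    // so no data is copied or pivoted into row format.
    static class RowCursor {
        private final Batch batch;
        private int row = -1;
        RowCursor(Batch batch) { this.batch = batch; }
        boolean next() { return ++row < batch.ids.length; }
        long getId() { return batch.ids[row]; }
        String getName() { return batch.names[row]; }
    }

    public static void main(String[] args) {
        RowCursor cursor = new RowCursor(
            new Batch(new long[] {1, 2}, new String[] {"a", "b"}));
        while (cursor.next()) {
            System.out.println(cursor.getId() + ":" + cursor.getName());
        }
    }
}
```

Because the cursor only reads, the columnar representation underneath can evolve (or be swapped for Arrow) without the row-based contract changing.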
> > Lest I begin to sound too contrary,  thank you for starting this
> > discussion. It is really needed!
> >
> > Parth
> >
> >
> > On Mon, Jun 5, 2017 at 11:59 AM, Paul Rogers <progers@mapr.com> wrote:
> >
> >> Hi All,
> >>
> >> A while back there was a discussion about the scope of Drill 2.0. Got me
> >> thinking about possible topics. My two cents:
> >>
> >> Drill 2.0 should focus on making Drill’s external APIs production ready.
> >> This means five things:
> >>
> >> * Clearly identify and define each API.
> >> * (Re)design each API to ensure it fully isolates the client from Drill
> >> internals.
> >> * Ensure the API allows full version compatibility: Allow mixing of
> >> old/new clients and servers with some limits.
> >> * Fully test each API.
> >> * Fully document each API.
> >>
> >> Once client code is isolated from Drill internals, we are free to evolve
> >> the internals in either Drill 2.0 or a later release.
> >>
> >> In my mind, the top APIs to revisit are:
> >>
> >> * The drill client API.
> >> * The storage plugin API.
> >>
> >> (Explanation below.)
> >>
> >> What other APIs should we consider? Here are some examples, please
> >> suggest items you know about:
> >>
> >> * Command line scripts and arguments
> >> * REST API
> >> * Names and contents of system tables
> >> * Structure of the storage plugin configuration JSON
> >> * Structure of the query profile
> >> * Structure of the EXPLAIN PLAN output.
> >> * Semantics of Drill functions, such as the date functions recently
> >> partially fixed by adding “ANSI” alternatives.
> >> * Naming of config and system/session options.
> >> * (Your suggestions here…)
> >>
> >> I’ve taken the liberty of moving some API-breaking tickets in the Apache
> >> Drill JIRA to 2.0. Perhaps we can add others so that we have a good
> >> inventory of 2.0 candidates.
> >>
> >> Here are the reasons for my two suggestions.
> >>
> >> Today, we expose Drill value vectors to the client. This means if we
> >> want to enhance anything about Drill’s internal memory format (i.e.
> >> value vectors, such as a possible move to Arrow), we break compatibility
> >> with old clients. Using value vectors also means we need a very large
> >> percentage of Drill’s internal code on the client in Java or C++. We are
> >> learning that doing so is a challenge.
> >>
> >> A new client API should follow established SQL database tradition: a
> >> synchronous, row-based API designed for versioning, for forward and
> >> backward compatibility, and to support ODBC and JDBC users.
> >>
> >> We can certainly maintain the existing full, async, heavy-weight client
> >> for our tests and for applications that would benefit from it.
> >>
> >> Once we define a new API, we are free to alter Drill’s value vectors
> >> to, say, add the needed null states to fully support JSON, to change
> >> offset vectors to not need n+1 values (which doubles vector size in 64K
> >> batches), and so on. Since vectors become private to Drill (or Arrow)
> >> behind the new client API, we are free to innovate to improve them.
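The "n+1 values doubles vector size" point comes down to power-of-two buffer sizing: a full 64K-row batch needs 65,537 offsets, and one entry past 65,536 forces an allocator that rounds to powers of two (as Netty-style pooled allocators do) up to 131,072 slots. A small sketch of the arithmetic, assuming power-of-two rounding:

```java
// Why n+1 offset entries double the vector at 64K rows: a power-of-two
// allocator rounds 65537 requested entries up to 131072 slots.
public class OffsetVectorSizing {

    // Round a requested entry count up to the next power of two.
    static int allocationSlots(int entries) {
        int pow = Integer.highestOneBit(entries);
        return (pow == entries) ? entries : pow << 1;
    }

    public static void main(String[] args) {
        int rows = 65536;            // a full 64K-record batch
        int offsets = rows + 1;      // offset vectors store n + 1 entries
        System.out.println(allocationSlots(rows));    // 65536 slots
        System.out.println(allocationSlots(offsets)); // 131072 slots: doubled
    }
}
```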
> >>
> >> Similarly, the storage plugin API exposes details of Calcite (which
> >> seems to evolve with each new version), exposes value vector
> >> implementations, and so on. A cleaner, simpler, more isolated API will
> >> allow storage plugins to be built faster, and will also isolate them
> >> from changes to Drill internals. Without isolation, each change to
> >> Drill internals would require plugin authors to update their plugins
> >> before Drill can be released.
> >>
> >> Thoughts? Suggestions?
> >>
> >> Thanks,
> >>
> >> - Paul
> >
> >
> >
>
