drill-dev mailing list archives

From Paul Rogers <prog...@mapr.com>
Subject Re: Thinking about Drill 2.0
Date Tue, 13 Jun 2017 04:11:43 GMT
Thanks for the suggestions!

The issue is only partly Calcite changes. The real challenge for potential contributors is
that the Drill storage plugin exposes Calcite mechanisms directly. That is, to write a storage
plugin, one must know (or, more likely, experiment to learn) the odd sequence of calls made to
the storage plugin: for a group scan, then a sub scan, then this or that. Then, having learned
those calls, one must map what the plugin needs to do onto them. In some cases, as Calcite chugs
along, it calls the same methods multiple times, so the plugin writer has to be prepared to
implement caching to avoid banging on the underlying system multiple times for the same data.
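
To make the caching point concrete, here is a minimal sketch of the kind of memoization a plugin writer ends up adding. The names (CachedLookup, the "orders" table, the fake schema string) are made up for illustration and are not part of Drill or Calcite; the only real API used is java.util.concurrent.ConcurrentHashMap.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper (not a Drill or Calcite class): remembers the result of an
// expensive lookup so repeated planner callbacks do not hit the backing system again.
final class CachedLookup<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    CachedLookup(Function<K, V> loader) {
        this.loader = loader;
    }

    // Runs the loader only the first time a key is requested.
    V get(K key) {
        return cache.computeIfAbsent(key, loader);
    }
}

class CachedLookupDemo {
    public static void main(String[] args) {
        AtomicInteger remoteCalls = new AtomicInteger();
        // The lambda stands in for a call to the underlying storage system.
        CachedLookup<String, String> schemas = new CachedLookup<>(table -> {
            remoteCalls.incrementAndGet();
            return "schema-of-" + table;
        });

        // Simulate the planner asking for the same thing three times.
        schemas.get("orders");
        schemas.get("orders");
        schemas.get("orders");

        System.out.println("remote calls: " + remoteCalls.get()); // prints "remote calls: 1"
    }
}
```

A real plugin would also have to decide when cached entries go stale; this sketch ignores that.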

The key opportunity here is to observe that the current API is at the implementation level:
it is a set of callbacks from Calcite. (Though, the Drill “easy” storage plugin does hide some
of the details.) Instead, we’d like an API at the definition level: the plugin simply
declares that, say, it can return a schema, or can handle certain kinds of filter push-down.

If we can define that API at the metadata (planning) level, then we can create an adapter
between that API and Calcite. Doing so makes it much easier to test the plugin, and isolates
the plugin from future code changes as Calcite evolves and improves: the adapter changes but
not the plugin metadata API.
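
As a rough sketch of the split described above (every name here is hypothetical; none of these types exist in Drill or Calcite), the plugin implements only a small declarative interface, and a separate adapter is the one piece that answers planner-style questions:

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical filter kinds a plugin might declare it can push down.
enum FilterOp { EQUALS, LESS_THAN, GREATER_THAN, LIKE }

// Hypothetical definition-level API: the plugin only declares what it can do.
interface PluginDefinition {
    Map<String, String> schema(String table); // column name -> type name
    Set<FilterOp> pushDownFilters();
}

// Hypothetical adapter: the only code that would need to change when the planner changes.
// The two methods stand in for whatever callbacks the planner actually makes.
final class PlannerAdapter {
    private final PluginDefinition plugin;

    PlannerAdapter(PluginDefinition plugin) {
        this.plugin = plugin;
    }

    boolean canPushDown(FilterOp op) {
        return plugin.pushDownFilters().contains(op);
    }

    String describeScan(String table) {
        return table + plugin.schema(table).keySet();
    }
}

// A toy plugin written purely against the definition-level API.
final class ToyPluginDefinition implements PluginDefinition {
    @Override
    public Map<String, String> schema(String table) {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("id", "INT");
        columns.put("name", "VARCHAR");
        return columns;
    }

    @Override
    public Set<FilterOp> pushDownFilters() {
        return EnumSet.of(FilterOp.EQUALS, FilterOp.LESS_THAN);
    }
}

class PluginDefinitionDemo {
    public static void main(String[] args) {
        PlannerAdapter adapter = new PlannerAdapter(new ToyPluginDefinition());
        System.out.println(adapter.canPushDown(FilterOp.EQUALS)); // prints "true"
        System.out.println(adapter.canPushDown(FilterOp.LIKE));   // prints "false"
        System.out.println(adapter.describeScan("people"));       // prints "people[id, name]"
    }
}
```

Note that ToyPluginDefinition can be unit tested with no planner on the classpath at all, which is the testing benefit mentioned above.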

As you suggest, the resulting definition API would be handy to share between projects.

On the execution side, however, Drill plugins are very specific to Drill’s operator framework,
Drill’s schema-on-read mechanism, Drill’s special columns (file metadata, partitions),
Drill’s vector “mutators” and so on. Here, any synergy would be with Arrow to define
a common “mutator” API so that a “row batch reader” written for one system should
work with the other.
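
To illustrate what a shared “mutator” API could look like, here is a minimal sketch. All names are hypothetical; this is neither Drill’s vector mutator nor any Arrow API. The reader writes rows only through a small interface, so the same reader could feed any engine that supplies its own implementation of that interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical common "mutator" interface: the only thing a reader knows about the engine.
interface BatchMutator {
    void startRow();
    void setInt(String column, int value);
    void setString(String column, String value);
    void endRow();
}

// Hypothetical engine-neutral reader contract.
interface RowBatchReader {
    // Writes up to maxRows rows into out and returns how many were written.
    int readBatch(BatchMutator out, int maxRows);
}

// A toy reader that produces a fixed number of generated rows.
final class CountingReader implements RowBatchReader {
    private final int total;
    private int next = 0;

    CountingReader(int total) {
        this.total = total;
    }

    @Override
    public int readBatch(BatchMutator out, int maxRows) {
        int written = 0;
        while (written < maxRows && next < total) {
            out.startRow();
            out.setInt("id", next);
            out.setString("label", "row-" + next);
            out.endRow();
            next++;
            written++;
        }
        return written;
    }
}

// Stand-in for an engine-specific mutator; a real one would write into vectors.
final class ListMutator implements BatchMutator {
    final List<String> rows = new ArrayList<>();
    private StringBuilder current = new StringBuilder();

    @Override public void startRow() { current = new StringBuilder(); }
    @Override public void setInt(String column, int value) { append(column, String.valueOf(value)); }
    @Override public void setString(String column, String value) { append(column, value); }
    @Override public void endRow() { rows.add(current.toString()); }

    private void append(String column, String value) {
        if (current.length() > 0) {
            current.append(',');
        }
        current.append(column).append('=').append(value);
    }
}

class MutatorDemo {
    public static void main(String[] args) {
        RowBatchReader reader = new CountingReader(3);
        ListMutator batch = new ListMutator();
        // Drain the reader in batches of at most two rows.
        while (reader.readBatch(batch, 2) > 0) {
            // keep reading until the reader reports no more rows
        }
        System.out.println(batch.rows);
    }
}
```

The sketch leaves out everything that makes the real problem hard (schema changes between batches, special columns, memory management); it only shows where the seam between reader and engine would sit.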

In any case, this kind of sharing is hard to define up front. We might instead keep the discussion
going to see what works for Drill, what we can abstract out, and how we can make the common
abstraction work for other systems beyond Drill.


- Paul

> On Jun 9, 2017, at 3:38 PM, Julian Hyde <jhyde@apache.org> wrote:
>> On Jun 5, 2017, at 11:59 AM, Paul Rogers <progers@mapr.com> wrote:
>> Similarly, the storage plugin API exposes details of Calcite (which seems to evolve
>> with each new version), exposes value vector implementations, and so on. A cleaner, simpler,
>> more isolated API will allow storage plugins to be built faster, but will also isolate them
>> from Drill internals changes. Without isolation, each change to Drill internals would require
>> plugin authors to update their plugin before Drill can be released.
> Sorry you’re getting burned by Calcite changes. We try to minimize impact, but sometimes
> it’s difficult to see what you’re breaking.
> I like the goal of a stable storage plugin API. Maybe it’s something Drill and Calcite
> can collaborate on? Much of the DNA of an adapter is independent of the engine that will consume
> the data (Drill or otherwise) - it concerns how to create a connection, getting metadata,
> and pushing down logical operations, and generating queries in the target system’s query
> language. Calcite and Drill ought to be able to share that part, rather than maintaining separate
> collections of adapters.
> Julian
