drill-dev mailing list archives

From "Uwe L. Korn" <uw...@xhochy.com>
Subject Re: Thinking about Drill 2.0
Date Thu, 15 Jun 2017 07:23:03 GMT
Hello Paul,

To add the perspective partly of an Arrow developer, but mostly of someone who
works quite a lot in Python with the respective data libraries there: in Python, all (performant)
data crunching work is done on columnar representations. While this is partly due to columnar
being more CPU-efficient for these tasks, it is also because columnar can be abstracted
in a form where you implement all computational work in C/C++ or an LLVM-based JIT while
still keeping clear and understandable interfaces in Python. In the end, to provide efficient
Python support, we will always have to convert into a columnar representation, which makes row-wise
APIs to a system that is internally columnar quite annoying, as we have a lot of wastage in
the conversion layer. If one wanted to support Python
UDFs, this would lead to a situation where in most cases the UDF calls are greatly dominated
by the conversion logic.
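
To make the conversion layer concrete, here is a minimal sketch using only the Python
standard library. The row data, the column names and both function names are made up for
illustration; this is neither Drill's nor Arrow's API. It only shows where the per-row work
ends up when a row-wise result has to be pivoted into columns before a UDF can run.

```python
from array import array

# Hypothetical row-wise result set, as a row-oriented client API might deliver it.
rows = [(1, 0.5), (2, 1.5), (3, 2.5)]


def rows_to_columns(rows):
    """Pivot row tuples into typed, contiguous columns (the conversion layer)."""
    ids = array("q")     # signed 64-bit integers
    values = array("d")  # 64-bit floats
    for row_id, value in rows:  # one Python-level iteration per row
        ids.append(row_id)
        values.append(value)
    return {"id": ids, "value": values}


def udf_total(columns):
    """Toy 'UDF' that consumes a whole column in a single call."""
    return sum(columns["value"])


columns = rows_to_columns(rows)
total = udf_total(columns)
```

Here the interpreted loop in rows_to_columns runs once per row, while the UDF itself is a
single call on a column. With a columnar API the first step would not be needed at all.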

For the actual performance difference this makes, you can have a look at the recent work
in Apache Spark, where Arrow is used to convert results
from Spark's internal JVM data structures into typical Python ones ("Pandas DataFrames").
In comparison to the existing conversion, this currently sees a speedup of 40x, and it will be
even higher once further steps are implemented. Julien should be able to provide a link to
slides that outline the work better.

As I'm quite new to Drill, I cannot go into much further detail w.r.t. Drill, but be aware
that for languages like Python, having a columnar API really matters. While Drill does not really
integrate with Python as a first-class citizen at the moment, moving to row-wise APIs probably
won't make a difference to the current situation, but good columnar APIs would help us to
keep the path open for the future.


> On 13.06.2017 at 06:11, Paul Rogers <progers@mapr.com> wrote:
> Thanks for the suggestions!
> The issue is only partly Calcite changes. The real challenge for potential contributors
is that the Drill storage plugin exposes Calcite mechanisms directly. That is, to write a storage
plugin, one must know (or, more likely, experiment to learn) the odd set of calls made to
the storage plugin, for a group scan, then a sub scan, then this or that. Then, learning those
calls, map what you want to do to those calls. In some cases, as Calcite chugs along, it calls
the same methods multiple times, so the plugin writer has to be prepared to implement caching
to avoid banging on the underlying system multiple times for the same data.
> The key opportunity here is to observe that the current API is at the implementation
level: as callbacks from Calcite. (Though, the Drill “easy” storage plugin does hide some
of the details.) Instead, we’d like an API at the definition level: that the plugin simply
declares that, say, it can return a schema, or can handle certain kinds of filter push-down,
> If we can define that API at the metadata (planning) level, then we can create an adapter
between that API and Calcite. Doing so makes it much easier to test the plugin, and isolates
the plugin from future code changes as Calcite evolves and improves: the adapter changes but
not the plugin metadata API.
> As you suggest, the resulting definition API would be handy to share between projects.
> On the execution side, however, Drill plugins are very specific to Drill’s operator
framework, Drill’s schema-on-read mechanism, Drill’s special columns (file metadata, partitions),
Drill’s vector “mutators” and so on. Here, any synergy would be with Arrow to define
a common “mutator” API so that a “row batch reader” written for one system should
work with the other.
> In any case, this kind of sharing is hard to define up front, we might instead keep the
discussion going to see what works for Drill, what we can abstract out, and how we can make
the common abstraction work for other systems beyond Drill.
> Thanks,
> - Paul
>> On Jun 9, 2017, at 3:38 PM, Julian Hyde <jhyde@apache.org> wrote:
>>> On Jun 5, 2017, at 11:59 AM, Paul Rogers <progers@mapr.com> wrote:
>>> Similarly, the storage plugin API exposes details of Calcite (which seems to
evolve with each new version), exposes value vector implementations, and so on. A cleaner,
simpler, more isolated API will allow storage plugins to be built faster, but will also isolate
them from Drill internals changes. Without isolation, each change to Drill internals would
require plugin authors to update their plugin before Drill can be released.
>> Sorry you’re getting burned by Calcite changes. We try to minimize impact, but
sometimes it’s difficult to see what you’re breaking.
>> I like the goal of a stable storage plugin API. Maybe it’s something Drill and
Calcite can collaborate on? Much of the DNA of an adapter is independent of the engine that
will consume the data (Drill or otherwise) - it concerns how to create a connection, getting
metadata, and pushing down logical operations, and generating queries in the target system’s
query language. Calcite and Drill ought to be able to share that part, rather than maintaining
separate collections of adapters.
>> Julian