Paul Rogers <prog...@mapr.com>
Thinking about Drill 2.0
Thu, 15 Jun 2017 17:39:41 GMT
Hi Uwe,

This is incredibly helpful information! Your explanation makes perfect sense.

We work quite a bit with ODBC and JDBC: two interfaces that are very much synchronous and
row-based. There are three key challenges in working with Drill:

* Drill results are columnar, requiring a column-to-row translation for xDBC
* Drill uses an asynchronous API, while JDBC and ODBC are synchronous, resulting in an async-to-sync
API translation.
* The JDBC API is based on the Drill client, which requires quite a bit (almost all, really)
of Drill code.
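
To make the async-to-sync translation in the second bullet concrete, here is a minimal Python sketch. It is not Drill's client API: `FakeAsyncClient`, `submit`, `on_batch`, `on_done` and `run_query_sync` are hypothetical names invented for illustration. The idea is only that a listener-style API can be bridged to a blocking iterator with a bounded queue.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the result stream


class FakeAsyncClient:
    """Hypothetical stand-in for an asynchronous, listener-based client."""

    def __init__(self, batches):
        self._batches = batches

    def submit(self, on_batch, on_done):
        # Deliver batches from a background thread, as an async client might.
        def run():
            for batch in self._batches:
                on_batch(batch)
            on_done()

        threading.Thread(target=run, daemon=True).start()


def run_query_sync(client, max_pending=2):
    """Expose the async result stream as a blocking generator of batches."""
    # The bounded queue also applies back-pressure to the producer thread.
    pending = queue.Queue(maxsize=max_pending)
    client.submit(on_batch=pending.put, on_done=lambda: pending.put(_DONE))
    while True:
        item = pending.get()  # blocks until the next batch (or the end) arrives
        if item is _DONE:
            return
        yield item


batches = [{"id": [1, 2]}, {"id": [3]}]
received = list(run_query_sync(FakeAsyncClient(batches)))
```

A synchronous caller (such as an xDBC-style driver) would simply iterate over `run_query_sync(...)` and never see the callbacks.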

The thought is to create a new API that serves the needs of ODBC and JDBC, but without the
complexity (while, of course, preserving the existing client for other uses). Said another
way, find a way to keep the xDBC interfaces simple so that they don’t take quite so much
space in the client, and don’t require quite so much work to maintain.

The first issue (row vs. columnar) turns out not to be a huge issue: the columnar-to-row translation
code exists and works. The real issue is allowing the client to control the size of the data sent
from the server. (At present, the server decides the “batch” size, and sometimes the size
is huge.) So, we can just focus on controlling batch size (and thus client buffer allocations),
but retain the columnar form, even for ODBC and JDBC.
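
As a rough illustration of this idea (not Drill code), the Python sketch below splits a large columnar batch into slices no bigger than a client-chosen row limit, and converts a slice to rows only when a row-based consumer needs it. The names `split_batch`, `to_rows` and `max_rows` are hypothetical, and a batch is modeled simply as a dict of equal-length lists.

```python
def split_batch(columns, max_rows):
    """Yield columnar slices of `columns`, each holding at most `max_rows` rows."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    row_count = len(next(iter(columns.values()))) if columns else 0
    for start in range(0, row_count, max_rows):
        # Each slice stays columnar: one shorter list per column.
        yield {name: values[start:start + max_rows]
               for name, values in columns.items()}


def to_rows(columns):
    """Columnar-to-row translation for a single (small) batch."""
    names = list(columns)
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# A "huge" server-side batch, cut down to the size the client asked for.
server_batch = {"id": [1, 2, 3, 4, 5], "name": ["a", "b", "c", "d", "e"]}
small_batches = list(split_batch(server_batch, max_rows=2))
first_rows = to_rows(small_batches[0])
```

Here the client's buffer never has to hold more than `max_rows` rows at once, while the data remains columnar until the last step.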

So, for the Pandas use case, does your code allow (or benefit from) multiple simultaneous
queries over the same connection? Or, since Python seems to be only approximately multi-threaded,
would a synchronous, columnar API work better? Here I just mean, in a single connection, is
there a need to run multiple concurrent queries, or is the classic one-concurrent-query-per-connection
model easier for Python to consume?

Another point you raise is that our client-side column format should be Arrow, or Arrow-compatible.
(That is, either using Arrow code, or the same data format as Arrow.) That way users of your
work can easily leverage Drill.

This last question raises an interesting issue that I (at least) need to understand more clearly.
Is Arrow a data format + code? Or, is the data format one aspect of Arrow, and the implementation
another? Would be great to have a common data format, but as we squeeze ever more performance
from Drill, we find we have to very carefully tune our data manipulation code for the specific
needs of Drill queries. I wonder how we’d do that if we switched to using Arrow’s generic
vector implementation code? Has anyone else wrestled with this question for your project?


- Paul

> On Jun 15, 2017, at 12:23 AM, Uwe L. Korn <uwelk@xhochy.com> wrote:
> Hello Paul,
> Bringing in a bit of the perspective partly of an Arrow developer but mostly someone
that works quite a lot in Python with the respective data libraries there: In Python all (performant)
data crunching work is done on columnar representations. While this is partly due to columnar
being more CPU-efficient for these tasks, this is also because columnar can be abstracted
in a form that you implement all computational work with C/C++ or an LLVM-based JIT while
still keeping clear and understandable interfaces in Python. In the end, to provide efficient
Python support, we will always have to convert into a columnar representation, making row-wise
APIs to a system that is internally columnar quite annoying as we have a lot of wastage in
the conversion layer. In the case that one would want to provide the ability to support Python
UDFs, this would lead to the situation that in most cases the UDF calls will be greatly dominated
by the conversion logic.
> For the actual performance differences that this makes, you can have a look at the work
that has recently been happening in Apache Spark, where Arrow is used for the conversion of the result
from Spark's internal JVM data structures into typical Python ones ("Pandas DataFrames").
In comparison to the existing conversion, this currently sees a speedup of 40x but will be
even higher once further steps are implemented. Julien should be able to provide a link to
slides that outline the work better.
> As I'm quite new to Drill, I cannot go into much further detail w.r.t. Drill, but be
aware that for languages like Python, having a columnar API really matters. While Drill does not
really integrate with Python as a first-class citizen at the moment, moving to row-wise APIs probably
won't make a difference to the current situation, but good columnar APIs would help us to
keep the path open for the future.
> Uwe

