calcite-dev mailing list archives

From Julian Hyde <jh...@apache.org>
Subject Re: Using Calcite for Apache Flink's higher-level languages
Date Thu, 10 Dec 2015 09:29:39 GMT
Stephan,

I’m pleased and flattered to hear that you are considering Calcite. Flink is an excellent
project and we would be excited to work with you folks.

Your plan to translate both the Table API and SQL to a Calcite operator tree makes sense. For
the Table API, you could take a look at RelBuilder: it is a concise and user-friendly way of
creating operator trees.
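
For example, here is a sketch of the kind of tree RelBuilder produces (the table and column
names are invented for illustration, and I assume a schema containing "EMP" has been registered):

    import org.apache.calcite.rel.RelNode;
    import org.apache.calcite.schema.SchemaPlus;
    import org.apache.calcite.sql.fun.SqlStdOperatorTable;
    import org.apache.calcite.tools.FrameworkConfig;
    import org.apache.calcite.tools.Frameworks;
    import org.apache.calcite.tools.RelBuilder;

    // Build the equivalent of
    //   SELECT deptno, COUNT(*) AS c FROM emp WHERE sal > 1000 GROUP BY deptno
    // directly as an operator tree, without going through the SQL parser.
    SchemaPlus schema = Frameworks.createRootSchema(true); // "EMP" assumed registered here
    FrameworkConfig config = Frameworks.newConfigBuilder()
        .defaultSchema(schema)
        .build();
    RelBuilder builder = RelBuilder.create(config);
    RelNode tree = builder
        .scan("EMP")
        .filter(
            builder.call(SqlStdOperatorTable.GREATER_THAN,
                builder.field("SAL"), builder.literal(1000)))
        .aggregate(
            builder.groupKey("DEPTNO"),
            builder.count(false, "C"))
        .build();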

We haven’t done much work on DAGs to date. You can definitely represent a DAG as an operator
tree — because Calcite’s optimizer uses dynamic programming, even a query as simple as
‘select * from t, t’ will create a DAG, with both sides of the join using the same table
scan — but the problem is that the cost will be double-counted: Calcite will cost the plan
as if each path through the plan were evaluated from scratch. We have some ideas about how to
fix that, such as a Spool operator [1] that creates temporary tables which, once populated, can
be used multiple times within the plan. I also think approaches based on integer linear programming
[2] are promising.
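
To see how a DAG arises, continuing the earlier RelBuilder sketch ("T" is again a placeholder
table name), the same RelNode object can feed both inputs of a join:

    import org.apache.calcite.rel.RelNode;
    import org.apache.calcite.rel.core.JoinRelType;

    // 'select * from t, t': both join inputs are literally the same
    // RelNode, so the plan is a DAG even though we call it a tree.
    RelNode scan = builder.scan("T").build();
    RelNode selfJoin = builder
        .push(scan)
        .push(scan)   // second reference to the same node
        .join(JoinRelType.INNER, builder.literal(true))
        .build();
    // Today the cost model prices the two scan paths independently, as if
    // the scan ran twice; a Spool would let it run once and be reused.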

Regarding extending the SQL grammar: the Drill project put in place a mechanism that uses
FreeMarker as a pre-processor for the grammar source file. A project can contribute a patch
to Calcite that inserts a macro into Calcite's grammar file and defines a default value for
that macro; the project's own parser then substitutes different values in those places. Phoenix
has used the same mechanism. Flink could use it to extend Calcite SQL any way it chooses.
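
Schematically, the idea looks something like this (a hypothetical sketch, not the actual grammar
source; the production and variable names are invented):

    // JavaCC production in the FreeMarker-templated grammar file. By
    // default the list is empty, so only the standard statement is
    // accepted; a downstream project's build substitutes its own parser
    // methods into the macro.
    SqlNode ExtendedStatement() :
    {
        SqlNode stmt;
    }
    {
        (
            stmt = SqlStmt()
    <#list extraStatementMethods as method>
        |   stmt = ${method}
    </#list>
        )
        { return stmt; }
    }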

Having said that, we are open to including your extensions in Calcite’s core SQL. This is
especially true of streaming SQL, where we are trying to support at least the common cases
(tumbling, hopping, sliding windows). In my opinion, the SQL to evaluate a hopping window
on streams in Flink, Storm or Samza, or on a historical table in Drill or Phoenix, should
be the same. Maybe you have novel operators that start off as Flink-specific extensions and
eventually migrate into core SQL.
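
To make that concrete, a tumbling-window aggregation over a stream might look something like
this (illustrative syntax only; the streaming grammar is not settled):

    -- Count orders per product per hour, over a stream.
    SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS windowEnd,
      productId, COUNT(*) AS c
    FROM Orders
    GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;

The appeal is that removing the STREAM keyword should give the same answer over a historical
Orders table.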

One other thing I’d like to mention: I have been thinking for a while of introducing an
“engine” SPI into Calcite, an engine being a system that has implementations of all the core
relational operators and a runtime (perhaps distributed) to execute them. Currently the only
way for a system such as Flink to use Calcite is to embed it: Flink starts up first and invokes
Calcite for parsing and planning. But for some uses you want to start Calcite first (perhaps
embedded) and configure it to use Flink for all operators that cannot be pushed down to their
data source.
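
Purely as a strawman (no such interface exists today; every name here is invented for
discussion), the SPI might be as small as:

    import org.apache.calcite.linq4j.Enumerable;
    import org.apache.calcite.plan.Convention;
    import org.apache.calcite.plan.RelOptPlanner;
    import org.apache.calcite.rel.RelNode;

    /** Speculative sketch of an "engine" SPI. */
    public interface Engine {
      /** Calling convention whose operators this engine can execute. */
      Convention convention();

      /** Registers the engine's converter and implementation rules. */
      void registerRules(RelOptPlanner planner);

      /** Executes a plan expressed in this engine's convention. */
      Enumerable<Object[]> execute(RelNode plan);
    }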

I think now would be a good time to introduce the engine SPI, with initial implementations
for Flink, Drill and perhaps Spark. To be clear, the main way for Drill and (I presume) Flink
to use Calcite would be by embedding, but the engine SPI would give our users more configuration
flexibility. It would also allow us to better test Calcite-Flink and Calcite-Drill integration
before we release.

Julian

[1] https://issues.apache.org/jira/browse/CALCITE-481

[2] http://www.vldb.org/pvldb/vol7/p217-karanasos.pdf


> On Dec 9, 2015, at 7:02 AM, Stephan Ewen <sewen@apache.org> wrote:
> 
> Hi Calcite Folks!
> 
> The Apache Flink community is currently looking into how to use Calcite for
> optimization of both batch and streaming programs.
> 
> We are looking to compile two different kinds of higher level APIs via
> Calcite to Flink's APIs:
>  - Table API (a LINQ-style DSL)
>  - SQL
> 
> Our current thought is to use largely the same translation paths for both,
> with different entry points into Calcite:
>  - the Table API creates a Calcite operator tree directly
>  - the SQL interface goes through the full stack, including parser, ...
> 
> 
> From what I have seen so far in Calcite, it looks pretty promising, with
> its configurable and extensible rule set, and the pluggable schema/metadata
> providers.
> 
> A few questions remain for us, to see how feasible this is:
> 
> 1) Are DAG programs supported? The table API produces operator DAGs, rather
> than pure trees. Do DAGs affect/limit the space explored by the optimizer
> engine?
> 
> 2) For streaming programs, we will probably want to add some custom syntax,
> specific to Flink's windows. Is it possible to also customize the SQL
> dialect of the parser?
> 
> These answers are quite crucial for us to figure out how best to use Calcite
> in our designs. Thanks for helping us...
> 
> Greetings,
> Stephan

