drill-dev mailing list archives

From Zelaine Fong <zf...@mapr.com>
Subject Re: Is it possible to delegate data joins and filtering to the datasource ?
Date Thu, 23 Mar 2017 21:14:54 GMT
The JDBC storage plugin does attempt to push down joins.  However, the Drill optimizer
evaluates several candidate query plans.  In doing so, it may choose a plan that does not
do a full pushdown if it estimates that plan to be cheaper than a full pushdown.
There are a number of open bugs with the JDBC storage plugin, including DRILL-4696.  For
that particular issue, I believe that when it was investigated, it was determined that the
costing model for the JDBC storage plugin needed more work.  Hence Drill wasn’t picking
the better full pushdown plan.
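
To illustrate the cost-based choice described above, here is a minimal Java sketch. It is
not Drill or Calcite code: PlanChoiceSketch, CandidatePlan, cheapest and all of the cost
numbers are hypothetical, chosen only to show how a skewed cost estimate can make an
optimizer pass over a full pushdown.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: these are NOT Drill or Calcite classes.
final class PlanChoiceSketch {

    // One candidate plan with the optimizer's cost estimate (arbitrary units).
    static final class CandidatePlan {
        final String description;
        final double estimatedCost;

        CandidatePlan(String description, double estimatedCost) {
            this.description = description;
            this.estimatedCost = estimatedCost;
        }
    }

    // A cost-based optimizer keeps the candidate with the lowest estimate.
    static CandidatePlan cheapest(List<CandidatePlan> candidates) {
        return Collections.min(candidates,
                Comparator.comparingDouble((CandidatePlan p) -> p.estimatedCost));
    }

    public static void main(String[] args) {
        // Made-up numbers: the cost model overprices the full pushdown.
        List<CandidatePlan> skewed = Arrays.asList(
                new CandidatePlan("full pushdown of join", 500.0),
                new CandidatePlan("scan both tables, join in Drill", 300.0));
        System.out.println(cheapest(skewed).description);

        // With a lower (still made-up) estimate, the full pushdown is chosen.
        List<CandidatePlan> adjusted = Arrays.asList(
                new CandidatePlan("full pushdown of join", 100.0),
                new CandidatePlan("scan both tables, join in Drill", 300.0));
        System.out.println(cheapest(adjusted).description);
    }
}
```

The selection logic is the same in both runs; only the estimates differ, which is why an
inaccurate cost model alone can be enough to lose the pushdown.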

-- Zelaine

On 3/23/17, 1:53 PM, "Paul Rogers" <progers@mapr.com> wrote:

    Hi Muhammad,
    It seems that the goal for filters should be possible; I’m not familiar enough with
the code to know if joins are currently supported, or if this is where you’d have to make
some contributions to Drill.
    The storage plugin is called at various places in the planning process, and can insert
planning rules. We have plugins that push down filters, so this seems possible. For example,
check Parquet and JDBC for hints. See my answer to a previous question for hints on how to
get started with storage plugins.
    Joins may be a bit more complex. You’d have to insert planner rules; such code *may*
be available, or may require extensions to Drill. Drill should certainly do this, so if the
code is not there, we’d welcome your contribution.
    You’d have to create a rule that creates a new scan operator that includes the information
you wish to push down. For example, if you push a filter, the scan definition (AKA group scan
and scan entry) would need to hold the information needed to implement the push-down. Again,
you can probably find examples for filters, but you’d have to be creative to push joins.
    Assembling the pieces: your plugin would add planner rules that determine when joins can
be pushed. Those rules would cause your plugin to create a semantic node (group scan) that
holds the required information. The planner then converts group scan nodes to specific plans
passed to the execution engine. On the execution side, your plugin provides a “Record Reader”
for your format, and that reader does the actual work to push the filter or join down to your
data source.
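
The flow above (a rule, then a scan node carrying the pushed-down information, then a reader)
can be sketched in plain Java. This is only an illustration under stated assumptions: none
of the names below (PushdownSketch, ScanSpec, pushFilter, toSourceQuery) exist in Drill, the
filter is kept as a raw string, and the data source is assumed to accept SQL-like text. A
real plugin would work with Drill's planner and record reader interfaces instead.

```java
// Hypothetical sketch: these are NOT Drill classes or interfaces.
final class PushdownSketch {

    // Stand-in for a group scan: the table to read plus any pushed-down filter.
    static final class ScanSpec {
        final String table;
        final String pushedFilter; // null when nothing has been pushed down

        ScanSpec(String table, String pushedFilter) {
            this.table = table;
            this.pushedFilter = pushedFilter;
        }
    }

    // Stand-in for a planner rule: if the source can evaluate the filter and the
    // scan holds none yet, return a new scan that carries the filter.
    // Otherwise return the scan unchanged, so the engine filters the rows itself.
    static ScanSpec pushFilter(ScanSpec scan, String filter, boolean sourceSupportsFilter) {
        if (!sourceSupportsFilter || scan.pushedFilter != null) {
            return scan;
        }
        return new ScanSpec(scan.table, filter);
    }

    // Stand-in for the record reader's setup step: turn the scan definition
    // into the request that is actually sent to the data source.
    static String toSourceQuery(ScanSpec scan) {
        String query = "SELECT * FROM " + scan.table;
        return scan.pushedFilter == null ? query : query + " WHERE " + scan.pushedFilter;
    }

    public static void main(String[] args) {
        ScanSpec plain = new ScanSpec("orders", null);
        ScanSpec pushed = pushFilter(plain, "amount > 100", true);
        System.out.println(toSourceQuery(plain));
        System.out.println(toSourceQuery(pushed));
    }
}
```

A join pushdown would follow the same shape, with the scan definition holding both tables
and the join condition rather than a single filter.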
    Your best bet is to mine existing plugins for ideas, and then experiment. Start simply
and gradually add functionality. And, ask questions back on this list.
    - Paul
    > On Mar 22, 2017, at 8:20 AM, Muhammad Gelbana <m.gelbana@gmail.com> wrote:
    > I'm trying to use Drill with a proprietary datasource that is very fast in
    > applying data joins (i.e. SQL joins) and query filters (i.e. SQL where
    > conditions).
    > To connect to that datasource, I first have to write a storage plugin, but
    > I'm not sure if my main goal is applicable.
    > My main goal is to configure Drill to let the datasource perform joins and
    > filters and only return the data. Then Drill can perform further processing
    > based on the original SQL query sent to Drill.
    > Is this possible by developing a storage plugin? Where exactly should I be
    > looking?
    > I've been going through this wiki
    > <https://github.com/paul-rogers/drill/wiki> and I don't think I understood
    > every concept. So if there is another source of information about storage
    > plugin development, please point it out.
    > *---------------------*
    > *Muhammad Gelbana*
    > http://www.linkedin.com/in/mgelbana
