drill-dev mailing list archives

From Aman Sinha <amansi...@apache.org>
Subject Re: "Death of Schema-on-Read"
Date Sun, 08 Apr 2018 19:36:15 GMT
On Sun, Apr 8, 2018 at 10:57 AM, Ted Dunning <ted.dunning@gmail.com> wrote:

> I have been thinking about this email and I still don't understand some of
> the comments.
> On Fri, Apr 6, 2018 at 5:13 PM, Aman Sinha <amansinha@apache.org> wrote:
> > On the subject of CAST pushdown to Scans, there are potential drawbacks
> > ...
> >
> >    - In general, the planner will see a Scan-Project where the Project
> >    has CAST functions.  But the Project can have arbitrary expressions,
> >    e.g. CAST(a as INT) * 5, or a combination of 2 CAST functions or
> >    non-CAST functions, etc.  It would be quite expensive to examine each
> >    expression (there could be hundreds) to determine whether it is
> >    eligible to be pushed to the Scan.
> >
> How is this different than filter and project pushdown? There could be
> hundreds of those and it could be difficult for Calcite to find appropriate
> pushdowns. But I have never heard of any problem.
> - the traversal of all expressions is already required and already done in
> order to find the set of columns that are being extracted. As such, cast
> pushdown can be done in the same motions as project pushdown.

It is true that the amount of work done by the planner would be about the
same as when determining projection pushdowns into the scan.  In my mind I
was contrasting it with the pure DDL-based approach with an explicitly
specified schema (such as with a 'CREATE EXTERNAL TABLE ...' or with
per-query hints, as Paul mentioned).  However, in the absence of those, I
agree that it would be a win to do the 'simple' CAST pushdowns, keeping in
mind that the same column may be cast in multiple ways, e.g.
CAST(a as varchar(10)) and CAST(a as varchar(20)) in the same query/view.
In such cases, we would want to either not do the pushdown or determine the
highest common datatype and push that down.
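
As a rough illustration of the 'highest common datatype' option, here is a
minimal sketch.  It is not Drill's or Calcite's planner code: the function
name common_pushdown_type and the (type_name, length) tuple representation
are hypothetical, and the widening rule only covers VARCHAR lengths and
exact type matches.

```python
def common_pushdown_type(cast_targets):
    """Pick one type to push to the scan for a column, or None to skip pushdown.

    cast_targets: list of (type_name, length) tuples, one per CAST applied to
    the column; length is None for types without a length.  This
    representation is an assumption made for the sketch.
    """
    if not cast_targets:
        return None

    names = {name.upper() for name, _ in cast_targets}
    if len(names) != 1:
        # Mixed type families (e.g. INT and VARCHAR): do not push down.
        return None

    name = names.pop()
    lengths = [length for _, length in cast_targets]

    if name == "VARCHAR":
        if any(length is None for length in lengths):
            # An unbounded VARCHAR is at least as wide as any bounded one.
            return (name, None)
        # The widest length can hold every narrower variant.
        return (name, max(lengths))

    # Any other type: push down only if every CAST is exactly the same.
    if len(set(lengths)) == 1:
        return (name, lengths[0])
    return None


print(common_pushdown_type([("varchar", 10), ("varchar", 20)]))
print(common_pushdown_type([("int", None), ("varchar", 20)]))
```

With varchar(10) and varchar(20) on the same column, the sketch picks
varchar(20) for the scan.  The narrower CAST(a as varchar(10)) would
presumably still have to stay in the Project, since it can truncate values
that fit in 20 characters.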

All of this, though, does not preclude the real need for a 'source of
truth' for the schema in cases where the data has already been explored.
We do want to have a solution for that core issue.

