drill-dev mailing list archives

From Jinfeng Ni <jinfengn...@gmail.com>
Subject Re: [DISCUSS] Improving Fast Schema
Date Thu, 05 Nov 2015 22:10:20 GMT
DRILL-3623 was originally about getting the schema at planning time for
Hive tables. Once Parquet becomes schema'd, the same approach could be
applied to Parquet tables.

However, there are issues with type resolution. The following are the
comments I put on the PR for DRILL-3623.

"

The original approach (skipping the execution phase for limit 0
completely) could have issues in some cases, because Calcite's rules
and Drill's execution rules differ in how types are determined.

For example, sum(int) is resolved to int in Calcite, while in Drill
execution we changed it to bigint. Another case is implicit casts,
where there are currently some small differences between Calcite and
Drill execution. That means if we skip execution for limit 0, the
types resolved by Calcite could differ from the types the query would
produce if it went through Drill execution. For a BI tool like
Tableau, the type returned from a "limit 0" query and the type from a
second query without "limit 0" could then be different.

If we want to avoid the above issues, we have to detect all of those
cases, which is painful. That's why Sudheesh and I are now more
inclined toward this new approach.

"
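To make the mismatch concrete, here is a toy model (illustration only, not Drill's or Calcite's actual code): a planner-side resolver that keeps sum(int) as int and an execution-side resolver that widens it to bigint will report different result schemas for the same query.

```python
# Toy model of the type mismatch (illustration only, not Drill's code).
# The "planner" resolves sum(INT) -> INT (Calcite's historical behavior),
# while the "executor" resolves sum(INT) -> BIGINT (Drill's behavior).

def planner_sum_type(arg_type: str) -> str:
    # Calcite-style: sum keeps the argument type
    return arg_type

def executor_sum_type(arg_type: str) -> str:
    # Drill-style: widen integer sums to BIGINT to avoid overflow
    return "BIGINT" if arg_type in ("TINYINT", "SMALLINT", "INT") else arg_type

# A "limit 0" query answered purely from the planner reports INT,
# but the same query run through execution reports BIGINT.
limit0_schema = planner_sum_type("INT")      # "INT"
executed_schema = executor_sum_type("INT")   # "BIGINT"
print(limit0_schema, executed_schema)        # INT BIGINT
```

A BI tool that prepares with "limit 0" and then runs the full query would see the two different types above.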

Until we resolve the differences in type resolution between planning
and execution, we cannot directly return the schema at planning time.
One piece of good news is that Calcite recently added fixes that allow
specifying how an aggregation's return type is resolved, which should
fix the first issue.
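Assuming the Calcite change in question is a pluggable hook for aggregate return types, the idea can be modeled as follows (the class and method names here are illustrative, not Calcite's actual API): Drill supplies its own type system so the planner resolves sum(int) the same way execution does.

```python
# Sketch of the fix: make the planner's aggregate return type pluggable,
# so Drill can tell the planner to resolve sum(INT) the same way its
# execution engine does. Names are illustrative, not Calcite's real API.

class DefaultTypeSystem:
    def derive_sum_type(self, arg_type: str) -> str:
        # Historical Calcite behavior: sum keeps the argument type
        return arg_type

class DrillTypeSystem(DefaultTypeSystem):
    def derive_sum_type(self, arg_type: str) -> str:
        # Match Drill execution: widen integer sums to BIGINT
        return "BIGINT" if arg_type in ("TINYINT", "SMALLINT", "INT") else arg_type

def plan_limit0_sum_schema(type_system, arg_type: str) -> str:
    # With the hook in place, a "limit 0" schema derived at planning
    # time agrees with what execution would return.
    return type_system.derive_sum_type(arg_type)
```

With the Drill type system plugged in, the planner-derived schema and the executed schema agree, removing the first source of divergence.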


On Thu, Nov 5, 2015 at 1:54 PM, Zelaine Fong <zfong@maprtech.com> wrote:
> I agree.  That makes total sense from a conceptual standpoint.  What's
> needed to do this?  Is the framework in place for Drill to do this?
>
> -- Zelaine
>
> On Thu, Nov 5, 2015 at 1:51 PM, Parth Chandra <parthc@apache.org> wrote:
>
>> I like the idea of making Parquet/Hive schema'd and returning the schema at
>> planning time. Front end tools assume that the backend can do a Prepare and
>> then Execute and this fits that model much better.
>>
>>
>>
>> On Thu, Nov 5, 2015 at 1:16 PM, Jacques Nadeau <jacques@dremio.com> wrote:
>>
>> > The only way we get to a few milliseconds is by doing this stuff at
>> > planning. Let's start by making Parquet schema'd and fixing our implicit
>> > cast rules. Once completed, we can return schema just through planning
>> > and completely skip over execution code (as in every other database).
>> >
>> > I'd guess that the top issue is for Parquet and Hive. If that is the
>> > case, let's just start treating them as schema'd all the way through. If
>> > people are begging for fast schema on JSON, let's take the stuff for
>> > Parquet and Hive and leverage it via direct sampling at planning time
>> > for the non-schema'd formats.
>> >
>> > --
>> > Jacques Nadeau
>> > CTO and Co-Founder, Dremio
>> >
>> > On Thu, Nov 5, 2015 at 11:32 AM, Steven Phillips <steven@dremio.com>
>> > wrote:
>> >
>> > > For (3), are you referring to the operators that extend
>> > > AbstractSingleRecordBatch? We basically only call the buildSchema()
>> > > method on blocking operators. If the operators are not blocking, we
>> > > simply process the first batch, the idea being that it should be fast
>> > > enough. Are there situations where this is not true? If we are
>> > > skipping empty batches, that could cause a delay in schema
>> > > propagation, but we can handle that case with special handling for
>> > > the first batch.
>> > >
>> > > As for (4), it's really historical. We originally didn't have fast
>> > > schema, and when it was added, only the minimal code changes necessary
>> > > to make it work were done. At the time the fast schema feature was
>> > > implemented, there was just the "setup" method of the operators, which
>> > > handled both materializing the output batch and generating the code.
>> > > It would require additional work, and potentially add code complexity,
>> > > to further separate the parts of setup that are needed for fast schema
>> > > from those which are not. And I'm not sure how much benefit we would
>> > > get from it.
>> > >
>> > > What is the motivation behind this? In other words, what sort of
>> > > delays are you currently seeing? And have you done an analysis of what
>> > > is causing the delay? I would think that code generation would cause
>> > > only a minimal delay, unless we are concerned about cutting the time
>> > > for "limit 0" queries down to just a few milliseconds.
>> > >
>> > > On Thu, Nov 5, 2015 at 9:53 AM, Sudheesh Katkam <skatkam@maprtech.com>
>> > > wrote:
>> > >
>> > > > Hey y’all,
>> > > >
>> > > > @Jacques and @Steven,
>> > > >
>> > > > I am looking at improving the fast schema path (for LIMIT 0
>> > > > queries). It seems to me that on the first call to next (the
>> > > > buildSchema call), in any operator, only two tasks need to be done:
>> > > > 1) call next exactly once on each of the incoming batches, and
>> > > > 2) set up the output container based on those incoming batches
>> > > >
>> > > > However, looking at the implementation, some record batches:
>> > > > 3) make multiple calls to incoming batches (with a comment “skip
>> > > > first batch if count is zero, as it may be an empty schema batch”),
>> > > > 4) generate code, etc.
>> > > >
>> > > > Any reason why (1) and (2) aren’t sufficient? Any optimizations that
>> > > > were considered, but not implemented?
>> > > >
>> > > > Thank you,
>> > > > Sudheesh
>> > >
>> >
>>
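As a rough illustration of the buildSchema discussion in the thread above, here is a toy pipeline (not Drill's actual operator API, and the class names are hypothetical): on the schema pass, each operator pulls only schema information from its input and derives its output schema, while heavy setup such as code generation is deferred.

```python
# Toy pipeline illustrating the fast-schema idea discussed above
# (not Drill's actual operator API; names are hypothetical). On the
# schema pass, each operator derives its output schema from its input's
# schema; heavy setup (e.g. code generation) is deferred until real
# data is processed.

class ScanOperator:
    def __init__(self, schema):
        self.schema = schema

    def build_schema(self):
        # A scan can know its schema without reading data
        # (e.g. from a Parquet footer or the Hive metastore).
        return self.schema

class ProjectOperator:
    def __init__(self, incoming, projected_columns):
        self.incoming = incoming
        self.projected = projected_columns
        self.codegen_done = False  # deferred: not needed for the schema pass

    def build_schema(self):
        upstream = self.incoming.build_schema()  # exactly one call upstream
        return {c: upstream[c] for c in self.projected}

scan = ScanOperator({"a": "INT", "b": "VARCHAR", "c": "FLOAT8"})
project = ProjectOperator(scan, ["a", "b"])
schema = project.build_schema()
print(schema)                # {'a': 'INT', 'b': 'VARCHAR'}
print(project.codegen_done)  # False: no code generation for "limit 0"
```

In this model, a "limit 0" query needs only the build_schema pass, which is why deferring code generation and other setup can shrink its latency toward a few milliseconds.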
