drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5546) Schema change problems caused by empty batch
Date Sun, 04 Jun 2017 01:33:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16036141#comment-16036141
] 

Paul Rogers commented on DRILL-5546:
------------------------------------

I believe we are saying basically the same thing. To be certain, [watch out, we're gonna do
some theory|https://store.xkcd.com/products/try-science].

h3. Basics

We'll need some terminology, defined in the usual way:

* (a, b, c) is a set that contains a, b, c
* \{a:b} is a map from a to b
* \[a, b, c] is an ordered list of a, b, c, where each element has an index i = 0, 1, 2...
* empty = () or {} or \[] is the empty collection (of the proper type)
* null = SQL null: we don't know what the value is
* ∧ = Java, C, etc. null: we know the value, and it is nothing

Drill is, at its core, relational. A relation R can be defined as:

{{R = (S, T)}}

Where:

* S is the schema
* T is a set of tuples (t ~1~, t ~2~, t ~3~) (AKA a table)

{{S = (C, N)}}

Where:

* C is the list of column schemas:

{{C = \[ c ~0~, c ~1~, c ~2~, ...]}}
{{c = (name, type)}}

And:

* N is a map from name to column index:

{{N = \{name : i\}&nbsp;}} where i the index of column c ∈ C

Drill defines the idea of _compatible_ schemas. Two schemas are compatible if we redefine
the schema as a set of columns:

{{S’ = (C ~0~, C ~1~, …)}}

Two schemas are compatible iff the column sets are identical (same name and type for each
column). This is a bit more forgiving than the traditional relational model which requires
that the ordered list of column schemas be identical. Let's assume this rule as we discuss
schema below.

We'll also need the idea of _cardinality_:

{{\|S| = n}}

Says that the cardinality (number of items) in S is _n_. Later, well just use _n_ to mean
a schema (or relation or whatever) that has n items.

h3. Relations and Multi-Relations

A relation can be:

{{R : ∧}} (programming null) -- the relation simply does not exist.
{{&nbsp;&nbsp;| (0,0)}} -- the trivial relation of no columns and (by definition),
no rows.
{{&nbsp;&nbsp;| (s,0)}} -- a relation with some schema s, |s| ≠ 0, and no rows
{{&nbsp;&nbsp;| (s,n)}} -- a "normal" relation with schema and n tuples (rows) of
data, \|s| ≠ 0, n ≠ 0

It is helpful to remember some basic relational algebra:

{{R(s,0) ⋃ R(s,n) = R(s,n)}}
{{R(s,n) ⋃ R(s,m) = R(s,n+m)}}

Drill works not just with relations R, but also "multi-relations" M. That is, a multi-relation
(the entire result set from a single query) can defined (using semi-BNF) as:

{{M : ∧}} -- undefined result set)
{{&nbsp;&nbsp;| R(0,0)}} -- trivial result set
{{&nbsp;&nbsp;| R(c,0)}}  - empty result set, \|c| ≠ 0
{{&nbsp;&nbsp;| R(c,n)}} -- n  ≠ 0, normal, single result set
{{&nbsp;&nbsp;| R ~1~(s ~1~,n), R ~1~(s ~2~,m), ...}} multi-relation if s ~i~ ≠
s ~j~

Normally when we say multi-relation, we mean the last case: two or more relations with distinct
schemas. The condition above says that to adjacent relations R ~i~ and R ~j~ must have distinct
schema (or by the rules above, two relations with the same schema just collapse into a single
relation with that schema. A schema can repeat, but a different schema must occur between
repetitions.)

h3. Multi-Relations in Drill

Now let's get to Drill. Drill uses the term "schema change" to mean the transition from s
~i~ to s ~j~, s ~i~ ≠ s ~j~.

The essence of the proposal here, as I understand it, is to update the implementation to fully
support the first three definitions of a multi-relation (undefined, trivial and empty), assuming
that Drill already supports the other two definitions (single relation and multi-relation.)

In Drill, relations are broken into batches, which are, themselves, just relations. Thus a
batch, B, can be:

{{B : ∧}} -- undefined result set)
{{&nbsp;&nbsp;| R(0,0)}} -- trivial batch
{{&nbsp;&nbsp;| R(c,0)}} -- empty batch set, |c| ≠ 0
{{&nbsp;&nbsp;| R}} -- normal batch

And the whole result set D (for Drill) is an ordered list of batches:

{{D = \[B ~1~, B ~2~, ..., B ~n~]}}

Where

{{B ~i~ = (s ~i~, t ~i~)}}

As noted above, if adjacent batches have the same schema, then they are just sub-relations
within a single larger (logical) relation. But, if the schemas differ, then the adjacent batches
are the last and first of two distinct relations within a larger (logical) multi-relation.
Said another way:

{{D = R ~1~, R ~2~}}
{{R ~i~ = B ~i,1~, B ~i,2~, ...}}

To clarify, let’s visualize the schema changes as ∆~i~:

{{D = \[B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆~1~, B ~i~(s ~1~,t), B ~i+1~(s ~1~,t), …]}}

The above sequence can describe any series of batches. As I understand the proposal, we want
to put some constraints on the sequence.

h3. Results of a Drill Query

Let’s start by defining how should present the multi-relation to the client. The client
also receives batches, but, following the rules in the proposal, we want to constrain the
output:

{{D ~e~ : B(0,0)}} -- Null result
{{&nbsp;&nbsp;| B(s,0)}} -- Empty result
{{&nbsp;&nbsp;| B(s,n)\+}} -- “Classic” O/JDBC result
{{&nbsp;&nbsp;| (B(s,n)\+)\+}} -- Multi-relation result

(Here, the parenthesis take on their regex/BNF meaning of grouping.)

That is the output of a Drill query is:

* A single trivial batch (zero columns, zero rows)
* A single empty batch (schema, but no data)
* One or more non-empty batches

For the non-empty case, we end up with two cases:

* A single relation (O/JDBC compatible)
* A multi-relation

h3. Drill Internals

Internally, we might want to allow any pattern of batches (include empty or null); Drill operators
do the necessary conversion:

{{D ~i~ = \[B\+,  (∆, B\+)\+]}}

We want to remove empty batches where possible to convert the internal multi-relation to the
external one. First, let’s define an empty batch:

{{empty-batch : ∧}} -- undefined result set
{{&nbsp;&nbsp;| R(0,0)}} -- trivial result set
{{&nbsp;&nbsp;| R(c,0)}} -- empty result set, |c| ≠ 0

We can define some harmless extensions to the normal rules for unions of relations:

{{∧ ⋃ R = R}}, for any relation R
{{R(0,0) ⋃ R = R}}, for {{R ≠ ∧}}
{{R(c,0) ⋃ R = R}}, for {{R  ≠ ∧, R(0,0)}}

These rules define a “hierarchy of emptiness” to deal with adjacent empty batches. The
result of the above rules are that we can remove empty batches from a Drill result set as
long as there are adjacent non-empty batches. But, suppose the (internal to Drill) result
set has only one empty batch, B ~1~, then the external (visible to client) result set , D
~e~, is:

{{D ~e~ = R(0,0)}} if {{B ~1~ = ∧}} or {{B ~1~ = R(0,0)}}
{{D ~e~ = R(c,0)}} if {{B ~1~ = R(c,0)}}

Otherwise, according to what we said earlier, since the internal result set has at least one
non-empty batch, then the external result set will be non-empty. This is true because we can
(indeed, should) remove empt batches internal to Drill using the rules noted above and explained
in this proposal.

h3. The Drill Iterator Protocol

Finally, we can put all this together. The proposal focuses on Drill’s Volcano-based iterator
protocol. The iterator protocol labels batches with one of three iterator outcomes, as defined
by an enum E:

{{e(s,t) ∈ E}}
{{E : NONE(∧)}}
{{&nbsp;&nbsp;| OK_NEW_SCHEMA(s,t)}}
{{&nbsp;&nbsp;| OK(s,t)}}

That is, each iterator outcome labels a batch and so carries a schema and data set. NONE is
a special case as it carries an undefined batch. This definition lets us map the iterator
protocol directly to a slightly modified version of the multi-relation sequence shown earlier:

{{D = \[∆ ~0~, B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆ ~1~, B ~i~(s ~1~,t), B ~i+1~(s ~1~,t),
… ∧]}}
{{OK_NEW_SCHEMA = \[∆, B]}}
{{OK = \[B]}}
{{NONE = ∧}}

Now we can define the iterator protocol:

{{protocol : (OK_NEW_SCEMA OK\*)\* NONE}}

The “fast NONE” path is simply an empty sequence of {{(OK_NEW_SCHEMA OK\*)\*}} terms.

h3. Conclusion

The proposal for this ticket says that operators should handle the empty batches. The above
suggests *how* the operators should handle them.

This note suggests that we look at empty batches (in all three forms) as a natural part of
Drill's multi-relation model. Rather than treating them as special cases, empty batches should
be defined as a normal part of Drill's internal protocol.

A separate proposal (part of DRILL-5211) will suggest that the scanner use the above rules
to compress out empty batches that emerge from a variety of special cases with readers.

> Schema change problems caused by empty batch
> --------------------------------------------
>
>                 Key: DRILL-5546
>                 URL: https://issues.apache.org/jira/browse/DRILL-5546
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> There have been a few JIRAs opened related to schema change failure caused by empty batch.
This JIRA is opened as an umbrella for all those related JIRAS ( such as DRILL-4686, DRILL-4734,
DRILL4476, DRILL-4255, etc).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message