[ https://issues.apache.org/jira/browse/DRILL5546?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=16036141#comment16036141
]
Paul Rogers edited comment on DRILL5546 at 6/5/17 7:31 AM:

I believe we are saying basically the same thing. To be certain, [watch out, we're gonna try
some theoryhttps://store.xkcd.com/products/tryscience].
h3. Basics
We'll need some terminology, defined in the usual way:
* (a, b, c) is a set that contains a, b, c
* \{a:b} is a map from a to b
* \[a, b, c] is an ordered list of a, b, c, where each element has an index i = 0, 1, 2...
* empty = () or {} or \[] is the empty collection (of the proper type)
* null = SQL null: we don't know what the value is
* ∧ = Java, C, etc. null: we know the value, and it is nothing
Drill is, at its core, relational. A relation R can be defined as:
{{R = (S, T)}}
Where:
* S is the schema
* T is a set of tuples (t ~1~, t ~2~, t ~3~) (AKA a table)
{{S = (C, N)}}
Where:
* C is the list of column schemas:
{{C = \[ c ~0~, c ~1~, c ~2~, ...]}}
{{c = (name, type)}}
And:
* N is a map from name to column index:
{{N = \{name : i\} }} where i the index of column c ∈ C
Drill defines the idea of _compatible_ schemas. Two schemas are compatible if we redefine
the schema as a set of columns:
{{S’ = (C ~0~, C ~1~, …)}}
Two schemas are compatible iff the column sets are identical (same name and type for each
column). This is a bit more forgiving than the traditional relational model which requires
that the ordered list of column schemas be identical. Let's assume this rule as we discuss
schema below.
We'll also need the idea of _cardinality_:
{{\S = n}}
Says that the cardinality (number of items) in S is _n_. Later, well just use _n_ to mean
a schema (or relation or whatever) that has n items.
h3. Relations and MultiRelations
A relation can be:
{{R : ∧}} (programming null)  the relation simply does not exist.
{{  (0,0)}}  the trivial relation of no columns and (by definition),
no rows.
{{  (s,0)}}  a relation with some schema s, s ≠ 0, and no rows
{{  (s,n)}}  a "normal" relation with schema and n tuples (rows) of
data, \s ≠ 0, n ≠ 0
It is helpful to remember some basic relational algebra:
{{R(s,0) ⋃ R(s,n) = R(s,n)}}
{{R(s,n) ⋃ R(s,m) = R(s,n+m)}}
Drill works not just with relations R, but also "multirelations" M. That is, a multirelation
(the entire result set from a single query) can defined (using semiBNF) as:
{{M : ∧}}  undefined result set
{{  R(0,0)}}  trivial result set
{{  R(s,0)}}  empty result set, \s ≠ 0
{{  R(s,n)}}  normal, single result set, n ≠ 0
{{  R ~1~(s ~1~,n), R ~1~(s ~2~,m), ...}}  multirelation if s ~i~ ≠
s ~j~
Normally when we say multirelation, we mean the last case: two or more relations with distinct
schemas. The condition above says that to adjacent relations R ~i~ and R ~j~ must have distinct
schema (or by the rules above, two relations with the same schema just collapse into a single
relation with that schema. A schema can repeat, but a different schema must occur between
repetitions.)
h3. MultiRelations in Drill
Now let's get to Drill. Drill uses the term "schema change" to mean the transition from s
~i~ to s ~j~, s ~i~ ≠ s ~j~.
The essence of the proposal here, as I understand it, is to update the implementation to fully
support the first three definitions of a multirelation (undefined, trivial and empty), assuming
that Drill already supports the other two definitions (single relation and multirelation.)
In Drill, relations are broken into batches, which are, themselves, just relations. Thus a
batch, B, can be:
{{B : ∧}}  undefined result set)
{{  R(0,0)}}  trivial batch
{{  R(s,0)}}  empty batch set, s ≠ 0
{{  R(s,n)}}  normal batch, s ≠ 0, n ≠ 0
And the whole result set D (for Drill) is an ordered list of batches:
{{D = \[B ~1~, B ~2~, ..., B ~n~]}}
Where
{{B ~i~ = (s ~i~, t ~i~)}}
As noted above, if adjacent batches have the same schema, then they are just subrelations
within a single larger (logical) relation. But, if the schemas differ, then the adjacent batches
are the last and first of two distinct relations within a larger (logical) multirelation.
Said another way:
{{D = R ~1~, R ~2~}}
{{R ~i~ = B ~i,1~, B ~i,2~, ...}}
To clarify, let’s visualize the schema changes as ∆~i~:
{{D = \[B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆~1~, B ~i~(s ~1~,t), B ~i+1~(s ~1~,t), …]}}
The above sequence can describe any series of batches. As I understand the proposal, we want
to put some constraints on the sequence.
h3. Results of a Drill Query
Let’s start by defining how should present the multirelation to the client. The client
also receives batches, but, following the rules in the proposal, we want to constrain the
output:
{{D ~e~ : B(0,0)}}  Null result
{{  B(s,0)}}  Empty result
{{  B(s,n)\+}}  “Classic” O/JDBC result
{{  (B(s ~i~,n)\+)\+}}  Multirelation result
(Here, the parenthesis take on their regex/BNF meaning of grouping.)
That is the output of a Drill query is:
* A single trivial batch (zero columns, zero rows)
* A single empty batch (schema, but no data)
* One or more nonempty batches
For the nonempty case, we end up with two cases:
* A single relation (O/JDBC compatible)
* A multirelation
h3. Drill Internals
Internally, we might want to allow any pattern of batches (include empty or null); Drill operators
do the necessary conversion:
{{D ~i~ = \[B\+, (∆, B\+)\+]}}
We want to remove empty batches where possible to convert the internal multirelation to the
external one. First, let’s define an empty batch:
{{emptybatch : ∧}}  undefined result set
{{  R(0,0)}}  trivial result set
{{  R(s,0)}}  empty result set, s ≠ 0
We can define some harmless extensions to the normal rules for unions of relations:
{{∧ ⋃ R = R}}, for any relation R
{{R(0,0) ⋃ R = R}}, for {{R ≠ ∧}}
{{R(c,0) ⋃ R = R}}, for {{R ≠ ∧, R(0,0)}}
These rules define a “hierarchy of emptiness” to deal with adjacent empty batches. The
result of the above rules are that we can remove empty batches from a Drill result set as
long as there are adjacent nonempty batches. But, suppose the (internal to Drill) result
set has only one empty batch, B ~1~, then the external (visible to client) result set , D
~e~, is:
{{D ~e~ = R(0,0)}} if {{B ~1~ = ∧}} or {{B ~1~ = R(0,0)}}
{{D ~e~ = R(c,0)}} if {{B ~1~ = R(c,0)}}
Otherwise, according to what we said earlier, since the internal result set has at least one
nonempty batch, then the external result set will be nonempty. This is true because we can
(indeed, should) remove empt batches internal to Drill using the rules noted above and explained
in this proposal.
h3. The Drill Iterator Protocol
Finally, we can put all this together. The proposal focuses on Drill’s Volcanobased iterator
protocol. The iterator protocol labels batches with one of three iterator outcomes, as defined
by an enum E:
{{e(s,t) ∈ E}}
{{E : NONE(∧)}}
{{  OK_NEW_SCHEMA(s,t)}}
{{  OK(s,t)}}
That is, each iterator outcome labels a batch and so carries a schema and data set. NONE is
a special case as it carries an undefined batch. This definition lets us map the iterator
protocol directly to a slightly modified version of the multirelation sequence shown earlier:
{{D = \[∆ ~0~, B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆ ~1~, B ~i~(s ~1~,t), B ~i+1~(s ~1~,t),
… ∧]}}
{{OK_NEW_SCHEMA = \[∆, B]}}
{{OK = \[B]}}
{{NONE = ∧}}
Now we can define the iterator protocol:
{{protocol : (OK_NEW_SCHEMA OK\*)\* NONE}}
The “fast NONE” path is simply an empty sequence of {{(OK_NEW_SCHEMA OK\*)\*}} terms.
h3. Conclusion
The proposal for this ticket says that operators should handle the empty batches. The above
suggests *why* the operators should handle them. The best solution for *how* to handle them
is so modify the "inner next" code that calls a child operator; have that code "compress out"
empty batches if followed by a nonempty batch.
This note suggests that we look at empty batches (in all three forms) as a natural part of
Drill's multirelation model. Rather than treating them as special cases, empty batches should
be defined as a normal part of Drill's internal protocol.
A separate proposal (part of DRILL5211) will suggest that the scanner use the above rules
to compress out empty batches that emerge from a variety of special cases with readers.
was (Author: paulrogers):
I believe we are saying basically the same thing. To be certain, [watch out, we're gonna try
some theoryhttps://store.xkcd.com/products/tryscience].
h3. Basics
We'll need some terminology, defined in the usual way:
* (a, b, c) is a set that contains a, b, c
* \{a:b} is a map from a to b
* \[a, b, c] is an ordered list of a, b, c, where each element has an index i = 0, 1, 2...
* empty = () or {} or \[] is the empty collection (of the proper type)
* null = SQL null: we don't know what the value is
* ∧ = Java, C, etc. null: we know the value, and it is nothing
Drill is, at its core, relational. A relation R can be defined as:
{{R = (S, T)}}
Where:
* S is the schema
* T is a set of tuples (t ~1~, t ~2~, t ~3~) (AKA a table)
{{S = (C, N)}}
Where:
* C is the list of column schemas:
{{C = \[ c ~0~, c ~1~, c ~2~, ...]}}
{{c = (name, type)}}
And:
* N is a map from name to column index:
{{N = \{name : i\} }} where i the index of column c ∈ C
Drill defines the idea of _compatible_ schemas. Two schemas are compatible if we redefine
the schema as a set of columns:
{{S’ = (C ~0~, C ~1~, …)}}
Two schemas are compatible iff the column sets are identical (same name and type for each
column). This is a bit more forgiving than the traditional relational model which requires
that the ordered list of column schemas be identical. Let's assume this rule as we discuss
schema below.
We'll also need the idea of _cardinality_:
{{\S = n}}
Says that the cardinality (number of items) in S is _n_. Later, well just use _n_ to mean
a schema (or relation or whatever) that has n items.
h3. Relations and MultiRelations
A relation can be:
{{R : ∧}} (programming null)  the relation simply does not exist.
{{  (0,0)}}  the trivial relation of no columns and (by definition),
no rows.
{{  (s,0)}}  a relation with some schema s, s ≠ 0, and no rows
{{  (s,n)}}  a "normal" relation with schema and n tuples (rows) of
data, \s ≠ 0, n ≠ 0
It is helpful to remember some basic relational algebra:
{{R(s,0) ⋃ R(s,n) = R(s,n)}}
{{R(s,n) ⋃ R(s,m) = R(s,n+m)}}
Drill works not just with relations R, but also "multirelations" M. That is, a multirelation
(the entire result set from a single query) can defined (using semiBNF) as:
{{M : ∧}}  undefined result set
{{  R(0,0)}}  trivial result set
{{  R(s,0)}}  empty result set, \s ≠ 0
{{  R(s,n)}}  n ≠ 0, normal, single result set
{{  R ~1~(s ~1~,n), R ~1~(s ~2~,m), ...}} multirelation if s ~i~ ≠
s ~j~
Normally when we say multirelation, we mean the last case: two or more relations with distinct
schemas. The condition above says that to adjacent relations R ~i~ and R ~j~ must have distinct
schema (or by the rules above, two relations with the same schema just collapse into a single
relation with that schema. A schema can repeat, but a different schema must occur between
repetitions.)
h3. MultiRelations in Drill
Now let's get to Drill. Drill uses the term "schema change" to mean the transition from s
~i~ to s ~j~, s ~i~ ≠ s ~j~.
The essence of the proposal here, as I understand it, is to update the implementation to fully
support the first three definitions of a multirelation (undefined, trivial and empty), assuming
that Drill already supports the other two definitions (single relation and multirelation.)
In Drill, relations are broken into batches, which are, themselves, just relations. Thus a
batch, B, can be:
{{B : ∧}}  undefined result set)
{{  R(0,0)}}  trivial batch
{{  R(s,0)}}  empty batch set, s ≠ 0
{{  R(s,n)}}  normal batch, s ≠ 0, n ≠ 0
And the whole result set D (for Drill) is an ordered list of batches:
{{D = \[B ~1~, B ~2~, ..., B ~n~]}}
Where
{{B ~i~ = (s ~i~, t ~i~)}}
As noted above, if adjacent batches have the same schema, then they are just subrelations
within a single larger (logical) relation. But, if the schemas differ, then the adjacent batches
are the last and first of two distinct relations within a larger (logical) multirelation.
Said another way:
{{D = R ~1~, R ~2~}}
{{R ~i~ = B ~i,1~, B ~i,2~, ...}}
To clarify, let’s visualize the schema changes as ∆~i~:
{{D = \[B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆~1~, B ~i~(s ~1~,t), B ~i+1~(s ~1~,t), …]}}
The above sequence can describe any series of batches. As I understand the proposal, we want
to put some constraints on the sequence.
h3. Results of a Drill Query
Let’s start by defining how should present the multirelation to the client. The client
also receives batches, but, following the rules in the proposal, we want to constrain the
output:
{{D ~e~ : B(0,0)}}  Null result
{{  B(s,0)}}  Empty result
{{  B(s,n)\+}}  “Classic” O/JDBC result
{{  (B(s ~i~,n)\+)\+}}  Multirelation result
(Here, the parenthesis take on their regex/BNF meaning of grouping.)
That is the output of a Drill query is:
* A single trivial batch (zero columns, zero rows)
* A single empty batch (schema, but no data)
* One or more nonempty batches
For the nonempty case, we end up with two cases:
* A single relation (O/JDBC compatible)
* A multirelation
h3. Drill Internals
Internally, we might want to allow any pattern of batches (include empty or null); Drill operators
do the necessary conversion:
{{D ~i~ = \[B\+, (∆, B\+)\+]}}
We want to remove empty batches where possible to convert the internal multirelation to the
external one. First, let’s define an empty batch:
{{emptybatch : ∧}}  undefined result set
{{  R(0,0)}}  trivial result set
{{  R(s,0)}}  empty result set, s ≠ 0
We can define some harmless extensions to the normal rules for unions of relations:
{{∧ ⋃ R = R}}, for any relation R
{{R(0,0) ⋃ R = R}}, for {{R ≠ ∧}}
{{R(c,0) ⋃ R = R}}, for {{R ≠ ∧, R(0,0)}}
These rules define a “hierarchy of emptiness” to deal with adjacent empty batches. The
result of the above rules are that we can remove empty batches from a Drill result set as
long as there are adjacent nonempty batches. But, suppose the (internal to Drill) result
set has only one empty batch, B ~1~, then the external (visible to client) result set , D
~e~, is:
{{D ~e~ = R(0,0)}} if {{B ~1~ = ∧}} or {{B ~1~ = R(0,0)}}
{{D ~e~ = R(c,0)}} if {{B ~1~ = R(c,0)}}
Otherwise, according to what we said earlier, since the internal result set has at least one
nonempty batch, then the external result set will be nonempty. This is true because we can
(indeed, should) remove empt batches internal to Drill using the rules noted above and explained
in this proposal.
h3. The Drill Iterator Protocol
Finally, we can put all this together. The proposal focuses on Drill’s Volcanobased iterator
protocol. The iterator protocol labels batches with one of three iterator outcomes, as defined
by an enum E:
{{e(s,t) ∈ E}}
{{E : NONE(∧)}}
{{  OK_NEW_SCHEMA(s,t)}}
{{  OK(s,t)}}
That is, each iterator outcome labels a batch and so carries a schema and data set. NONE is
a special case as it carries an undefined batch. This definition lets us map the iterator
protocol directly to a slightly modified version of the multirelation sequence shown earlier:
{{D = \[∆ ~0~, B ~0~(s ~0~,t), B ~1~(s ~0~,t), … ∆ ~1~, B ~i~(s ~1~,t), B ~i+1~(s ~1~,t),
… ∧]}}
{{OK_NEW_SCHEMA = \[∆, B]}}
{{OK = \[B]}}
{{NONE = ∧}}
Now we can define the iterator protocol:
{{protocol : (OK_NEW_SCHEMA OK\*)\* NONE}}
The “fast NONE” path is simply an empty sequence of {{(OK_NEW_SCHEMA OK\*)\*}} terms.
h3. Conclusion
The proposal for this ticket says that operators should handle the empty batches. The above
suggests *why* the operators should handle them. The best solution for *how* to handle them
is so modify the "inner next" code that calls a child operator; have that code "compress out"
empty batches if followed by a nonempty batch.
This note suggests that we look at empty batches (in all three forms) as a natural part of
Drill's multirelation model. Rather than treating them as special cases, empty batches should
be defined as a normal part of Drill's internal protocol.
A separate proposal (part of DRILL5211) will suggest that the scanner use the above rules
to compress out empty batches that emerge from a variety of special cases with readers.
> Schema change problems caused by empty batch
> 
>
> Key: DRILL5546
> URL: https://issues.apache.org/jira/browse/DRILL5546
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Jinfeng Ni
> Assignee: Jinfeng Ni
>
> There have been a few JIRAs opened related to schema change failure caused by empty batch.
This JIRA is opened as an umbrella for all those related JIRAS ( such as DRILL4686, DRILL4734,
DRILL4476, DRILL4255, etc).
>

This message was sent by Atlassian JIRA
(v6.3.15#6346)
