drill-issues mailing list archives

From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5546) Schema change problems caused by empty batch
Date Fri, 02 Jun 2017 15:59:04 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034946#comment-16034946 ]

Paul Rogers commented on DRILL-5546:
------------------------------------

In general, I agree with the proposal. The only suggestion might be to change the emphasis.

In looking carefully at the readers, we see that an empty result set (empty batch) is a natural
outcome of reading. Some files just happen to be empty. If filters are pushed down, then some
files just happen to have no matching rows.

Readers produce two distinct kinds of empty result sets:

* *Empty result set*: The reader found no data, but was able to find a schema. (Example: Parquet
with a filter push-down or a JDBC query that returns no results.)
* *Null result set*: The reader found no data *and* no schema. (Example: empty CSV or JSON
file.)
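
As a rough illustration of the distinction above, here is a minimal Java sketch. The names {{ResultKind}} and {{ReaderResult}} are hypothetical and are not Drill classes; a {{null}} schema is simply an assumed way to mark "no schema found".

```java
import java.util.List;

/** Hypothetical three-way classification of what a reader returned. */
enum ResultKind { NULL_RESULT, EMPTY_RESULT, DATA }

/** Hypothetical stand-in for a reader's output (not a Drill class). */
final class ReaderResult {
    final List<String> schema;   // column names; null means no schema was found
    final List<Object[]> rows;   // never null, possibly empty

    ReaderResult(List<String> schema, List<Object[]> rows) {
        this.schema = schema;
        this.rows = rows;
    }

    /** Classifies the result: data, schema without rows, or neither. */
    ResultKind kind() {
        if (!rows.isEmpty()) {
            return ResultKind.DATA;
        }
        return schema == null ? ResultKind.NULL_RESULT : ResultKind.EMPTY_RESULT;
    }
}
```

Under this assumption, an empty CSV file maps to {{NULL_RESULT}}, while a Parquet file whose rows are all filtered out maps to {{EMPTY_RESULT}}.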


Note that filters also can produce an empty result set (if no rows match).

The Drill iterator protocol should be able to handle both kinds. It is perhaps a bit naive
to expect that every operator has both a schema and a data set.

All operators should be able to identify, and handle, both null and empty result sets.

For the scanner, if one reader returns a null result set, just skip it and move to the next
reader until a schema is found. If no reader has a non-null result set, then that branch of
the query has no data (and no schema). That result should bubble up, with each operator handling
the case depending on semantics. For example, a filter ignores the null result set. A UNION
ALL skips that result set when assembling the result. A join handles the case depending on
the side of the join and INNER/OUTER semantics, and so on.
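
The scanner and UNION ALL cases above might be sketched as follows. This is only an illustration under assumptions: {{ScanResult}} and {{NullAwareOperators}} are hypothetical names, not Drill classes; a "null result set" is modeled as a {{null}} schema at the reader and as an empty {{Optional}} between operators; and the scan is simplified to return only the first result that carries a schema.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical reader output; a null schema marks a "null result set". */
final class ScanResult {
    final List<String> schema;
    final List<String> rows;   // rows kept as plain strings for brevity

    ScanResult(List<String> schema, List<String> rows) {
        this.schema = schema;
        this.rows = rows;
    }

    boolean isNullResult() {
        return schema == null;
    }
}

/** Hypothetical operators showing how a null result set bubbles up. */
final class NullAwareOperators {
    /** Skips readers with a null result; empty if no reader found a schema. */
    static Optional<ScanResult> scan(List<ScanResult> readerResults) {
        for (ScanResult r : readerResults) {
            if (!r.isNullResult()) {
                return Optional.of(r);
            }
        }
        return Optional.empty();   // this branch has no data and no schema
    }

    /** UNION ALL: ignores null inputs; null itself only if all inputs are null. */
    static Optional<ScanResult> unionAll(List<Optional<ScanResult>> inputs) {
        List<String> schema = null;
        List<String> rows = new ArrayList<>();
        for (Optional<ScanResult> in : inputs) {
            if (!in.isPresent()) {
                continue;          // skip the null result set
            }
            if (schema == null) {
                schema = in.get().schema;
            }
            rows.addAll(in.get().rows);
        }
        if (schema == null) {
            return Optional.empty();
        }
        return Optional.of(new ScanResult(schema, rows));
    }
}
```

A join would need more cases than this (which side is null, INNER versus OUTER), so it is left out of the sketch.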

To support the schema "fast track", operators should return an empty batch, with just schema,
on the first call to {{next()}}. So, the scanner should return an empty batch (with schema)
if a reader produces one; that is, skip null batches, but return an empty batch.
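
One way to picture the scanner behavior described here is the sketch below. It is not Drill's actual iterator protocol: {{Batch}} and {{FastSchemaScan}} are hypothetical names, {{next()}} returning {{null}} at end of data is an assumption made for brevity, and a {{null}} schema again stands in for a null batch.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/** Hypothetical batch: a null schema marks a "null batch". */
final class Batch {
    final List<String> schema;
    final List<Object[]> rows;

    Batch(List<String> schema, List<Object[]> rows) {
        this.schema = schema;
        this.rows = rows;
    }
}

/** Hypothetical scanner whose first next() call returns a schema-only batch. */
final class FastSchemaScan {
    private final Deque<Batch> pending = new ArrayDeque<>();
    private boolean schemaSent = false;

    FastSchemaScan(List<Batch> readerBatches) {
        for (Batch b : readerBatches) {
            if (b.schema != null) {   // skip null batches entirely
                pending.add(b);
            }
        }
    }

    /** Returns the next batch, or null once the input is exhausted. */
    Batch next() {
        if (pending.isEmpty()) {
            return null;              // no data (and, if never sent, no schema)
        }
        if (!schemaSent) {
            schemaSent = true;
            Batch first = pending.peek();
            if (first.rows.isEmpty()) {
                pending.poll();       // already schema-only, so consume it
            }
            return new Batch(first.schema, Collections.<Object[]>emptyList());
        }
        return pending.poll();
    }
}
```

With inputs of a null batch followed by a data batch, the first call yields the schema with zero rows, and the second call yields the data.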

Again, each operator should, on the first (preferably empty) batch, assemble output schema
according to the rules for that operator.

Do we have a spec and/or JIRA that describes the design behind the "fast schema" feature added
shortly after 1.0? We should consult that to ensure the empty batch handling here is consistent
with that design.

> Schema change problems caused by empty batch
> --------------------------------------------
>
>                 Key: DRILL-5546
>                 URL: https://issues.apache.org/jira/browse/DRILL-5546
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> There have been a few JIRAs opened related to schema change failures caused by empty batches.
> This JIRA is opened as an umbrella for all those related JIRAs (such as DRILL-4686, DRILL-4734,
> DRILL-4476, DRILL-4255, etc.).
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
