Mailing-List: contact issues-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
Date: Fri, 2 Jun 2017 15:59:04 +0000 (UTC)
From: "Paul Rogers (JIRA)" <jira@apache.org>
To: issues@drill.apache.org
Message-ID: <JIRA.13075263.1495831286000.353144.1496419144185@Atlassian.JIRA>
In-Reply-To: <JIRA.13075263.1495831286000@Atlassian.JIRA>
References: <JIRA.13075263.1495831286000@Atlassian.JIRA> <JIRA.13075263.1495831286605@jira-lw-us.apache.org>
Subject: [jira] [Commented] (DRILL-5546) Schema change problems caused by
 empty batch
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Fri, 02 Jun 2017 15:59:09 -0000


    [ https://issues.apache.org/jira/browse/DRILL-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034946#comment-16034946 ] 

Paul Rogers commented on DRILL-5546:
------------------------------------

In general, I agree with the proposal. The only suggestion might be to change the emphasis.

In looking carefully at the readers, we see that an empty result set (empty batch) is a natural outcome of reading. Some files just happen to be empty. If filters are pushed down, then some files just happen to have no matching rows.

Readers produce two distinct kinds of result sets that contain no rows:

* *Empty result set*: The reader found no data, but was able to find a schema. (Example: Parquet with a filter push-down or a JDBC query that returns no results.)
* *Null result set*: The reader found no data *and* no schema. (Example: empty CSV or JSON file.)
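
To make the distinction concrete, here is a minimal sketch of how the two cases might be told apart. The names ({{ResultKind}}, {{ReaderResult}}) are hypothetical and are not Drill classes; the sketch assumes a schema can be modelled as a list of column names, with {{null}} meaning "no schema found".

```java
import java.util.List;

// Hypothetical types for illustration only; not part of Drill.
enum ResultKind { NULL_RESULT, EMPTY_RESULT, DATA }

final class ReaderResult {
    final List<String> schema; // column names; null means the reader found no schema
    final int rowCount;

    ReaderResult(List<String> schema, int rowCount) {
        if (schema == null && rowCount > 0) {
            // Rows without a schema make no sense in this model.
            throw new IllegalArgumentException("rows require a schema");
        }
        this.schema = schema;
        this.rowCount = rowCount;
    }

    ResultKind kind() {
        if (schema == null) {
            return ResultKind.NULL_RESULT;   // no data and no schema (e.g. empty CSV)
        }
        if (rowCount == 0) {
            return ResultKind.EMPTY_RESULT;  // schema known, but no rows matched
        }
        return ResultKind.DATA;
    }
}
```

The point of the sketch is only that "no rows" and "no schema" are separate facts, so an operator has three cases to consider rather than two.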


Note that filters also can produce an empty result set (if no rows match).

The Drill iterator protocol should be able to handle both kinds. It is perhaps a bit naive to expect that every operator has both a schema and a data set.

All operators should be able to identify, and handle, both null and empty result sets.

For the scanner, if one reader returns a null result set, just skip it and move to the next reader until a schema is found. If no reader has a non-null result set, then that branch of the query has no data (and no schema). That result should bubble up, with each operator handling the case depending on semantics. For example, a filter ignores the null result set. A UNION ALL skips that result set when assembling the result. A join handles the case depending on the side of the join and INNER/OUTER semantics, and so on.
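
As a rough illustration of the rules above, the sketch below models a scan, a filter and a two-input UNION ALL that each tolerate a null result set. Everything here is hypothetical ({{NullAwareOps}}, {{Batch}}, {{NULL_BATCH}}) and deliberately ignores Drill's real operator interfaces. It also assumes that all non-null inputs share one schema, which sidesteps the schema-change question entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model for illustration only; not Drill's operator API.
final class NullAwareOps {
    static final class Batch {
        final List<String> schema;      // null means "no schema": a null result set
        final List<List<Object>> rows;  // may be empty even when a schema is present

        Batch(List<String> schema, List<List<Object>> rows) {
            this.schema = schema;
            this.rows = rows;
        }

        boolean isNull() { return schema == null; }
    }

    static final Batch NULL_BATCH = new Batch(null, new ArrayList<List<Object>>());

    // Scanner: skip readers that produced a null result set. If every reader
    // is null, the whole branch is null and that fact bubbles up.
    static Batch scan(List<Batch> readerOutputs) {
        Batch result = NULL_BATCH;
        for (Batch b : readerOutputs) {
            result = unionAll(result, b);
        }
        return result;
    }

    // Filter: a null result set passes through untouched. A non-null input
    // keeps its schema even when no rows match (an empty result set).
    static Batch filter(Batch in, Predicate<List<Object>> keep) {
        if (in.isNull()) {
            return in;
        }
        List<List<Object>> kept = new ArrayList<>();
        for (List<Object> row : in.rows) {
            if (keep.test(row)) {
                kept.add(row);
            }
        }
        return new Batch(in.schema, kept);
    }

    // UNION ALL: a null side is skipped; the result is null only if both are.
    static Batch unionAll(Batch left, Batch right) {
        if (left.isNull()) return right;
        if (right.isNull()) return left;
        List<List<Object>> all = new ArrayList<>(left.rows);
        all.addAll(right.rows);
        return new Batch(left.schema, all); // assumes both sides share a schema
    }
}
```

A join would need the same kind of check, but its answer depends on which side is null and on INNER/OUTER semantics, so it is left out of the sketch.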

To support the schema "fast track", operators should return an empty batch, with just schema, on the first call to {{next()}}. So, the scanner should return an empty batch (with schema) if a reader produces one; that is, it should skip null batches but pass an empty batch downstream.
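
One way to picture this behavior is sketched below. The class {{FastSchemaScan}} and its {{Batch}} type are hypothetical, and {{next()}} here simply returns a batch or {{null}}, which is a simplification and not Drill's actual iterator protocol. The sketch assumes reader outputs are already available as a list.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch for illustration only; not Drill's scan operator.
final class FastSchemaScan {
    static final class Batch {
        final List<String> schema; // null means a null batch (no schema)
        final int rowCount;

        Batch(List<String> schema, int rowCount) {
            this.schema = schema;
            this.rowCount = rowCount;
        }
    }

    private final Deque<Batch> pending;
    private boolean schemaSent = false;

    FastSchemaScan(List<Batch> readerOutputs) {
        this.pending = new ArrayDeque<>(readerOutputs);
    }

    /** Returns the next batch, or null once all reader output is consumed. */
    Batch next() {
        // Skip null batches: they carry neither schema nor data.
        while (!pending.isEmpty() && pending.peek().schema == null) {
            pending.poll();
        }
        if (pending.isEmpty()) {
            return null;
        }
        if (!schemaSent) {
            schemaSent = true;
            Batch head = pending.peek();
            if (head.rowCount == 0) {
                pending.poll(); // the reader's own empty batch serves as the schema batch
            }
            return new Batch(head.schema, 0); // first call: schema only, no rows
        }
        return pending.poll();
    }
}
```

With reader outputs of a null batch, then an empty batch with schema, then a batch with rows, the first {{next()}} yields the schema-only batch and the second yields the rows.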

Again, each operator should, on the first (preferably empty) batch, assemble output schema according to the rules for that operator.

Do we have a spec and/or JIRA that describes the design behind the "fast schema" feature added shortly after 1.0? We should consult that to ensure the empty batch handling here is consistent with that design.

> Schema change problems caused by empty batch
> --------------------------------------------
>
>                 Key: DRILL-5546
>                 URL: https://issues.apache.org/jira/browse/DRILL-5546
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> There have been a few JIRAs opened related to schema change failures caused by empty batches. This JIRA is opened as an umbrella for all those related JIRAs (such as DRILL-4686, DRILL-4734, DRILL-4476, DRILL-4255, etc.).
>  

