Date: Fri, 2 Jun 2017 15:59:04 +0000 (UTC)
From: "Paul Rogers (JIRA)"
To: issues@drill.apache.org
Subject: [jira] [Commented] (DRILL-5546) Schema change problems caused by empty batch

[ https://issues.apache.org/jira/browse/DRILL-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16034946#comment-16034946 ]

Paul Rogers commented on DRILL-5546:
------------------------------------

In general, I agree with the proposal. The only suggestion might be to change the emphasis.

In looking carefully at the readers, we see that an empty result set (empty batch) is a natural outcome of reading. Some files just happen to be empty. If filters are pushed down, then some files just happen to have no matching rows. Readers produce two distinct kinds of empty result sets:

* *Empty result set*: The reader found no data, but was able to find a schema. (Example: Parquet with a filter push-down or a JDBC query that returns no results.)
* *Null result set*: The reader found no data *and* no schema. (Example: empty CSV or JSON file.)

Note that filters also can produce an empty result set (if no rows match).

The Drill iterator protocol should be able to handle both kinds. It is perhaps a bit naive to expect that every operator has both a schema and a data set.
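
To make the two cases concrete, here is a minimal Java sketch. It is not Drill code: {{ReaderResult}}, {{Kind}}, and the two factory methods are hypothetical names chosen for illustration, and a schema is reduced to a list of column names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

/** Hypothetical model (not Drill's API) of what one reader hands back. */
final class ReaderResult {
    /** DATA: schema and rows. EMPTY: schema, zero rows. NULL: neither. */
    enum Kind { DATA, EMPTY, NULL }

    // Column names only; absent when the reader could not find a schema.
    private final Optional<List<String>> schema;
    private final int rowCount;

    private ReaderResult(Optional<List<String>> schema, int rowCount) {
        this.schema = schema;
        this.rowCount = rowCount;
    }

    /** Reader found a schema; rowCount may be 0 (e.g. a pushed-down filter matched nothing). */
    static ReaderResult withSchema(List<String> columns, int rowCount) {
        List<String> copy = Collections.unmodifiableList(new ArrayList<>(columns));
        return new ReaderResult(Optional.of(copy), rowCount);
    }

    /** Reader found no data and no schema (e.g. an empty CSV or JSON file). */
    static ReaderResult nullResult() {
        return new ReaderResult(Optional.empty(), 0);
    }

    /** Classifies the result into the three cases described above. */
    Kind kind() {
        if (!schema.isPresent()) {
            return Kind.NULL;
        }
        return rowCount == 0 ? Kind.EMPTY : Kind.DATA;
    }
}
```

The point of the sketch is only that "no rows" and "no schema" are tracked separately, so a downstream operator can tell the two apart.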

All operators should be able to identify, and handle, both null and empty result sets.

For the scanner, if one reader returns a null result set, just skip it and move to the next reader until a schema is found. If no reader has a non-null result set, then that branch of the query has no data (and no schema). That result should bubble up, with each operator handling the case depending on semantics. For example, a filter ignores the null result set. A UNION ALL skips that result set when assembling the result. A join handles the case depending on the side of the join and INNER/OUTER semantics, and so on.

To support the schema "fast track", operators should return an empty batch, with just schema, on the first call to {{next()}}. So, the scanner should return an empty batch (with schema) if a reader produces one (that is, skip null batches, return an empty batch). Again, each operator should, on the first (preferably empty) batch, assemble output schema according to the rules for that operator.

Do we have a spec and/or JIRA that describes the design behind the "fast schema" feature added shortly after 1.0? We should consult that to ensure the empty batch handling here is consistent with that design.

> Schema change problems caused by empty batch
> --------------------------------------------
>
>                 Key: DRILL-5546
>                 URL: https://issues.apache.org/jira/browse/DRILL-5546
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> There have been a few JIRAs opened related to schema change failure caused by empty batch. This JIRA is opened as an umbrella for all those related JIRAs (such as DRILL-4686, DRILL-4734, DRILL-4476, DRILL-4255, etc).
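
The scanner rule from the comment above (skip null result sets, return the first batch that has a schema even if it has no rows) could look roughly like the following Java sketch. This is an illustration under assumptions, not Drill's scanner: {{FirstBatchSketch}}, {{Batch}}, {{Reader}}, and {{firstBatch}} are hypothetical names, and a null result set is modeled as an empty {{Optional}}.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch (not Drill's scanner) of choosing the first batch a scan emits. */
final class FirstBatchSketch {

    /** A batch here is just column names plus a row count. */
    static final class Batch {
        final List<String> schema;
        final int rowCount;

        Batch(List<String> schema, int rowCount) {
            this.schema = schema;
            this.rowCount = rowCount;
        }
    }

    /** Hypothetical reader: an empty Optional means "no data and no schema". */
    interface Reader {
        Optional<Batch> read();
    }

    /**
     * Skips readers that produce a null result set and returns the first batch
     * that carries a schema, even when that batch has zero rows. An empty
     * Optional means every reader came back null, so this branch of the query
     * has neither data nor schema and the caller must handle that case.
     */
    static Optional<Batch> firstBatch(Iterator<Reader> readers) {
        while (readers.hasNext()) {
            Optional<Batch> result = readers.next().read();
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }
}
```

How a downstream operator reacts to the empty {{Optional}} (filter, UNION ALL, join) is left out on purpose, since that depends on each operator's semantics.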