MIME-Version: 1.0
References: <55FC8FAB.8080009@maprtech.com>
Date: Sun, 20 Sep 2015 15:00:03 -0700
Subject: Re: Resolving ScanBatch.next() behavior for 0-row readers; handling of NONE, OK_NEW_SCHEMA
From: Hsuan-Yi Chu
To: dev@drill.apache.org
Content-Type: multipart/alternative; boundary=001a11414f88130e08052034e375

--001a11414f88130e08052034e375
Content-Type: text/plain; charset=UTF-8

I think that if it is not a star query, it should behave just like a
query on a non-empty file but with LIMIT 0.

For example, "select a, b from `empty`" should produce a schema with
columns a and b.

If it is a star query, the output schema has no columns.

On Sat, Sep 19, 2015 at 2:41 AM, Stefán Baxter wrote:

> Nothing found. The same as if you query an empty table, empty view or
> empty anything.
>
> Granted that you get no indication of structure, but returning nothing
> would hardly require a structure, or does it?
>
> I hardly qualify for this discussion but here are my (other) two cents:
>
>    - Ignore "[" at the start of a JSON file and the "]" at the end
>       - this allows for querying of "valid" JSON files (small)
>
>    - Ignore incomplete entries at the end of files (that cause parsing
>      errors)
>       - this allows for the querying of "live" files that are being
>         appended to
>
> Regards,
>  -Stefán
>
>
> On Sat, Sep 19, 2015 at 1:58 AM, Jacques Nadeau wrote:
>
> > I think we should start with a much simpler discussion:
> >
> > If I query an empty file, what should we return?
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Fri, Sep 18, 2015 at 3:26 PM, Daniel Barclay wrote:
> >
> > > What sequence of RecordBatch.IterOutcome
> > > <https://github.com/dsbos/incubator-drill/blob/bugs/drill-3641/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L106>
> > > <https://github.com/dsbos/incubator-drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L41>
> > > values should ScanBatch's next() return for a reader (file/etc.)
> > > that has zero rows of data, and what does that sequence depend on
> > > (e.g., whether there's still a non-empty schema even though there
> > > are no rows, whether there are other files in the scan)? [See other
> > > questions at bottom.]
> > >
> > >
> > > I'm trying to resolve this question to fix DRILL-2288
> > > <https://issues.apache.org/jira/browse/DRILL-2288>. Its initial
> > > symptom was that INFORMATION_SCHEMA queries that return zero rows
> > > because of pushed-down filtering yielded results that have zero
> > > columns instead of the expected columns. An additional symptom was
> > > that "SELECT A, B, *" from an empty JSON file yielded zero columns
> > > instead of the expected columns A and B (with zero rows).
> > >
> > > The immediate cause of the problem (the missing schema information)
> > > was how ScanBatch.next() handled readers that returned no rows:
> > >
> > > If a reader has no rows at all, then the first call to its next()
> > > method (from ScanBatch.next()) returns zero (indicating that there
> > > are no more rows, and, in this case, no rows at all), and
> > > ScanBatch.next()'s call to the reader's mutator's isNewSchema()
> > > returns true, indicating that the reader has a schema that ScanBatch
> > > has not yet processed (e.g., notified its caller about).
> > >
> > > The way ScanBatch.next()'s code checked those conditions, when the
> > > last reader had no rows at all, ScanBatch.next() returned
> > > IterOutcome.NONE.
> > >
> > > However, when that /last/ reader was the /only/ reader, that
> > > returning of IterOutcome.NONE for a no-rows reader by
> > > ScanBatch.next() meant that next() never returned
> > > IterOutcome.OK_NEW_SCHEMA for that ScanBatch.
> > >
> > > That immediate return of NONE in turn meant that the downstream
> > > operator _never received a return value of OK_NEW_SCHEMA to trigger
> > > its schema processing_. (For example, in the DRILL-2288 JSON case,
> > > the project operator never constructed its own schema containing
> > > columns A and B plus whatever columns (none) came from the empty
> > > JSON file; in DRILL-2288's other case, the caller never propagated
> > > the statically known columns from the INFORMATION_SCHEMA table.)
> > >
> > > That returning of NONE without ever returning OK_NEW_SCHEMA also
> > > violates the (apparent) intended call/return protocol (sequence of
> > > IterOutcome values) for RecordBatch.next(). (See the draft Javadoc
> > > comments currently at RecordBatch.IterOutcome
> > > <https://github.com/dsbos/incubator-drill/blob/bugs/drill-3641/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L106>.)
> > >
> > >
> > > Therefore, it seems that ScanBatch.next() _must_ return
> > > OK_NEW_SCHEMA before returning NONE, instead of immediately
> > > returning NONE, for readers/files with zero rows for at least _some_
> > > cases. (It must both notify the downstream caller that there is a
> > > schema /and/ give the caller a chance to read the schema (which is
> > > allowed after OK_NEW_SCHEMA is returned but not after NONE).)
> > >
> > > However, it is not clear exactly what that set of cases is. (It does
> > > not seem to be _all_ zero-row cases--returning OK_NEW_SCHEMA before
> > > returning NONE in all zero-row cases causes lots of errors about
> > > schema changes.)
> > >
> > > At a higher level, the question is how zero-row files/etc. should
> > > interact with sibling files/etc. (i.e., when they do and don't cause
> > > a schema change). Note that some kinds of files/sources still have a
> > > schema even when they have zero rows of data (e.g., Parquet files,
> > > right?), while other kinds of files/sources can't define (imply) any
> > > schema unless they have at least one row (e.g., JSON files).
> > >
> > >
> > > In my in-progress fix
> > > <https://github.com/dsbos/incubator-drill/tree/bugs/WORK_2288_3641_3659>
> > > for DRILL-2288, I have currently changed ScanBatch.next() so that
> > > when the last reader has zero rows and next() would have returned
> > > NONE, next() now checks whether it has returned OK_NEW_SCHEMA yet
> > > (per any earlier files/readers), and, if not, now returns
> > > OK_NEW_SCHEMA, still returning NONE if so. (Note that, currently,
> > > that is regardless of whether the reader has no schema (as from an
> > > empty JSON file) or has a schema.)
> > >
> > > That change fixed the DRILL-2288 symptoms (apparently by giving
> > > downstream/calling operators notification that they didn't get
> > > before).
> > >
> > > The change initially caused problems in UnionAllRecordBatch, because
> > > its code checked for NONE vs. OK_NEW_SCHEMA to try to detect empty
> > > inputs rather than checking directly. UnionAllRecordBatch has been
> > > fixed (in the in-progress fix for DRILL-2288).
> > >
> > > However, that change still causes other schema-change problems. The
> > > additional returns of OK_NEW_SCHEMA are causing some code to
> > > perceive unprocessable schema changes. It is not yet clear whether
> > > the code should be checking the number of rows too, or OK_NEW_SCHEMA
> > > shouldn't be returned in as many subcases of the no-rows
> > > last-reader/file case.
> > >
> > >
> > > So, some open and potential questions seem to be:
> > >
> > > 1. Is it the case that a) any batch's next() should return
> > >    OK_NEW_SCHEMA before it returns NONE, and callers/downstream
> > >    batches should be able to count on getting OK_NEW_SCHEMA (e.g.,
> > >    to trigger setting up their downstream schemas), or that b) empty
> > >    files can cause next() to return NONE without ever returning
> > >    OK_NEW_SCHEMA, and therefore all downstream batch classes must
> > >    handle getting NONE before they have set up their schemas?
> > > 2. For a file/source kind that has a schema even when there are no
> > >    rows, should getting an empty file constitute a schema change?
> > >    (On one hand there are no actual /rows/ (following the new
> > >    schema) conflicting with any previous schema (and maybe rows),
> > >    but on the other hand there is a non-empty /schema/ that can
> > >    conflict when that's enough to matter.)
> > > 3. For a file/source kind that implies a schema only when there are
> > >    rows (e.g., JSON), when should or shouldn't that be considered a
> > >    schema change?
> > >    If ScanBatch reads non-empty JSON file A, reads empty JSON file
> > >    B, and reads non-empty JSON file C implying the same schema as A
> > >    did, should that be considered a schema change or not? (When
> > >    reading no-/empty-schema B, should ScanBatch keep the schema from
> > >    A and check against that when it gets to C, effectively ignoring
> > >    the existence of B completely?)
> > > 4. In ScanBatch.next(), when the last reader had no rows at all,
> > >    when should next() return OK_NEW_SCHEMA? Always? /iff/ the reader
> > >    has a non-empty schema? Just enough to never return NONE before
> > >    returning OK_NEW_SCHEMA (which means it acts differently for
> > >    otherwise-identical empty files, depending on what happened with
> > >    previous readers)? As in that last case except only if the reader
> > >    has a non-empty schema?
> > >
> > > Thanks,
> > > Daniel
> > >
> > > --
> > > Daniel Barclay
> > > MapR Technologies
> > >
> >
>

--001a11414f88130e08052034e375--
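
To make the "just enough to never return NONE before returning
OK_NEW_SCHEMA" option from question 4 concrete, here is a minimal Java
sketch. It is not Drill code: ScanOutcomeSketch, its local IterOutcome
enum, and outcomes() are hypothetical names used only for illustration.
Each reader is reduced to a row count, and all non-empty readers are
assumed to share one schema, so schema changes between files are not
modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the outcome sequence a scan could emit; not Drill's ScanBatch. */
final class ScanOutcomeSketch {

    /** Local stand-in for the three outcomes discussed in this thread. */
    enum IterOutcome { OK_NEW_SCHEMA, OK, NONE }

    /**
     * Returns the outcomes for readers given only as row counts.
     * Assumption: every non-empty reader implies the same schema.
     */
    static List<IterOutcome> outcomes(List<Integer> rowCounts) {
        List<IterOutcome> out = new ArrayList<>();
        boolean schemaSent = false;
        for (int rows : rowCounts) {
            if (rows > 0) {
                // The first batch with data announces the schema; later ones are plain OK.
                out.add(schemaSent ? IterOutcome.OK : IterOutcome.OK_NEW_SCHEMA);
                schemaSent = true;
            }
            // Zero-row readers emit nothing by themselves in this policy.
        }
        if (!schemaSent) {
            // No reader produced rows: still announce a (possibly empty) schema once.
            out.add(IterOutcome.OK_NEW_SCHEMA);
        }
        out.add(IterOutcome.NONE);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(outcomes(Arrays.asList(0)));       // a single empty reader
        System.out.println(outcomes(Arrays.asList(3, 0, 5))); // empty reader between two non-empty ones
    }
}
```

Under this policy a trailing or intermediate empty reader adds no extra
OK_NEW_SCHEMA, which is the behavior that avoids spurious schema-change
errors downstream; whether an empty reader that does carry a schema
should be treated differently is exactly what questions 2 and 3 leave
open.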