drill-dev mailing list archives

From Hsuan-Yi Chu <hsua...@usc.edu>
Subject Re: Resolving ScanBatch.next() behavior for 0-row readers; handling of NONE, OK_NEW_SCHEMA
Date Sun, 20 Sep 2015 22:00:03 GMT
I think it depends on whether it is a star query.

If it is not a star query, it should behave just like a query on a
non-empty file with LIMIT 0. For example:

"select a, b from `empty`" should produce a schema with columns a, b

If it is a star query, the output schema has no columns.
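
A minimal sketch of that rule, in plain Java with no Drill classes. The
names EmptyInputSchema and outputColumns are hypothetical, and "*" stands
in for a star projection. It only models which column names a query on an
empty, schema-less input would produce under the rule above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not Drill code: result columns for a query on an empty input. */
final class EmptyInputSchema {
    private EmptyInputSchema() {}

    /**
     * @param projection the SELECT list, where "*" denotes a star projection
     * @return the column names of the zero-row result
     */
    static List<String> outputColumns(List<String> projection) {
        List<String> columns = new ArrayList<>();
        for (String item : projection) {
            if ("*".equals(item)) {
                // Assumption: the empty input implies no schema, so the star expands to nothing.
                continue;
            }
            // Explicitly named columns are known from the query text alone.
            columns.add(item);
        }
        return columns;
    }
}
```

Under this sketch, a "select a, b" projection keeps columns a and b, a
bare star yields no columns, and "select a, b, *" keeps only a and b.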



On Sat, Sep 19, 2015 at 2:41 AM, Stefán Baxter <stefan@activitystream.com>
wrote:

> Nothing found. The same as if you query an empty table, empty view or empty
> anything.
>
> Granted that you get no indication of structure but returning nothing would
> hardly require a structure, or does it?
>
> I hardly qualify for this discussion but here are my (other) two cents:
>
>    - Ignore "[" at the start of a JSON file and the "]" at the end
>    - this allows for querying of "valid" JSON files (small)
>
>    - Ignore incomplete entries at the end of files (that cause parsing
>    errors)
>    - this allows for the querying of "live" files that are being
>    appended to
>
> Regards,
>  -Stefán
>
>
> On Sat, Sep 19, 2015 at 1:58 AM, Jacques Nadeau <jacques@dremio.com>
> wrote:
>
> > I think we should start with a much simpler discussion:
> >
> > If I query an empty file, what should we return?
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Fri, Sep 18, 2015 at 3:26 PM, Daniel Barclay <dbarclay@maprtech.com>
> > wrote:
> >
> > > What sequence of RecordBatch.IterOutcome
> > > <https://github.com/dsbos/incubator-drill/blob/bugs/drill-3641/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L106>
> > > <https://github.com/dsbos/incubator-drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L41>
> > > values should ScanBatch's next() return for a reader (file/etc.) that
> > > has zero rows of data, and what does that sequence depend on (e.g.,
> > > whether there's still a non-empty schema even though there are no
> > > rows, whether there are other files in the scan)?  [See other
> > > questions at bottom.]
> > >
> > >
> > > I'm trying to resolve this question to fix DRILL-2288
> > > <https://issues.apache.org/jira/browse/DRILL-2288>. Its initial symptom
> > > was that INFORMATION_SCHEMA queries that return zero rows because of
> > > pushed-down filtering yielded results that have zero columns instead of
> > > the expected columns.  An additional symptom was that "SELECT A, B, *"
> > > from an empty JSON file yielded zero columns instead of the expected
> > > columns A and B (with zero rows).
> > >
> > > The immediate cause of the problem (the missing schema information) was
> > > how ScanBatch.next() handled readers that returned no rows:
> > >
> > > If a reader has no rows at all, then the first call to its next()
> > > method (from ScanBatch.next()) returns zero (indicating that there are
> > > no more rows, and, in this case, no rows at all), and ScanBatch.next()'s
> > > call to the reader's mutator's isNewSchema() returns true, indicating
> > > that the reader has a schema that ScanBatch has not yet processed
> > > (e.g., notified its caller about).
> > >
> > > The way ScanBatch.next()'s code checked those conditions, when the last
> > > reader had no rows at all, ScanBatch.next() returned IterOutcome.NONE.
> > >
> > > However, when that /last/ reader was the /only/ reader, that returning
> > > of IterOutcome.NONE for a no-rows reader by ScanBatch.next() meant that
> > > next() never returned IterOutcome.OK_NEW_SCHEMA for that ScanBatch.
> > >
> > > That immediate return of NONE in turn meant that the downstream
> > > operator _never received a return value of OK_NEW_SCHEMA to trigger its
> > > schema processing_.  (For example, in the DRILL-2288 JSON case, the
> > > project operator never constructed its own schema containing columns A
> > > and B plus whatever columns (none) came from the empty JSON file; in
> > > DRILL-2288's other case, the caller never propagated the statically
> > > known columns from the INFORMATION_SCHEMA table.)
> > >
> > > That returning of NONE without ever returning OK_NEW_SCHEMA also
> > > violates the (apparently) intended call/return protocol (sequence of
> > > IterOutcome values) for RecordBatch.next(). (See the draft Javadoc
> > > comments currently at RecordBatch.IterOutcome
> > > <https://github.com/dsbos/incubator-drill/blob/bugs/drill-3641/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L106>.)
> > >
> > >
> > > Therefore, it seems that ScanBatch.next() _must_ return OK_NEW_SCHEMA
> > > before returning NONE, instead of immediately returning NONE, for
> > > readers/files with zero rows for at least _some_ cases.  (It must both
> > > notify the downstream caller that there is a schema /and/ give the
> > > caller a chance to read the schema (which is allowed after
> > > OK_NEW_SCHEMA is returned but not after NONE).)
> > >
> > > However, it is not clear exactly what that set of cases is.  (It does
> > > not seem to be _all_ zero-row cases--returning OK_NEW_SCHEMA before
> > > returning NONE in all zero-row cases causes lots of errors about schema
> > > changes.)
> > >
> > > At a higher level, the question is how zero-row files/etc. should
> > > interact with sibling files/etc. (i.e., when they do and don't cause a
> > > schema change).  Note that some kinds of files/sources still have a
> > > schema even when they have zero rows of data (e.g., Parquet files,
> > > right?), while other kinds of files/sources can't define (imply) any
> > > schema unless they have at least one row (e.g., JSON files).
> > >
> > >
> > > In my in-progress fix
> > > <https://github.com/dsbos/incubator-drill/tree/bugs/WORK_2288_3641_3659>
> > > for DRILL-2288, I have currently changed ScanBatch.next() so that when
> > > the last reader has zero rows and next() would have returned NONE,
> > > next() now checks whether it has returned OK_NEW_SCHEMA yet (per any
> > > earlier files/readers), and, if not, now returns OK_NEW_SCHEMA, still
> > > returning NONE if so.  (Note that, currently, that is regardless of
> > > whether the reader has no schema (as from an empty JSON file) or has a
> > > schema.)
> > >
> > > That change fixed the DRILL-2288 symptoms (apparently by giving
> > > downstream/calling operators notification that they didn't get before).
> > >
> > > The change initially caused problems in UnionAllRecordBatch, because
> > > its code checked for NONE vs. OK_NEW_SCHEMA to try to detect empty
> > > inputs rather than checking directly. UnionAllRecordBatch has been
> > > fixed (in the in-progress fix for DRILL-2288).
> > >
> > > However, that change still causes other schema-change problems.  The
> > > additional returns of OK_NEW_SCHEMA are causing some code to perceive
> > > unprocessable schema changes.  It is not yet clear whether the code
> > > should be checking the number of rows too, or OK_NEW_SCHEMA shouldn't
> > > be returned in as many subcases of the no-rows last-reader/file case.
> > >
> > >
> > > So, some open and potential questions seem to be:
> > >
> > > 1. Is it the case that a) any batch's next() should return
> > > OK_NEW_SCHEMA before it returns NONE, and callers/downstream batches
> > > should be able to count on getting OK_NEW_SCHEMA (e.g., to trigger
> > > setting up their downstream schemas), or that b) empty files can cause
> > > next() to return NONE without ever returning OK_NEW_SCHEMA, and
> > > therefore all downstream batch classes must handle getting NONE before
> > > they have set up their schemas?
> > > 2. For a file/source kind that has a schema even when there are no
> > > rows, should getting an empty file constitute a schema change?  (On one
> > > hand there are no actual /rows/ (following the new schema) conflicting
> > > with any previous schema (and maybe rows), but on the other hand there
> > > is a non-empty /schema/ that can conflict when that's enough to matter.)
> > > 3. For a file/source kind that implies a schema only when there are
> > > rows (e.g., JSON), when should or shouldn't that be considered a schema
> > > change? If ScanBatch reads non-empty JSON file A, reads empty JSON file
> > > B, and reads non-empty JSON file C implying the same schema as A did,
> > > should that be considered a schema change or not?  (When reading
> > > no-/empty-schema B, should ScanBatch keep the schema from A and check
> > > against that when it gets to C, effectively ignoring the existence of B
> > > completely?)
> > > 4. In ScanBatch.next(), when the last reader had no rows at all, when
> > > should next() return OK_NEW_SCHEMA? always? /iff/ the reader has a
> > > non-empty schema?  just enough to never return NONE before returning
> > > OK_NEW_SCHEMA (which means it acts differently for otherwise-identical
> > > empty files, depending on what happened with previous readers)?  as in
> > > that last case except only if the reader has a non-empty schema?
> > >
> > > Thanks,
> > > Daniel
> > >
> > > --
> > > Daniel Barclay
> > > MapR Technologies
> > >
> > >
> >
>
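
To make the outcome sequences in the quoted questions concrete, here is a
rough model of one of the options raised in question 4 ("just enough to
never return NONE before returning OK_NEW_SCHEMA"). It is not Drill's
ScanBatch: the class ScanOutcomeSketch, its constructor, and the
one-batch-per-reader simplification are all assumptions made for
illustration, and every reader is assumed to share one schema, so empty
readers between non-empty ones are simply skipped.

```java
/** Hypothetical model of a scan's outcome sequence; not Drill's ScanBatch. */
final class ScanOutcomeSketch {
    enum Outcome { OK_NEW_SCHEMA, OK, NONE }

    private final int[] rowsPerReader; // row count of each reader, in scan order
    private int readerIndex = 0;       // next reader to consume
    private boolean schemaSent = false; // has OK_NEW_SCHEMA been returned yet?

    ScanOutcomeSketch(int... rowsPerReader) {
        this.rowsPerReader = rowsPerReader.clone();
    }

    Outcome next() {
        // Skip zero-row readers (assumption: they never change the schema).
        while (readerIndex < rowsPerReader.length && rowsPerReader[readerIndex] == 0) {
            readerIndex++;
        }
        if (readerIndex < rowsPerReader.length) {
            // Simplification: each non-empty reader produces exactly one batch.
            readerIndex++;
            if (!schemaSent) {
                schemaSent = true;
                return Outcome.OK_NEW_SCHEMA;
            }
            return Outcome.OK;
        }
        // All readers are exhausted. Announce the schema once before NONE so
        // that a downstream operator always gets a chance to set up its schema.
        if (!schemaSent) {
            schemaSent = true;
            return Outcome.OK_NEW_SCHEMA;
        }
        return Outcome.NONE;
    }
}
```

In this model a single empty reader yields OK_NEW_SCHEMA and then NONE,
while readers with 3, 0, and 2 rows yield OK_NEW_SCHEMA, OK, and then
NONE. Whether real readers with differing or empty schemas should behave
this way is exactly what the quoted questions leave open.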
