drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stefán Baxter <ste...@activitystream.com>
Subject Re: Resolving ScanBatch.next() behavior for 0-row readers; handling of NONE, OK_NEW_SCHEMA
Date Sat, 19 Sep 2015 09:41:58 GMT
Nothing found. The same as if you query an empty table, empty view or empty
anything.

Granted that you get no indication of structure but returning nothing would
hardly require a structure, or does it?

I hardly qualify for this discussion but here are my (other) two cents:

   - Ignore "[" at the start of a jason file and the "]" at the end
   - this allows for querying of "valid" JSON files (small)

   - Ignore incomplete entries at the end of files (that cause parsing
   errors)
   - this allows for the querying of "live" files that are being appended to

Regards,
 -Stefán


On Sat, Sep 19, 2015 at 1:58 AM, Jacques Nadeau <jacques@dremio.com> wrote:

> I think we should start with a much simpler discussion:
>
> If I query an empty file, what should we return?
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Fri, Sep 18, 2015 at 3:26 PM, Daniel Barclay <dbarclay@maprtech.com>
> wrote:
>
> > What sequence of RecordBatch.IterOutcome<
> >
> https://github.com/dsbos/incubator-drill/blob/bugs/drill-3641/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L106
> > ><
> >
> https://github.com/dsbos/incubator-drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L41
> >
> > values should ScanBatch's next() return for a reader (file/etc.) that has
> > zero rows of data, and what does that sequence depend on (e.g., whether
> > there's still a non-empty schema even though there are no rows, whether
> > there other files in the scan)?  [See other questions at bottom.]
> >
> >
> > I'm trying to resolve this question to fix DRILL-2288 <
> > https://issues.apache.org/jira/browse/DRILL-2288>. Its initial symptom
> > was that INFORMATION_SCHEMA queries that return zero rows because of
> > pushed-down filtering yielded results that have zero columns instead of
> the
> > expected columns.  An additional symptom was that "SELECT A, B, *" from
> an
> > empty JSON file yielded zero columns instead of the expected columns A
> and
> > B (with zero rows).
> >
> > The immediate cause of the problem (the missing schema information) was
> > how ScanBatch.next() handled readers that returned no rows:
> >
> > If a reader has no rows at all, then the first call to its next() method
> > (from ScanBatch.next()) returns zero (indicating that there are no more
> > rows, and, in this case, no rows at all), and ScanBatch.next()'s call to
> > the reader's mutator's isNewSchema() returns true, indicating that the
> > reader has a schema that ScanBatch has not yet processed (e.g., notified
> > its caller about).
> >
> > The way ScanBatch.next()'s code checked those conditions, when the last
> > reader had no rows at all, ScanBatch.next() returned IterOutcome.NONE.
> >
> > However, when that /last /reader was the /only /reader, that returning of
> > IterOutcome.NONE for a no-rows reader by ScanBatch.next() meant that
> next()
> > never returned IterOutcome.OK_NEW_SCHEMA for that ScanBatch.
> >
> > That immediate return of NONE in turn meant that the downstream operator
> > _never received a return value of __OK_NEW_SCHEMA__to trigger its schema
> > processing_.  (For example, in the DRILL-2288 JSON case, the project
> > operator never constructed its own schema containing columns A and B plus
> > whatever columns (none) came from the empty JSON file; in DRILL-2288 's
> > other case, the caller never propagated the statically known columns from
> > the INFORMATION_SCHEMA table.)
> >
> > That returning of NONE without ever returning OK_NEW_SCHEMA also violates
> > the (apparent) intended call/return protocol (sequence of IterOutcome
> > values) for RecordBatch.next(). (See the draft Javadoc comments currently
> > at RecordBatch.IterOutcome <
> >
> https://github.com/dsbos/incubator-drill/blob/bugs/drill-3641/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L106
> > >.)
> >
> >
> > Therefore, it seems that ScanBatch.next() _must_ return OK_NEW_SCHEMA
> > before returning NONE, instead of immediately returning NONE, for
> > readers/files with zero rows for at least _some_ cases.  (It must both
> > notify the downstream caller that there is a schema /and/ give the
> caller a
> > chance to read the schema (which is allowed after OK_NEW_SCHEMA is
> returned
> > but not after NONE).)
> >
> > However, it is not clear exactly what that set of cases is.  (It does not
> > seem to be _all_ zero-row cases--returning OK_NEW_SCHEMA before returning
> > NONE in all zero-row cases causes lots of errors about schema changes.)
> >
> > At a higher level, the question is how zero-row files/etc. should
> interact
> > with sibling files/etc. (i.e., when they do and don't cause a schema
> > change).  Note that some kinds of files/sources still have a schema even
> > when they have zero rows of data (e.g., Parquet files, right?), while
> other
> > kinds of files/source can't define (imply) any schema unless they have at
> > least one row (e.g., JSON files).
> >
> >
> > In my in-progress fix <
> > https://github.com/dsbos/incubator-drill/tree/bugs/WORK_2288_3641_3659
> >for
> > DRILL-2288, I have currently changed ScanBatch.next()so that when the
> last
> > reader has zero rows and next()would have returned NONE, next() now
> checks
> > whether it has returned OK_NEW_SCHEMA yet (per any earlier
> files/readers),
> > and, if so, now returns OK_NEW_SCHEMA, still returning NONE if not.
> (Note
> > that, currently, that is regardless of whether the reader has no schema
> (as
> > from an empty JSON file) or has a schema.)
> >
> > That change fixed the DRILL-2288 symptoms (apparently by giving
> > downstream/calling operators notification that they didn't get before).
> >
> > The change initially caused problems in UnionAllRecordBatch, because its
> > code checked for NONE vs. OK_NEW_SCHEMA to try to detect empty inputs
> > rather than checking directly. UnionAllRecordBatch has been fixed (in the
> > in-progress fix for DRILL-2288).
> >
> > However, that change still causes other schema-change problems.  The
> > additional returns of OK_NEW_SCHEMA are causing some code to perceive
> > unprocessable schema changes.  It is not yet clear whether the code
> should
> > be checking the number of rows too, or OK_NEW_SCHEMA shouldn't be
> returned
> > in as many subcases of the no-rows last-reader/file case.
> >
> >
> > So, some open and potential questions seem to be:
> >
> > 1. Is it the case that a) any batch's next() should return OK_NEW_SCHEMA
> > before it returns NONE, and callers/downstream batches should be able to
> > count on getting OK_NEW_SCHEMA (e.g., to trigger setting up their
> > downstream schemas), or that b) empty files can cause next() to return
> NONE
> > without ever returning OK_NEW_SCHEMA , and therefore all downstream batch
> > classes must handle getting NONE before they have set up their schemas?
> > 2. For a file/source kind that has a schema even when there are no rows,
> > should getting an empty file constitute a schema change?  (On one hand
> > there are no actual /rows/ (following the new schema) conflicting with
> any
> > previous schema (and maybe rows), but on the other hand there is a
> > non-empty /schema /that can conflict when that's enough to matter.)
> > 3. For a file/source kind that implies a schema only when there are rows
> > (e.g., JSON), when should or shouldn't that be considered a schema
> change?
> > If ScanBatch reads non-empty JSON file A, reads empty JSON file B, and
> > reads non-empty JSON file C implying the same schema as A did, should
> that
> > be considered to not be schema change or not?  (When reading
> > no-/empty-schema B, should ScanBatch the keep the schema from A and check
> > against that when it gets to C, effectively ignoring the existence of B
> > completely?)
> > 4. In ScanBatch.next(), when the last reader had no rows at all, when
> > should next() return OK_NEW_SCHEMA? always? /iff/ the reader has a
> > non-empty schema?  just enough to never return NONE before returning
> > OK_NEW_SCHEMA (which means it acts differently for otherwise-identical
> > empty files, depending on what happened with previous readers)?  as in
> that
> > last case except only if the reader has a non-empty schema?
> >
> > Thanks,
> > Daniel
> >
> > --
> > Daniel Barclay
> > MapR Technologies
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message