drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Daniel Barclay <dbarc...@maprtech.com>
Subject Re: Resolving ScanBatch.next() behavior for 0-row readers; handling of NONE, OK_NEW_SCHEMA
Date Mon, 28 Sep 2015 21:29:09 GMT
I found the root cause of the difficulty in fixing ScanBatch.next() that led to the questions
below, so some of the question might be moot, and the rest have dropped in priority.

(The JSON reader has been reporting an extra schema change at the end of (some) complex-type
JSON files because a check-and-reset method was called on the right side of a short-circuit
operator. (When a top-level schema-change flag was set, a nested-schema-change status wasn't
cleared, so at the end of the file, the JSON reader reported a spurious schema change (with
zero records).))


I wrote:
> What sequence of RecordBatch.IterOutcome values should ScanBatch's next() return for
a reader (file/etc.) that has zero rows of data, and what does that sequence depend on (e.g.,
whether there's still a non-empty schema even though there are no rows, whether there other
files in the scan)?  [See other questions at bottom.]
> I'm trying to resolve this question to fix DRILL-2288 <https://issues.apache.org/jira/browse/DRILL-2288>.
Its initial symptom was that INFORMATION_SCHEMA queries that return zero rows because of pushed-down
filtering yielded results that have zero columns instead of the expected columns. An additional
symptom was that "SELECT A, B, *" from an empty JSON file yielded zero columns instead of
the expected columns A and B (with zero rows).
> The immediate cause of the problem (the missing schema information) was how ScanBatch.next()
handled readers that returned no rows:
> If a reader has no rows at all, then the first call to its next() method (from ScanBatch.next())
returns zero (indicating that there are no more rows, and, in this case, no rows at all),
and ScanBatch.next()'s call to the reader's mutator's isNewSchema() returns true, indicating
that the reader has a schema that ScanBatch has not yet processed (e.g., notified its caller
> The way ScanBatch.next()'s code checked those conditions, when the last reader had no
rows at all, ScanBatch.next() returned IterOutcome.NONE.
> However, when that /last /reader was the /only /reader, that returning of IterOutcome.NONE
for a no-rows reader by ScanBatch.next() meant that next() never returned IterOutcome.OK_NEW_SCHEMA
for that ScanBatch.
> That immediate return of NONE in turn meant that the downstream operator _never received
a return value of __OK_NEW_SCHEMA__to trigger its schema processing_.  (For example, in the
DRILL-2288 JSON case, the project operator never constructed its own schema containing columns
A and B plus whatever columns (none) came from the empty JSON file; in DRILL-2288 's other
case, the caller never propagated the statically known columns from the INFORMATION_SCHEMA
> That returning of NONE without ever returning OK_NEW_SCHEMA also violates the (apparent)
intended call/return protocol (sequence of IterOutcome values) for RecordBatch.next(). (See
the draft Javadoc comments currently at RecordBatch.IterOutcome <https://github.com/dsbos/incubator-drill/blob/bugs/drill-3641/exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatch.java#L106>.)
> Therefore, it seems that ScanBatch.next() _must_ return OK_NEW_SCHEMA before returning
NONE, instead of immediately returning NONE, for readers/files with zero rows for at least
_some_ cases.  (It must both notify the downstream caller that there is a schema /and/ give
the caller a chance to read the schema (which is allowed after OK_NEW_SCHEMA is returned but
not after NONE).)
> However, it is not clear exactly what that set of cases is.  (It does not seem to be
_all_ zero-row cases--returning OK_NEW_SCHEMA before returning NONE in all zero-row cases
causes lots of errors about schema changes.)
> At a higher level, the question is how zero-row files/etc. should interact with sibling
files/etc. (i.e., when they do and don't cause a schema change).  Note that some kinds of
files/sources still have a schema even when they have zero rows of data (e.g., Parquet files,
right?), while other kinds of files/source can't define (imply) any schema unless they have
at least one row (e.g., JSON files).
> In my in-progress fix <https://github.com/dsbos/incubator-drill/tree/bugs/WORK_2288_3641_3659>for
DRILL-2288, I have currently changed ScanBatch.next()so that when the last reader has zero
rows and next()would have returned NONE, next() now checks whether it has returned OK_NEW_SCHEMA
yet (per any earlier files/readers), and, if so, now returns OK_NEW_SCHEMA, still returning
NONE if not.  (Note that, currently, that is regardless of whether the reader has no schema
(as from an empty JSON file) or has a schema.)
> That change fixed the DRILL-2288 symptoms (apparently by giving downstream/calling operators
notification that they didn't get before).
> The change initially caused problems in UnionAllRecordBatch, because its code checked
for NONE vs. OK_NEW_SCHEMA to try to detect empty inputs rather than checking directly. UnionAllRecordBatch
has been fixed (in the in-progress fix for DRILL-2288).
> However, that change still causes other schema-change problems. The additional returns
of OK_NEW_SCHEMA are causing some code to perceive unprocessable schema changes.  It is not
yet clear whether the code should be checking the number of rows too, or OK_NEW_SCHEMA shouldn't
be returned in as many subcases of the no-rows last-reader/file case.
> So, some open and potential questions seem to be:
>  1. Is it the case that a) any batch's next() should return OK_NEW_SCHEMA before it returns
NONE, and callers/downstream batches should be able to count on getting OK_NEW_SCHEMA (e.g.,
to trigger setting up their downstream schemas), or that b) empty files can cause next() to
return NONE without ever returning OK_NEW_SCHEMA , and therefore all downstream batch classes
must handle getting NONE before they have set up their schemas?
>  2. For a file/source kind that has a schema even when there are no rows, should getting
an empty file constitute a schema change?  (On one hand there are no actual /rows/ (following
the new schema) conflicting with any previous schema (and maybe rows), but on the other hand
there is a non-empty /schema /that can conflict when that's enough to matter.)
>  3. For a file/source kind that implies a schema only when there are rows (e.g., JSON),
when should or shouldn't that be considered a schema change?  If ScanBatch reads non-empty
JSON file A, reads empty JSON file B, and reads non-empty JSON file C implying the same schema
as A did, should that be considered to not be schema change or not? (When reading no-/empty-schema
B, should ScanBatch the keep the schema from A and check against that when it gets to C, effectively
ignoring the existence of B completely?)
>  4. In ScanBatch.next(), when the last reader had no rows at all, when should next()
return OK_NEW_SCHEMA? always? /iff/ the reader has a non-empty schema?  just enough to never
return NONE before returning OK_NEW_SCHEMA (which means it acts differently for otherwise-identical
empty files, depending on what happened with previous readers)?  as in that last case except
only if the reader has a non-empty schema?
> Thanks,
> Daniel
> -- 
> Daniel Barclay
> MapR Technologies

Daniel Barclay
MapR Technologies

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message