drill-dev mailing list archives

From jinfengni <...@git.apache.org>
Subject [GitHub] drill pull request #906: DRILL-5546: Handle schema change exception failure ...
Date Sat, 26 Aug 2017 06:21:52 GMT
Github user jinfengni commented on a diff in the pull request:

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java
    @@ -152,97 +157,75 @@ public void kill(boolean sendUpstream) {
    -  private void releaseAssets() {
    -    container.zeroVectors();
    -  }
    -  private void clearFieldVectorMap() {
    -    for (final ValueVector v : mutator.fieldVectorMap().values()) {
    -      v.clear();
    -    }
    -  }
       public IterOutcome next() {
     if (done) {
       return IterOutcome.NONE;
     }
     try {
    -      try {
    -        injector.injectChecked(context.getExecutionControls(), "next-allocate", OutOfMemoryException.class);
    -        currentReader.allocate(mutator.fieldVectorMap());
    -      } catch (OutOfMemoryException e) {
    -        clearFieldVectorMap();
    -        throw UserException.memoryError(e).build(logger);
    -      }
    -      while ((recordCount = currentReader.next()) == 0) {
    +      while (true) {
             try {
    -          if (!readers.hasNext()) {
    -            // We're on the last reader, and it has no (more) rows.
    -            currentReader.close();
    -            releaseAssets();
    -            done = true;  // have any future call to next() return NONE
    -            if (mutator.isNewSchema()) {
    -              // This last reader has a new schema (e.g., we have a zero-row
    -              // file or other source).  (Note that some sources have a non-
    -              // null/non-trivial schema even when there are no rows.)
    +          injector.injectChecked(context.getExecutionControls(), "next-allocate", OutOfMemoryException.class);
    --- End diff --
    This patch tries to decouple the logic of the record reader and ScanBatch:
     - The record reader is responsible for adding vectors to the batch (via Mutator) and populating data.
     - ScanBatch is responsible for interpreting the record reader's output, checking rowCount && Mutator.isNewSchema() to decide whether to return OK_NEW_SCHEMA, OK, or NONE.
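    A minimal sketch of that decision loop, with hypothetical simplified names (this is not the actual ScanBatch code, which also manages allocation, injection, and implicit columns):

```java
import java.util.Iterator;

// Hypothetical, simplified sketch of the ScanBatch.next() decision logic
// described above; names mirror Drill's but this is not the real code.
class ScanBatchSketch {
    enum IterOutcome { OK_NEW_SCHEMA, OK, NONE }

    interface RecordReader { int next(); }        // populates vectors, returns row count
    interface Mutator { boolean isNewSchema(); }  // true if reader added/changed columns

    private final Iterator<RecordReader> readers;
    private final Mutator mutator;
    private RecordReader currentReader;
    private boolean done;

    ScanBatchSketch(Iterator<RecordReader> readers, Mutator mutator) {
        this.readers = readers;
        this.mutator = mutator;
        this.currentReader = readers.next();
    }

    public IterOutcome next() {
        if (done) {
            return IterOutcome.NONE;
        }
        while (true) {
            int recordCount = currentReader.next();
            boolean newSchema = mutator.isNewSchema();
            if (recordCount > 0 || newSchema) {
                // The reader produced rows and/or changed the schema: emit a batch.
                return newSchema ? IterOutcome.OK_NEW_SCHEMA : IterOutcome.OK;
            }
            if (!readers.hasNext()) {
                // All readers exhausted with no rows and no schema change:
                // return NONE directly, skipping OK_NEW_SCHEMA (this is how
                // empty JSON files get ignored).
                done = true;
                return IterOutcome.NONE;
            }
            currentReader = readers.next();  // empty reader: silently move on
        }
    }
}
```

    The key property is that an all-empty chain of readers falls through to NONE without ever emitting a schema.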
    > What happens on the first reader? There is no schema, so any schema is a new schema.
Suppose the file is JSON and the schema is built on the fly. Does the code handle the case
that we have no schema (first reader), and that reader adds no columns?
    It's not true that "any schema is a new schema". If the first reader has no schema and adds no columns, then Mutator.isNewSchema() should return false. Mutator.isNewSchema() returns true only if, since the last call, one or more of the following happens:
    - a new top-level field is added,
    - a field is added inside a nested field,
    - an existing field's type is changed.
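    A toy illustration of those three conditions (not Drill's actual Mutator) is to snapshot a field-path-to-type map on each call and report whether it changed:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy schema-change tracker: isNewSchema() reports whether any field path was
// added or had its type changed since the previous call. Names and structure
// are hypothetical; Drill's Mutator tracks ValueVectors, not strings.
class SchemaTracker {
    private final Map<String, String> current = new LinkedHashMap<>(); // "a.b" -> "INT"
    private Map<String, String> lastReported = new LinkedHashMap<>();

    // Adding a top-level field ("a"), adding a nested field ("a.b"), or
    // re-adding a path with a different type all make the next
    // isNewSchema() call return true.
    public void addField(String path, String type) {
        current.put(path, type);
    }

    public boolean isNewSchema() {
        boolean changed = !current.equals(lastReported);
        lastReported = new LinkedHashMap<>(current);
        return changed;
    }
}
```

    Note that a first reader which adds nothing leaves the map empty, so the first isNewSchema() call returns false, matching the empty-JSON case above.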
    You may argue that a more appropriate way to represent an empty JSON file is an empty schema. However, that idea would lead to various schema conflicts in downstream operators if other scan threads have non-empty JSON files. This is exactly what happened before this patch.
    The proposal in this patch is to **ignore** empty JSON files, since 1) rowCount = 0, and 2) no new columns were added to the batch.
     - If all the record readers for a scan thread return rowCount = 0 and produce no new schema, then the scan thread should return NONE directly, without returning OK_NEW_SCHEMA.
     - If at least one reader returns either > 0 rows or a new schema, then the scan thread will return a batch with the new schema.
     - If all scan threads return NONE directly, implying the entire table has no data/schema, that is the case Project.handleFastNone() deals with.
    >But, if the input is CSV, then we always have a schema. If the file has column headers,
then we know that the schema is, say, (a, b, c) because those are the headers. Or, if the
file has no headers, the schema is always the columns array. So, should we send that schema
downstream? If so, should it include the implicit columns?
    If CSV always adds columns (either _a,b,c_ or _columns_), then ScanBatch will produce a batch with (a, b, c), or with columns. It does not make sense to ignore those schemas.
      - In the case of a file with headers, a file with _a,b,c_ will lead to a batch with (a,b,c), while a file with _a,b,c,d_ will lead to a batch with (a,b,c,d). Those two files will cause a schema change, which is the expected behavior.
      - In the case of files without headers, all files will produce a batch with _columns_, which means there will be no schema change across files, regardless of whether they have 0 rows or more.
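    To make the two CSV cases concrete, a toy comparison (hypothetical helper, not Drill code) of the schemas different files would produce:

```java
import java.util.Arrays;
import java.util.List;

// Toy illustration of the two CSV modes discussed above (not Drill code):
// with headers, the schema is derived from the header row, so files with
// different headers trigger a schema change; without headers, every file
// maps to the single `columns` array, so schemas never differ across files.
class CsvSchemaDemo {
    static List<String> schemaOf(boolean hasHeaders, String headerLine) {
        if (hasHeaders) {
            return Arrays.asList(headerLine.split(","));  // e.g. (a, b, c)
        }
        return Arrays.asList("columns");                  // single array column
    }
}
```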
