drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5546) Schema change problems caused by empty batch
Date Mon, 21 Aug 2017 19:54:02 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135710#comment-16135710
] 

ASF GitHub Bot commented on DRILL-5546:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/906#discussion_r134305681
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java
---
    @@ -152,97 +157,75 @@ public void kill(boolean sendUpstream) {
         }
       }
     
    -  private void releaseAssets() {
    -    container.zeroVectors();
    -  }
    -
    -  private void clearFieldVectorMap() {
    -    for (final ValueVector v : mutator.fieldVectorMap().values()) {
    -      v.clear();
    -    }
    -  }
    -
       @Override
       public IterOutcome next() {
         if (done) {
           return IterOutcome.NONE;
         }
         oContext.getStats().startProcessing();
         try {
    -      try {
    -        injector.injectChecked(context.getExecutionControls(), "next-allocate", OutOfMemoryException.class);
    -
    -        currentReader.allocate(mutator.fieldVectorMap());
    -      } catch (OutOfMemoryException e) {
    -        clearFieldVectorMap();
    -        throw UserException.memoryError(e).build(logger);
    -      }
    -      while ((recordCount = currentReader.next()) == 0) {
    +      while (true) {
             try {
    -          if (!readers.hasNext()) {
    -            // We're on the last reader, and it has no (more) rows.
    -            currentReader.close();
    -            releaseAssets();
    -            done = true;  // have any future call to next() return NONE
    -
    -            if (mutator.isNewSchema()) {
    -              // This last reader has a new schema (e.g., we have a zero-row
    -              // file or other source).  (Note that some sources have a non-
    -              // null/non-trivial schema even when there are no rows.)
    +          injector.injectChecked(context.getExecutionControls(), "next-allocate", OutOfMemoryException.class);
    --- End diff --
    
    From the PR description:
    
    > 1. Skip RecordReader if it returns 0 row && present same schema. A new schema
(by calling Mutator.isNewSchema() ) means either a new top level field is added, or a field
in a nested field is added, or an existing field type is changed.
    
    The code, however, adds an additional condition: if implicit fields change. (But, as noted
below, that should never occur in practice.)
    
    What happens on the first reader? There is no schema, so *any* schema is a new schema.
Suppose the file is JSON and the schema is built on the fly. Does the code handle the case
that we have no schema (first reader), and that reader adds no columns?
    
    Or, according to the logic that the downstream wants to know the schema, even if there
are no records, do we send an empty schema (schema with no columns) downstream, because that
is an accurate representation of an empty JSON file?
    
    What happens in the case of an empty JSON file following a non-empty file? In this case,
do we consider the empty schema as a schema change relative to a non-empty change?
    
    In short, can we generalize this first rule a bit?
    
    > 2. Implicit columns are added and populated only when the input is not empty, i.e.
the batch contains > 0 row or rowCount == 0 && new schema.
    
    How does this interact with a scan batch that has only one file, and that file is empty?
Would we return the empty schema downstream? With the implicit columns?
    
    > 3. ScanBatch will return NONE directly (called as "fast NONE"), if all its RecordReaders
haver empty input and thus are skipped, in stead of returing OK_NEW_SCHEMA first.
    
    
    This is just a bit ambiguous. If the reader is JSON, then an empty file has an empty schema
(for reasons cited above.)
    
    But, if the input is CSV, then we *always* have a schema. If the file has column headers,
then we know that the schema is, say, (a, b, c) because those are the headers. Or, if the
file has no headers, the schema is always the `columns` array. So, should we send that schema
downstream? If so, should it include the implicit columns?
    
    This, in fact, raises another issue (out of scope for this PR): if we return an empty
batch with non-empty schema, we have no place to attach the implicit columns that will allow
the user to figure out that, say, "foo.csv" is empty.
    
    On the other hand, if we say that an empty CSV file has no schema, then we can skip that
file. The same might be true of JSON. What about Parquet? We'd have a schema even if there
are no rows. Same with JDBC. Should we return this schema, even if the data is empty?
    
    Finally, do we need special handling for "null" files: a file that no longer exists on
disk and so has a completely undefined schema? An undefined schema is different than an empty
schema. Undefined = "we just don't know what the schema might be." Empty = "we know there
is a schema and that schema is empty." A missing CSV file has an undefined schema. We could
argue that a 0-length (or, a one-byte file with the only content being a newline) is a valid
file, but it is a file with no columns, hence an empty schema.
    
    Or, should we have a rule that says something like:
    
    1. If a file is missing, or has an undefined schema, skip it.
    2. If a file has a schema (empty or not) and no rows, then:
      * If this is the first file, remember the schema and skip to the next.
      * If this file follows a non-empty file, trigger a schema change.
    4. If a file has a rows, then:
      * If the previous schema was empty, ignore that empty schema.
      * If the previous schema was non-empty, and different, return OK_NEW_SCHEMA
      * If the previous schema was the same, return OK.
    5. At EOF:
      * If the previous schema was empty, and the previous file had no rows, return that schema.
      * If the previous schema was empty, or no non-empty files, return NONE.
    
    We can reduce these to a simpler set of rules:
    
    * We ignore all empty or missing schemas.
    * We ignore empty files unless they are the only file(s) or they have a schema different
than the next non-empty file.
    
    Other rules are possible. One is simply to ignore all 0-length files regardless of whether
the schema is the same or different than surrounding files. If all files return 0 records,
just return NONE. (These are the rules adopted, as of now, by the new scan operator; though
we will change them to match the rules here one we pin them down.)


> Schema change problems caused by empty batch
> --------------------------------------------
>
>                 Key: DRILL-5546
>                 URL: https://issues.apache.org/jira/browse/DRILL-5546
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> There have been a few JIRAs opened related to schema change failure caused by empty batch.
This JIRA is opened as an umbrella for all those related JIRAS ( such as DRILL-4686, DRILL-4734,
DRILL4476, DRILL-4255, etc).
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message