Mailing-List: contact issues-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
Date: Mon, 21 Aug 2017 19:54:02 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)" <jira@apache.org>
To: issues@drill.apache.org
Message-ID: <JIRA.13075263.1495831286000.82716.1503345242891@Atlassian.JIRA>
In-Reply-To: <JIRA.13075263.1495831286000@Atlassian.JIRA>
References: <JIRA.13075263.1495831286000@Atlassian.JIRA> <JIRA.13075263.1495831286605@jira-lw-us.apache.org>
Subject: [jira] [Commented] (DRILL-5546) Schema change problems caused by
 empty batch
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Mon, 21 Aug 2017 19:54:14 -0000


    [ https://issues.apache.org/jira/browse/DRILL-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16135710#comment-16135710 ] 

ASF GitHub Bot commented on DRILL-5546:
---------------------------------------

Github user paul-rogers commented on a diff in the pull request:

    https://github.com/apache/drill/pull/906#discussion_r134305681
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java ---
    @@ -152,97 +157,75 @@ public void kill(boolean sendUpstream) {
         }
       }
     
    -  private void releaseAssets() {
    -    container.zeroVectors();
    -  }
    -
    -  private void clearFieldVectorMap() {
    -    for (final ValueVector v : mutator.fieldVectorMap().values()) {
    -      v.clear();
    -    }
    -  }
    -
       @Override
       public IterOutcome next() {
         if (done) {
           return IterOutcome.NONE;
         }
         oContext.getStats().startProcessing();
         try {
    -      try {
    -        injector.injectChecked(context.getExecutionControls(), "next-allocate", OutOfMemoryException.class);
    -
    -        currentReader.allocate(mutator.fieldVectorMap());
    -      } catch (OutOfMemoryException e) {
    -        clearFieldVectorMap();
    -        throw UserException.memoryError(e).build(logger);
    -      }
    -      while ((recordCount = currentReader.next()) == 0) {
    +      while (true) {
             try {
    -          if (!readers.hasNext()) {
    -            // We're on the last reader, and it has no (more) rows.
    -            currentReader.close();
    -            releaseAssets();
    -            done = true;  // have any future call to next() return NONE
    -
    -            if (mutator.isNewSchema()) {
    -              // This last reader has a new schema (e.g., we have a zero-row
    -              // file or other source).  (Note that some sources have a non-
    -              // null/non-trivial schema even when there are no rows.)
    +          injector.injectChecked(context.getExecutionControls(), "next-allocate", OutOfMemoryException.class);
    --- End diff --
    
    From the PR description:
    
    > 1. Skip RecordReader if it returns 0 row && present same schema. A new schema (by calling Mutator.isNewSchema() ) means either a new top level field is added, or a field in a nested field is added, or an existing field type is changed.
    
    The code, however, adds an additional condition: if implicit fields change. (But, as noted below, that should never occur in practice.)
    
    What happens on the first reader? There is no schema, so *any* schema is a new schema. Suppose the file is JSON and the schema is built on the fly. Does the code handle the case that we have no schema (first reader), and that reader adds no columns?
    
    Or, according to the logic that the downstream wants to know the schema, even if there are no records, do we send an empty schema (schema with no columns) downstream, because that is an accurate representation of an empty JSON file?
    
    What happens in the case of an empty JSON file following a non-empty file? In this case, do we consider the empty schema as a schema change relative to a non-empty change?
    
    In short, can we generalize this first rule a bit?
    
    > 2. Implicit columns are added and populated only when the input is not empty, i.e. the batch contains > 0 row or rowCount == 0 && new schema.
    
    How does this interact with a scan batch that has only one file, and that file is empty? Would we return the empty schema downstream? With the implicit columns?
    
    > 3. ScanBatch will return NONE directly (called as "fast NONE"), if all its RecordReaders haver empty input and thus are skipped, in stead of returing OK_NEW_SCHEMA first.
    
    
    This is just a bit ambiguous. If the reader is JSON, then an empty file has an empty schema (for reasons cited above.)
    
    But, if the input is CSV, then we *always* have a schema. If the file has column headers, then we know that the schema is, say, (a, b, c) because those are the headers. Or, if the file has no headers, the schema is always the `columns` array. So, should we send that schema downstream? If so, should it include the implicit columns?
    
    This, in fact, raises another issue (out of scope for this PR): if we return an empty batch with non-empty schema, we have no place to attach the implicit columns that will allow the user to figure out that, say, "foo.csv" is empty.
    
    On the other hand, if we say that an empty CSV file has no schema, then we can skip that file. The same might be true of JSON. What about Parquet? We'd have a schema even if there are no rows. Same with JDBC. Should we return this schema, even if the data is empty?
    
    Finally, do we need special handling for "null" files: a file that no longer exists on disk and so has a completely undefined schema? An undefined schema is different than an empty schema. Undefined = "we just don't know what the schema might be." Empty = "we know there is a schema and that schema is empty." A missing CSV file has an undefined schema. We could argue that a 0-length (or, a one-byte file with the only content being a newline) is a valid file, but it is a file with no columns, hence an empty schema.
    
    Or, should we have a rule that says something like:
    
    1. If a file is missing, or has an undefined schema, skip it.
    2. If a file has a schema (empty or not) and no rows, then:
      * If this is the first file, remember the schema and skip to the next.
      * If this file follows a non-empty file, trigger a schema change.
    4. If a file has a rows, then:
      * If the previous schema was empty, ignore that empty schema.
      * If the previous schema was non-empty, and different, return OK_NEW_SCHEMA
      * If the previous schema was the same, return OK.
    5. At EOF:
      * If the previous schema was empty, and the previous file had no rows, return that schema.
      * If the previous schema was empty, or no non-empty files, return NONE.
    
    We can reduce these to a simpler set of rules:
    
    * We ignore all empty or missing schemas.
    * We ignore empty files unless they are the only file(s) or they have a schema different than the next non-empty file.
    
    Other rules are possible. One is simply to ignore all 0-length files regardless of whether the schema is the same or different than surrounding files. If all files return 0 records, just return NONE. (These are the rules adopted, as of now, by the new scan operator; though we will change them to match the rules here one we pin them down.)


> Schema change problems caused by empty batch
> --------------------------------------------
>
>                 Key: DRILL-5546
>                 URL: https://issues.apache.org/jira/browse/DRILL-5546
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> There have been a few JIRAs opened related to schema change failure caused by empty batch. This JIRA is opened as an umbrella for all those related JIRAS ( such as DRILL-4686, DRILL-4734, DRILL4476, DRILL-4255, etc).
>  


--
This message was sent by Atlassian JIRA
(v6.4.14#64029)