drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5546) Schema change problems caused by empty batch
Date Fri, 11 Aug 2017 21:24:01 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124096#comment-16124096 ]

ASF GitHub Bot commented on DRILL-5546:
---------------------------------------

GitHub user jinfengni opened a pull request:

    https://github.com/apache/drill/pull/906

    DRILL-5546: Handle schema change exception failure caused by empty input or empty batch.
    
    1. Modify ScanBatch's logic for iterating over its list of RecordReaders.
       1) Skip a RecordReader if it returns 0 rows and presents the same schema.
          A new schema (as reported by Mutator.isNewSchema()) means that a new
          top-level field was added, a field was added inside a nested field,
          or an existing field's type changed.
       2) Implicit columns are added and populated only when the input is not
          empty, i.e. the batch contains > 0 rows, or rowCount == 0 with a new
          schema.
       3) ScanBatch returns NONE directly (a "fast NONE") if all of its
          RecordReaders have empty input and are therefore skipped, instead of
          returning OK_NEW_SCHEMA first.
    
    2. Modify IteratorValidatorBatchIterator to allow
       1) a fast NONE (before seeing an OK_NEW_SCHEMA)
       2) a batch with an empty list of columns.
    
    2. Modify JsonRecordReader for the case where it gets 0 rows: do not insert
       a nullable-int column for 0-row input. Together with the ScanBatch
       change, Drill will skip empty JSON files.
    
    3. Modify binary operators such as join and union to handle a fast NONE on
       either side or on both sides. The logic is abstracted into
       AbstractBinaryRecordBatch, except for MergeJoin, whose implementation is
       quite different from the others.
    
    4. Fix and refactor the union all operator.
       1) Correct the union operator's handling of 0 input rows. Previously, it
          ignored inputs with 0 rows and put nullable-int into the output
          schema, which caused various schema change issues in downstream
          operators. The new behavior takes the schema of a 0-row input into
          account when determining the output schema, in the same way as for
          inputs with > 0 rows. This ensures that the Union operator does not
          behave like a schema-lossy operator.
       2) Add a UnionInputIterator to simplify the logic for iterating over the
          left/right inputs, removing a significant chunk of duplicate code
          from the previous implementation.
       The new union all operator cuts the code size in half compared with the
       old one.
    
    5. Introduce UntypedNullVector to handle the convertFromJSON() function
       when the input batch contains 0 rows.
       Problem: convertFromJSON() differs from other regular functions in that
       it only knows the output schema after evaluation is performed. When the
       input has 0 rows, Drill essentially has no way to know the output type,
       and previously assumed Map type. That worked under the assumption that
       other operators such as Union would ignore batches with 0 rows, which
       is no longer the case in the current implementation.
       Solution: Use MinorType.NULL as the output type for convertFromJSON()
       when the input contains 0 rows. The new UntypedNullVector is used to
       represent a column with MinorType.NULL.
    
    6. HBaseGroupScan converts the star column into a list of row_key and
       column families. HBaseRecordReader should reject the star column, since
       it expects star to have been converted elsewhere.
       In HBase, a column family always has map type and a non-rowkey column
       always has nullable varbinary type. This ensures that HBaseRecordReaders
       across different HBase regions have the same top-level schema, even if
       a region is empty or all of its rows are pruned by the filter pushdown
       optimization. In other words, we will not see different top-level
       schemas from different HBaseRecordReaders for the same table.
       However, this change cannot handle a hard schema change: c1 exists in
       cf1 in one region but not in another. Further work is required to
       handle hard schema changes.
    
    7. Modify scan cost estimation when the query involves the * column. This
       removes planning randomness, since previously two different operators
       could have the same cost.
    
    8. Add a new flag 'outputProj' to the Project operator to indicate whether
       the Project is for the query's final output. Such a Project is added by
       TopProjectVisitor to handle a fast NONE when all inputs to the query are
       empty and are skipped.
       1) The star column is replaced with an empty list.
       2) A regular column reference is replaced with a nullable-int column.
       3) An expression goes through ExpressionTreeMaterializer, and the type
          of the materialized expression is used as the output type.
       4) Return an OK_NEW_SCHEMA with the schema built using the above logic,
          then return a NONE to the downstream operator.
    
    9. Add unit tests for operators handling empty input.
    
    10. Add unit tests for queries whose inputs are all empty.
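
As a rough illustration of item 1 in the list above, the reader-skipping rule and the fast NONE can be modeled in a few lines of Python. This is a simplified sketch, not Drill's Java code: `ReaderResult`, `scan`, and the tuple-based schema are hypothetical names invented for the example, and a reader is assumed to report a new schema only when it exposes at least one column and that schema differs from the one seen so far.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple

# A schema is modeled as a tuple of (column name, type name) pairs.
Schema = Tuple[Tuple[str, str], ...]


class Outcome(Enum):
    OK_NEW_SCHEMA = "OK_NEW_SCHEMA"
    OK = "OK"
    NONE = "NONE"


@dataclass
class ReaderResult:
    """Hypothetical stand-in for what one RecordReader produced."""
    schema: Schema
    row_count: int


def scan(readers: List[ReaderResult]) -> List[Tuple[Outcome, Schema, int]]:
    """Return the (outcome, schema, row_count) sequence a scan would emit."""
    emitted: List[Tuple[Outcome, Schema, int]] = []
    current: Optional[Schema] = None
    for reader in readers:
        # Assumption: "new schema" means the reader exposes columns that
        # differ from the schema seen so far.
        is_new_schema = bool(reader.schema) and reader.schema != current
        if reader.row_count == 0 and not is_new_schema:
            continue  # empty input with nothing new: skip this reader
        if is_new_schema:
            current = reader.schema
        # Implicit columns would be added here, i.e. only for batches
        # that are not skipped.
        outcome = Outcome.OK_NEW_SCHEMA if is_new_schema else Outcome.OK
        emitted.append((outcome, current or (), reader.row_count))
    # If every reader was skipped, this is the only entry: a fast NONE.
    emitted.append((Outcome.NONE, current or (), 0))
    return emitted
```

With two empty readers the sketch emits only NONE; with a non-empty reader in the mix, the first surviving batch carries OK_NEW_SCHEMA and later same-schema batches carry OK.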
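
The problem described in item 5 above (the output type of convertFromJSON() is only known after evaluation) can be shown with a small Python model. This is a sketch under stated assumptions rather than Drill's implementation: `convert_from_json_output_type` is a hypothetical helper, the type names are plain strings, and the type is inferred from the first value only to keep the example short.

```python
import json
from typing import List


def convert_from_json_output_type(json_values: List[str]) -> str:
    """Return a type name for the column produced from json_values."""
    if not json_values:
        # 0 rows: nothing was evaluated, so no type can be inferred.
        # Report an untyped NULL instead of guessing MAP.
        return "NULL"
    first = json.loads(json_values[0])
    if isinstance(first, dict):
        return "MAP"
    if isinstance(first, list):
        return "LIST"
    return "SCALAR"  # simplification: all scalar types lumped together
```

Returning an explicit "NULL" type for the 0-row case means a downstream operator that now looks at 0-row schemas is not told the column is a Map when that was never observed.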
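
The four rules in item 8 above can be sketched the same way. Again this is a hypothetical Python model, not the Project operator itself: `Star`, `ColumnRef`, `Expr`, and both functions are names made up for the example, and `materialized_type` stands in for whatever type ExpressionTreeMaterializer would produce.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union

NULLABLE_INT = "NULLABLE INT"  # label for the nullable-int placeholder type


@dataclass(frozen=True)
class Star:
    """The * column."""


@dataclass(frozen=True)
class ColumnRef:
    name: str


@dataclass(frozen=True)
class Expr:
    name: str
    materialized_type: str  # assumed result of materializing the expression


Projection = Union[Star, ColumnRef, Expr]
Schema = List[Tuple[str, str]]


def schema_on_fast_none(projections: List[Projection]) -> Schema:
    """Build the output schema of the top Project when all inputs were skipped."""
    schema: Schema = []
    for p in projections:
        if isinstance(p, Star):
            continue  # rule 1: star expands to an empty column list
        if isinstance(p, ColumnRef):
            schema.append((p.name, NULLABLE_INT))  # rule 2
        else:
            schema.append((p.name, p.materialized_type))  # rule 3
    return schema


def top_project_on_fast_none(
    projections: List[Projection],
) -> List[Tuple[str, Optional[Schema]]]:
    """Rule 4: emit OK_NEW_SCHEMA with the built schema, then NONE."""
    return [("OK_NEW_SCHEMA", schema_on_fast_none(projections)), ("NONE", None)]
```

For a query projecting only the star column, the sketch yields an OK_NEW_SCHEMA with an empty column list followed by NONE, which is the kind of batch the relaxed IteratorValidatorBatchIterator check in the list above has to accept.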

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinfengni/incubator-drill DRILL-5546

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/906.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #906
    
----
commit b0110140f8375809af3deddf9881e64dc1242886
Author: Jinfeng Ni <jni@apache.org>
Date:   2017-05-17T23:08:00Z

    DRILL-5546: Handle schema change exception failure caused by empty input or empty batch.
    

----


> Schema change problems caused by empty batch
> --------------------------------------------
>
>                 Key: DRILL-5546
>                 URL: https://issues.apache.org/jira/browse/DRILL-5546
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> There have been a few JIRAs opened related to schema change failures caused by empty batches.
> This JIRA is opened as an umbrella for all those related JIRAs (such as DRILL-4686, DRILL-4734,
> DRILL-4476, DRILL-4255, etc.).
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
