drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-6223) Drill fails on Schema changes
Date Fri, 30 Mar 2018 03:01:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420103#comment-16420103 ]

ASF GitHub Bot commented on DRILL-6223:

Github user paul-rogers commented on the issue:

    Sorry to say, I still disagree with this statement: "This pull request adds logic to detect
and eliminate dangling columns".
    There was a prior discussion that `SELECT *` means "return all columns", not "return only
those columns which happen to be in common." We discussed how removing columns midway through
a DAG can produce inconsistent results.
    But, let's take this particular case: the Project operator.
    What should happen (to be consistent with other parts of Drill) is that the operator
correctly fills in values for the "dangling" F3 so that the output is (F1, F2, F3).
    Note that this becomes VERY ambiguous. Suppose Projection sees the following from Files
A(F1, F2, F3) and B(F1, F2)
    * Batch A.1 (with F1, F2, F3)
    * Batch B.1 (with F1, F2)
    Clearly, the project can remember that F3 was previously seen and fill in the missing
column. (This is exactly what the new projection logic in the new scan framework does, by
the way.) This works, however, only if F3 is nullable. If not... what (non-null) value can
we fill in for F3?
    Had we known that F3 would turn up dangling, we could have converted F3 in the first batch
to become nullable, but Drill can't predict the future.
    Let's consider the proposal: we drop dangling columns. But, since the dangling column
(F3) appeared in the first batch, we didn't know it is dangling. Only when we see the second
batch (B.1) do we realize that F3 was dangling and we should have removed it. Again, this
algorithm requires effective time travel.
    Now, suppose that the order is reversed:
    * Batch B.1 (with F1, F2)
    * Batch A.1 (with F1, F2, F3)
    Here, we can identify F3 as dangling and could remove it, so the proposal is sound.
    On the other hand, the "fill in F3" trick does not work here because Project sends B.1
downstream. Later, it notices that A.1 adds a column. Project can't somehow retroactively
add the missing column; all it can do is trigger a schema change downstream. Again, Drill
can't predict the future to know that it has to fill in F3 in the first B.1 batch.
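    As a rough illustration of the two orderings above, here is a minimal Python sketch of a
streaming "fill in" projection. It is not Drill code: the batch model (a dict of column name
to values) and the name `fill_in_project` are assumptions made for the example. Each batch is
emitted before the next one is seen, so the operator can only fill in columns it already knows.

```python
# Hypothetical model: a batch is a dict mapping column name -> list of values.
def fill_in_project(batches):
    """Streaming 'fill in' projection: emit each batch before seeing the next."""
    seen = []  # columns observed so far, in first-seen order
    for batch in batches:
        for name in batch:
            if name not in seen:
                seen.append(name)
        rows = len(next(iter(batch.values()), []))
        # Columns seen earlier but absent from this batch are filled with nulls.
        yield {name: batch.get(name, [None] * rows) for name in seen}

A1 = {"F1": [1], "F2": ["a"], "F3": [1.5]}
B1 = {"F1": [2], "F2": ["b"]}

schemas_ab = [list(b) for b in fill_in_project([A1, B1])]
schemas_ba = [list(b) for b in fill_in_project([B1, A1])]
print(schemas_ab)  # [['F1', 'F2', 'F3'], ['F1', 'F2', 'F3']]
print(schemas_ba)  # [['F1', 'F2'], ['F1', 'F2', 'F3']]
```

    With A.1 first the output schema is stable; with B.1 first the second batch widens the
schema, which is the downstream schema change. The sketch also assumes F3 may hold nulls.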
    We've not yet discussed the case in which F2, which exists in both files, has distinct
types (INT in A, say, and VARCHAR in B). The dangling column trick won't work. The same logic
as above applies to the type mismatch.
    Perhaps we use either the "remove" or "fill in" depending on whether the column appears
in the first batch. So, for the following:
    * Batch A.1 (with F1, F2, F3)
    * Batch B.1 (with F1, F2)
    The result would be (F1, F2, F3)
    But if the input was:
    * Batch B.1 (with F1, F2)
    * Batch A.1 (with F1, F2, F3)
    The result would be (F1, F2)
    Since the user has no control over the order that files are read, the result would be
random: half the time the user gets one schema, the other half the other. It is unlikely that
the user will perceive that as a feature.
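    The order dependence of this combined rule can be seen in a small Python sketch. Again this
is only an illustration under assumptions: batches are modeled as dicts, and
`first_batch_wins` is a hypothetical name, not a Drill operator.

```python
# Hypothetical model: a batch is a dict mapping column name -> list of values.
def first_batch_wins(batches):
    """Fix the output schema to whatever the first batch happens to contain."""
    schema = None
    for batch in batches:
        if schema is None:
            schema = list(batch)  # the first batch decides the schema
        rows = len(next(iter(batch.values()), []))
        # "Remove": columns outside the schema are dropped.
        # "Fill in": schema columns missing from this batch become nulls.
        yield {name: batch.get(name, [None] * rows) for name in schema}

A1 = {"F1": [1], "F2": ["a"], "F3": [1.5]}
B1 = {"F1": [2], "F2": ["b"]}

result_ab = list(first_batch_wins([A1, B1]))
result_ba = list(first_batch_wins([B1, A1]))
print([tuple(b) for b in result_ab])  # [('F1', 'F2', 'F3'), ('F1', 'F2', 'F3')]
print([tuple(b) for b in result_ba])  # [('F1', 'F2'), ('F1', 'F2')]
```

    The same two files give (F1, F2, F3) or (F1, F2) depending only on which batch arrives first.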
    The general conclusion is that there is no way that Project can "smooth" the schema in
the general case: it would have to predict the future to do so.
    Now, let's think about other operators, Sort, say. Suppose we do `SELECT * FROM foo ORDER
BY x`. In this case, there is no project. The Sort operator will see batches with differing
schemas, but must sort/merge them together. The schemas must match. The Sort tries to do this
by using the union type (actually works, there is a unit test for it somewhere) if columns
have conflicting types. I suspect it does not work if a column is missing from some batches.
(Would need to test.)
    And, of course, the situation is worse if the dangling column is the sort key!
    Overall, while it is very appealing to think that dangling columns are a "bug" for which
there is a fix, the reality is not so simple. This is an inherent ambiguity in the Drill model
for which there is no fix that both works and is consistent with SQL semantics.
    What would work? A schema! Suppose we are told that the schema is (F1: INT, F2: VARCHAR,
F3: nullable DOUBLE). Now, when Project (or even the scanner) notices that F3 is missing,
it knows to add in the required column of the correct type and, voila! no schema change.
    Suppose that B defines F2 as INT. We know we want a VARCHAR, and so can do an implicit
conversion. Again, voila! no schema change.
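    A minimal Python sketch of the schema-driven approach follows. It is an illustration, not
Drill's implementation: the `SCHEMA` table, the `apply_schema` helper, and the mapping of
INT/VARCHAR/DOUBLE to Python's int/str/float are all assumptions made for the example.

```python
# Declared schema (F1: INT, F2: VARCHAR, F3: nullable DOUBLE), modeled as
# column name -> (Python type standing in for the SQL type, nullable flag).
SCHEMA = {"F1": (int, False), "F2": (str, False), "F3": (float, True)}

def apply_schema(batch, schema=SCHEMA):
    """Normalize one batch to the declared schema, independent of batch order."""
    rows = len(next(iter(batch.values()), []))
    out = {}
    for name, (py_type, nullable) in schema.items():
        if name in batch:
            # Implicit conversion to the declared type; nulls pass through.
            out[name] = [None if v is None else py_type(v) for v in batch[name]]
        elif nullable:
            out[name] = [None] * rows  # missing nullable column: fill with nulls
        else:
            raise ValueError(f"required column {name} is missing")
    return out

A1 = {"F1": [1], "F2": ["a"], "F3": [1.5]}
B1 = {"F1": [2], "F2": [7]}  # F2 arrives as INT in file B, and F3 is absent

print(apply_schema(A1))  # {'F1': [1], 'F2': ['a'], 'F3': [1.5]}
print(apply_schema(B1))  # {'F1': [2], 'F2': ['7'], 'F3': [None]}
```

    Because every batch is normalized against the same declared schema, the output columns and
types are identical whichever file is read first.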
    In summary, I completely agree that the scenario described is a problem. But, I don't
believe that removing columns is the fix; instead the only valid fix is to allow the user
to provide a schema.

> Drill fails on Schema changes 
> ------------------------------
>                 Key: DRILL-6223
>                 URL: https://issues.apache.org/jira/browse/DRILL-6223
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.10.0, 1.12.0
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.14.0
> Drill Query Failing when selecting all columns from a Complex Nested Data File (Parquet
> Set). There are differences in Schema among the files:
>  * The Parquet files exhibit differences both at the first level and within nested data
>  * A select * will not cause an exception but using a limit clause will
>  * Note also this issue seems to happen only when multiple Drillbit minor fragments are
>    involved (concurrency higher than one)

This message was sent by Atlassian JIRA
