drill-issues mailing list archives

From "Vitalii Diravka (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5826) UnorderedReceiverBatch fails to detect a schema change within a map
Date Fri, 29 Sep 2017 20:48:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16186403#comment-16186403 ]

Vitalii Diravka commented on DRILL-5826:

I have the same observations.

Can we skip this first empty batch, as is already done for other empty batches ["skip over empty batches"|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unorderedreceiver/UnorderedReceiverBatch.java#L161]?
Looks like the following change can resolve the issue:
{code}
// skip over empty batches. we do this since these are basically control messages.
while (batch != null && batch.getHeader().getDef().getRecordCount() == 0) {
  batch = getNextBatch();
}
{code}
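
To show the effect of that loop outside of Drill, here is a minimal, self-contained sketch. {{IncomingBatch}}, {{EmptyBatchSkipper}} and {{nextNonEmpty}} are hypothetical names invented for this illustration, not Drill classes, and a plain queue stands in for the receiver's input. It assumes that only the record count decides whether a batch is skipped.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

// Hypothetical stand-in for a raw incoming batch; only the row count matters here.
final class IncomingBatch {
    final String label;
    final int recordCount;

    IncomingBatch(String label, int recordCount) {
        this.label = label;
        this.recordCount = recordCount;
    }
}

// Hypothetical helper, not part of Drill.
final class EmptyBatchSkipper {
    // Returns the first batch with at least one row, or null when the input is exhausted.
    static IncomingBatch nextNonEmpty(Queue<IncomingBatch> incoming) {
        IncomingBatch batch = incoming.poll();
        // Skip every zero-row batch, including a zero-row batch that arrives first.
        while (batch != null && batch.recordCount == 0) {
            batch = incoming.poll();
        }
        return batch;
    }

    public static void main(String[] args) {
        Queue<IncomingBatch> incoming = new ArrayDeque<>(Arrays.asList(
                new IncomingBatch("schema-only (row_key, v{})", 0),
                new IncomingBatch("data (row_key, v{e0})", 1)));
        IncomingBatch first = nextNonEmpty(incoming);
        System.out.println(first.label + " rows=" + first.recordCount);
    }
}
```

With the two batches from the issue description, the zero-row batch carrying the empty map is dropped and the first batch seen downstream is the one that already contains v{e0}.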

> UnorderedReceiverBatch fails to detect a schema change within a map
> -------------------------------------------------------------------
>                 Key: DRILL-5826
>                 URL: https://issues.apache.org/jira/browse/DRILL-5826
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.11.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
> Run the following HBase query:
> {code}
> select * from `hbase`.browser_action2 a
> {code}
> Table is defined as:
> {code}
> > create 'browser_action2', 'v', {SPLITS => ['0','1','2','3','4','5','6','7','8','9']}
> ...
> > scan 'browser_action2'
> ROW                                   COLUMN+CELL                                   
>  1                                    column=v:e0, timestamp=1506560555979, value=abc1
>  2                                    column=v:e0, timestamp=1506560564807, value=abc2
> {code}
> Step through the {{UnorderedReceiverBatch}} with a parallelization of 1. Observe the following (behavior is random):
> * The first batch has schema (row_key, v), where v is an empty map (corresponding to a column family), but no data (zero rows).
> * Because the first batch has columns, it is sent downstream with {{OK_NEW_SCHEMA}}.
> * The second batch has schema (row_key, v{e0}), where v is a map with column e0 (corresponding to a column family with one column) and one row.
> * The code loads the batch, asking the batch itself if it has a new schema.
> * The batch does not have a new schema, so it returns false.
> * The {{UnorderedReceiverBatch}} returns {{OK}}, indicating to the downstream operator that the second batch has the same schema as the first (which, in this case, turns out not to be true).
> Code in question:
> {code}
>       final boolean schemaChanged = batchLoader.load(rbd, batch.getBody());
> {code}
> In point of fact, each sender has no visibility into the schemas of other senders, and the order in which batches are received is undefined. Therefore, an input batch has no way of knowing whether it has the same schema as the previous output batch.
> The obvious, correct logic is to compare the incoming batch schema with the current receiver schema, and send {{OK}} or {{OK_NEW_SCHEMA}} based on the result of that comparison.
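
The comparison described in the issue could be sketched roughly as follows. This is an illustration under stated assumptions, not Drill's implementation: {{SchemaNode}}, {{Outcome}} and {{ReceiverSchemaTracker}} are hypothetical names, and a schema is modeled as a simple tree so that a change inside a map (an empty v versus v{e0}) is visible to the equality check.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical schema model: a typed node with named children (children only used for maps).
final class SchemaNode {
    final String type;
    final Map<String, SchemaNode> children = new LinkedHashMap<>();

    SchemaNode(String type) {
        this.type = type;
    }

    SchemaNode child(String name, SchemaNode node) {
        children.put(name, node);
        return this;
    }

    // Deep equality: same type and, recursively, the same named children.
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof SchemaNode)) {
            return false;
        }
        SchemaNode that = (SchemaNode) other;
        return type.equals(that.type) && children.equals(that.children);
    }

    @Override
    public int hashCode() {
        return Objects.hash(type, children);
    }
}

// Hypothetical outcome names mirroring the two statuses mentioned in the issue.
enum Outcome { OK, OK_NEW_SCHEMA }

// The receiver, not the incoming batch, remembers the last schema it sent downstream.
final class ReceiverSchemaTracker {
    private SchemaNode current;

    Outcome accept(SchemaNode incoming) {
        if (incoming.equals(current)) {
            return Outcome.OK;
        }
        current = incoming;
        return Outcome.OK_NEW_SCHEMA;
    }
}
```

Under this model the first batch (row_key, v) yields {{OK_NEW_SCHEMA}}, and the second batch (row_key, v{e0}) also yields {{OK_NEW_SCHEMA}}, because the nested map differs even though the top-level columns are the same.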
