drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jinfeng Ni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5480) Empty batch returning from HBase may cause SchemChangeException or incorrect query result
Date Tue, 05 Sep 2017 22:09:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154452#comment-16154452
] 

Jinfeng Ni commented on DRILL-5480:
-----------------------------------

Run the above two queries on the patch of DRILL-5546, commit id: fde0a1df1734e0742b49aabdd28b02202ee2b044

{code}
select * from hbase.customer c, cp.`tpch/orders.parquet` o where cast(c.orders.id as bigint)
= o.o_orderkey and c.orders.id <= '200';
+--------------+----------------+-------------+------------+----------------+---------------+--------------+------------------+------------------+-----------------+-----------------------------------------------+
|   row_key    |     orders     | o_orderkey  | o_custkey  | o_orderstatus  | o_totalprice
 | o_orderdate  | o_orderpriority  |     o_clerk      | o_shippriority  |                
  o_comment                   |
+--------------+----------------+-------------+------------+----------------+---------------+--------------+------------------+------------------+-----------------+-----------------------------------------------+
| [B@3d5045ad  | {"id":"MTAw"}  | 100         | 1471       | O              | 198978.27  
  | 1998-02-28   | 4-NOT SPECIFIED  | Clerk#000000577  | 0               | heodolites detect
slyly alongside of the ent  |
+--------------+----------------+-------------+------------+----------------+---------------+--------------+------------------+------------------+-----------------+-----------------------------------------------+
1 row selected (7.488 seconds)
0: jdbc:drill:drillbit=10.10.10.244> select * from hbase.customer2 c, cp.`tpch/orders.parquet`
o where cast(c.orders.id as bigint) = o.o_orderkey and c.orders.id <= '500';
+--------------+----------------+-------------+------------+----------------+---------------+--------------+------------------+------------------+-----------------+-----------------------------------------------+
|   row_key    |     orders     | o_orderkey  | o_custkey  | o_orderstatus  | o_totalprice
 | o_orderdate  | o_orderpriority  |     o_clerk      | o_shippriority  |                
  o_comment                   |
+--------------+----------------+-------------+------------+----------------+---------------+--------------+------------------+------------------+-----------------+-----------------------------------------------+
| [B@30b078b0  | {"id":"MTAw"}  | 100         | 1471       | O              | 198978.27  
  | 1998-02-28   | 4-NOT SPECIFIED  | Clerk#000000577  | 0               | heodolites detect
slyly alongside of the ent  |
+--------------+----------------+-------------+------------+----------------+---------------+--------------+------------------+------------------+-----------------+-----------------------------------------------+
{code}

However, if there is hard schema change as shown in the table `'myhbase`, since different
hbase region contains different set of column family / column, in addition to having empty
region. If Drill need to access two regions with hard schema change, then we will still end
up with schema change failure.



> Empty batch returning from HBase may cause SchemChangeException or incorrect query result
> -----------------------------------------------------------------------------------------
>
>                 Key: DRILL-5480
>                 URL: https://issues.apache.org/jira/browse/DRILL-5480
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Jinfeng Ni
>            Assignee: Jinfeng Ni
>
> The following repo was provided by [~haozhu].
> 1. Create a Hbase table with 4 regions
> {code}
> create 'myhbase', 'cf1','cf2', {SPLITS => ['a', 'b', 'c']}
> put 'myhbase','a','cf1:col1','somedata'
> put 'myhbase','b','cf1:col2','somedata'
> put 'myhbase','c','cf2:col1','somedata'
> {code}
> One region has cf1.col1.  One region has column family 'cf1', but does not have 'col1'
under 'cf1'. One region has only column family 'cf2'. And last region is complete empty.
> 2. Prepare a csv file.
> {code}
> select * from dfs.tmp.`joinhbase.csv`;
> +-------------------+
> |      columns      |
> +-------------------+
> | ["1","somedata"]  |
> | ["2","somedata"]  |
> | ["3","somedata"]  |
> {code}
> Now run the following query on drill 1.11.0-SNAPSHOT:
> {code}
> select cast(H.row_key as varchar(10)) as keyCol, CONVERT_FROM(H.cf1.col1, 'UTF8') as
col1
> from 
> hbase.myhbase H JOIN dfs.tmp.`joinhbase.csv` C
> ON CONVERT_FROM(H.cf1.col1, 'UTF8')= C.columns[1]
> ;
> {code}
> The correct query result show be:
> {code}
> +---------+-----------+
> | keyCol  |   col1    |
> +---------+-----------+
> | a       | somedata  |
> | a       | somedata  |
> | a       | somedata  |
> +---------+-----------+
> {code}
> Turn off broadcast join, then we will see SchemaChangeException, or incorrect result
randomly. By 'randomly', it means in the same session, the same query would hit SchemaChangeException
in one run, while gets incorrect result in a second run. 
> {code}
> alter session set `planner.enable_broadcast_join`=false;
> {code}
> {code}
> select cast(H.row_key as varchar(10)) as keyCol, CONVERT_FROM(H.cf1.col1, 'UTF8') as
col1
> . . . . . . . . . . . . . . . . . .> from
> . . . . . . . . . . . . . . . . . .> hbase.myhbase H JOIN dfs.tmp.`joinhbase.csv`
C
> . . . . . . . . . . . . . . . . . .> ON CONVERT_FROM(H.cf1.col1, 'UTF8')= C.columns[1]
> . . . . . . . . . . . . . . . . . .> ;
> Error: SYSTEM ERROR: SchemaChangeException: Hash join does not support schema changes
> {code}
> {code}
> +---------+-------+
> | keyCol  | col1  |
> +---------+-------+
> +---------+-------+
> No rows selected (0.302 seconds)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message