hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Yongzhi Chen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-10866) Give a warning when client try to insert into bucketed table
Date Fri, 12 Jun 2015 17:43:01 GMT

     [ https://issues.apache.org/jira/browse/HIVE-10866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Yongzhi Chen updated HIVE-10866:
--------------------------------
    Description: 
Currently, hive does not support appends(insert into) bucketed table, see open jira HIVE-3608.
When insert into such table, the data will be "corrupted" and not fit for sort merge bucket
mapjoin. 
We need find a way to prevent client from inserting into such table. Or at least give a warning.
Reproduce:
{noformat}
CREATE TABLE IF NOT EXISTS buckettestoutput1( 
data string 
)CLUSTERED BY(data) 
INTO 2 BUCKETS 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE IF NOT EXISTS buckettestoutput2( 
data string 
)CLUSTERED BY(data) 
INTO 2 BUCKETS 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

set hive.enforce.bucketing = true; 
set hive.enforce.sorting=true;
insert into table buckettestoutput1 select code from sample_07 where total_emp < 134354250
limit 10;
After this first insert, I did:
set hive.auto.convert.sortmerge.join=true; 
set hive.optimize.bucketmapjoin = true; 
set hive.optimize.bucketmapjoin.sortedmerge = true; 
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

0: jdbc:hive2://localhost:10000> select * from buckettestoutput1 a join buckettestoutput2
b on (a.data=b.data);
+-------+-------+
| data  | data  |
+-------+-------+
+-------+-------+
So select works fine. 
Second insert:
0: jdbc:hive2://localhost:10000> insert into table buckettestoutput1 select code from sample_07
where total_emp >= 134354250 limit 10;
No rows affected (61.235 seconds)
Then select:
0: jdbc:hive2://localhost:10000> select * from buckettestoutput1 a join buckettestoutput2
b on (a.data=b.data);
Error: Error while compiling statement: FAILED: SemanticException [Error 10141]: Bucketed
table metadata is not correct. Fix the metadata or don't use bucketed mapjoin, by setting
hive.enforce.bucketmapjoin to false. The number of buckets for table buckettestoutput1 is
2, whereas the number of files is 4 (state=42000,code=10141)
0: jdbc:hive2://localhost:10000>
{noformat}
Insert into empty table or partition will be fine, but insert into the non-empty one (after
second insert in the reproduce), the SMB mapjoin will throw an error. We should not let second
insert succeed when user explicitly want to the table SMB available. 

  was:
Currently, hive does not support appends(insert into) bucketed table, see open jira HIVE-3608.
When insert into such table, the data will be "corrupted" and not fit for sort merge bucket
mapjoin. 
We need find a way to prevent client from inserting into such table. Or at least give a warning.
Reproduce:
{noformat}
CREATE TABLE IF NOT EXISTS buckettestoutput1( 
data string 
)CLUSTERED BY(data) 
INTO 2 BUCKETS 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
CREATE TABLE IF NOT EXISTS buckettestoutput2( 
data string 
)CLUSTERED BY(data) 
INTO 2 BUCKETS 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

set hive.enforce.bucketing = true; 
set hive.enforce.sorting=true;
insert into table buckettestoutput1 select code from sample_07 where total_emp < 134354250
limit 10;
After this first insert, I did:
set hive.auto.convert.sortmerge.join=true; 
set hive.optimize.bucketmapjoin = true; 
set hive.optimize.bucketmapjoin.sortedmerge = true; 
set hive.auto.convert.sortmerge.join.noconditionaltask=true;

0: jdbc:hive2://localhost:10000> select * from buckettestoutput1 a join buckettestoutput2
b on (a.data=b.data);
+-------+-------+
| data  | data  |
+-------+-------+
+-------+-------+
So select works fine. 
Second insert:
0: jdbc:hive2://localhost:10000> insert into table buckettestoutput1 select code from sample_07
where total_emp >= 134354250 limit 10;
No rows affected (61.235 seconds)
Then select:
0: jdbc:hive2://localhost:10000> select * from buckettestoutput1 a join buckettestoutput2
b on (a.data=b.data);
Error: Error while compiling statement: FAILED: SemanticException [Error 10141]: Bucketed
table metadata is not correct. Fix the metadata or don't use bucketed mapjoin, by setting
hive.enforce.bucketmapjoin to false. The number of buckets for table buckettestoutput1 is
2, whereas the number of files is 4 (state=42000,code=10141)
0: jdbc:hive2://localhost:10000>
{noformat}
Insert into empty table or partition will be fine, but insert into the non-empty one (after
second insert in the reproduce), the SMB mapjoin will throw an error. We should not let second
insert succeed. 


> Give a warning when client try to insert into bucketed table
> ------------------------------------------------------------
>
>                 Key: HIVE-10866
>                 URL: https://issues.apache.org/jira/browse/HIVE-10866
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: Yongzhi Chen
>            Assignee: Yongzhi Chen
>         Attachments: HIVE-10866.1.patch, HIVE-10866.2.patch, HIVE-10866.3.patch, HIVE-10866.4.patch
>
>
> Currently, hive does not support appends(insert into) bucketed table, see open jira HIVE-3608.
When insert into such table, the data will be "corrupted" and not fit for sort merge bucket
mapjoin. 
> We need find a way to prevent client from inserting into such table. Or at least give
a warning.
> Reproduce:
> {noformat}
> CREATE TABLE IF NOT EXISTS buckettestoutput1( 
> data string 
> )CLUSTERED BY(data) 
> INTO 2 BUCKETS 
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> CREATE TABLE IF NOT EXISTS buckettestoutput2( 
> data string 
> )CLUSTERED BY(data) 
> INTO 2 BUCKETS 
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
> set hive.enforce.bucketing = true; 
> set hive.enforce.sorting=true;
> insert into table buckettestoutput1 select code from sample_07 where total_emp < 134354250
limit 10;
> After this first insert, I did:
> set hive.auto.convert.sortmerge.join=true; 
> set hive.optimize.bucketmapjoin = true; 
> set hive.optimize.bucketmapjoin.sortedmerge = true; 
> set hive.auto.convert.sortmerge.join.noconditionaltask=true;
> 0: jdbc:hive2://localhost:10000> select * from buckettestoutput1 a join buckettestoutput2
b on (a.data=b.data);
> +-------+-------+
> | data  | data  |
> +-------+-------+
> +-------+-------+
> So select works fine. 
> Second insert:
> 0: jdbc:hive2://localhost:10000> insert into table buckettestoutput1 select code from
sample_07 where total_emp >= 134354250 limit 10;
> No rows affected (61.235 seconds)
> Then select:
> 0: jdbc:hive2://localhost:10000> select * from buckettestoutput1 a join buckettestoutput2
b on (a.data=b.data);
> Error: Error while compiling statement: FAILED: SemanticException [Error 10141]: Bucketed
table metadata is not correct. Fix the metadata or don't use bucketed mapjoin, by setting
hive.enforce.bucketmapjoin to false. The number of buckets for table buckettestoutput1 is
2, whereas the number of files is 4 (state=42000,code=10141)
> 0: jdbc:hive2://localhost:10000>
> {noformat}
> Insert into empty table or partition will be fine, but insert into the non-empty one
(after second insert in the reproduce), the SMB mapjoin will throw an error. We should not
let second insert succeed when user explicitly want to the table SMB available. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message