hive-dev mailing list archives

From "Navis (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-3244) Add table property which constraints sorting/bucketing for data loading
Date Mon, 09 Jul 2012 05:32:35 GMT
Navis created HIVE-3244:
---------------------------

             Summary: Add table property which constraints sorting/bucketing for data loading
                 Key: HIVE-3244
                 URL: https://issues.apache.org/jira/browse/HIVE-3244
             Project: Hive
          Issue Type: Improvement
          Components: Query Processor
    Affects Versions: 0.10.0
         Environment: ubuntu 10.10
            Reporter: Navis
            Assignee: Navis
            Priority: Minor


This ticket is intended to implement "INSERT INTO" for bucketed tables.

With the hive.enforce.bucketing option, a user can append data to a bucketed table. But the
current implementation determines a file's bucket number from the lexical order of the file
names, which does not always give the right answer.

So if the file name is suffixed with the bucket number when the file is inserted (moved), the
bucket number can be read back correctly when it is needed, such as in BucketMapJoinOptimizer.
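
To make the difference concrete, here is a minimal Python sketch (not the Hive code itself). It assumes the bucket number is encoded as a trailing "_[N]" suffix, which is only inferred from the plan output in this ticket, and it models the current behaviour in simplified form as "position in sorted order = bucket number". The function names are hypothetical.

```python
import re

# Assumed suffix format "_[N]", inferred from the plan output in this ticket;
# the real patch may encode the bucket number differently.
BUCKET_SUFFIX = re.compile(r"_\[(\d+)\]$")


def bucket_by_lexical_order(file_names):
    """Simplified model of the current behaviour: index in sorted order."""
    return {name: i for i, name in enumerate(sorted(file_names))}


def bucket_from_suffix(file_name):
    """Read the bucket number from a trailing "_[N]" suffix, or None if absent."""
    match = BUCKET_SUFFIX.search(file_name)
    return int(match.group(1)) if match else None


# A 2-bucket table after a second INSERT INTO holds two files per bucket.
files = [
    "000000_0_[0]",
    "000000_0_copy_1_[0]",
    "000001_0_[1]",
    "000001_0_copy_1_[1]",
]
by_position = bucket_by_lexical_order(files)
by_suffix = {name: bucket_from_suffix(name) for name in files}
```

Under these assumptions, `by_position` assigns the copied file of bucket 0 to position 1 and pushes the bucket-1 files to positions 2 and 3, while `by_suffix` keeps every file in the bucket it was written for.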

With simple prototype code, which will be attached after this is filed, the test query
{noformat}
create table bucket_test (key int, value string)
clustered by (key) sorted by (key) into 4 buckets
TBLPROPERTIES ('FORCEDBUCKETING'='TRUE', 'FORCEDSORTING'='TRUE');

set hive.optimize.bucketmapjoin = true;

insert into table bucket_test select key, value from src1;
explain extended select /*+MAPJOIN(b)*/ * from bucket_test a join bucket_test b on a.key=b.key;

insert into table bucket_test select key, value from src1;
explain extended select /*+MAPJOIN(b)*/ * from bucket_test a join bucket_test b on a.key=b.key;
{noformat}

produced the following bucket file mappings:
{noformat}
1. first plan
 b {000000_0_[0]=[000000_0_[0]],
    000001_0_[1]=[000001_0_[1]],
    000002_0_[2]=[000002_0_[2]],
    000003_0_[3]=[000003_0_[3]]}

2. second plan
 b {000000_0_[0]=[000000_0_[0], 000000_0_copy_1_[0]],
    000000_0_copy_1_[0]=[000000_0_[0], 000000_0_copy_1_[0]],
    000001_0_[1]=[000001_0_[1], 000001_0_copy_1_[1]],
    000001_0_copy_1_[1]=[000001_0_[1], 000001_0_copy_1_[1]],
    000002_0_[2]=[000002_0_[2], 000002_0_copy_1_[2]],
    000002_0_copy_1_[2]=[000002_0_[2], 000002_0_copy_1_[2]],
    000003_0_[3]=[000003_0_[3], 000003_0_copy_1_[3]],
    000003_0_copy_1_[3]=[000003_0_[3], 000003_0_copy_1_[3]]}
{noformat}
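
The shape of the second plan can be reproduced with a small Python sketch that groups files by the bucket number in their names. This is an illustration only, not the BucketMapJoinOptimizer logic: it assumes the "_[N]" suffix format seen in the output above and assumes both sides of the join have the same number of buckets. `bucket_of` and `bucket_file_mapping` are hypothetical names.

```python
import re
from collections import defaultdict

# Assumed suffix format "_[N]", taken from the plan output above.
BUCKET_SUFFIX = re.compile(r"_\[(\d+)\]$")


def bucket_of(file_name):
    """Return the bucket number encoded in the file name suffix."""
    match = BUCKET_SUFFIX.search(file_name)
    if match is None:
        raise ValueError("no bucket suffix in file name: " + file_name)
    return int(match.group(1))


def bucket_file_mapping(big_files, small_files):
    """Map each big-table file to all small-table files of the same bucket."""
    by_bucket = defaultdict(list)
    for name in sorted(small_files):
        by_bucket[bucket_of(name)].append(name)
    return {name: list(by_bucket[bucket_of(name)]) for name in sorted(big_files)}


# Files of bucket_test after the second INSERT INTO (4 buckets, one copy each).
files = [
    "00000{0}_0{1}_[{0}]".format(bucket, copy)
    for bucket in range(4)
    for copy in ("", "_copy_1")
]
mapping = bucket_file_mapping(files, files)  # self-join, as in the test query
```

Each of the eight files maps to the two files that share its bucket number, which matches the second plan.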

Currently, I've prevented direct loading via 'LOAD DATA' into forced-bucketing tables. But with
proper file name validation, that could be allowed.
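
One possible form of such validation, sketched in Python under the same assumed "_[N]" suffix format; `validate_load_files` is a hypothetical helper, not part of the prototype. It only checks names. It cannot confirm that the rows inside a file actually hash to the bucket the name claims.

```python
import re

# Assumed suffix format "_[N]", inferred from the plan output in this ticket.
BUCKET_SUFFIX = re.compile(r"_\[(\d+)\]$")


def validate_load_files(file_names, num_buckets):
    """Return the names that lack a bucket suffix or name an out-of-range bucket."""
    rejected = []
    for name in file_names:
        match = BUCKET_SUFFIX.search(name)
        if match is None or int(match.group(1)) >= num_buckets:
            rejected.append(name)
    return rejected


# Candidate files for LOAD DATA into a 4-bucket table.
rejected = validate_load_files(
    ["000000_0_[0]", "000003_0_[3]", "part-00000", "000004_0_[4]"],
    num_buckets=4,
)
```

A load would be accepted only when `rejected` is empty. Here the unsuffixed file and the file naming bucket 4 are both rejected.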

