hive-dev mailing list archives

From "Namit Jain (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-3244) Add table property which constraints sorting/bucketing for data loading
Date Wed, 18 Jul 2012 10:39:35 GMT

    [ https://issues.apache.org/jira/browse/HIVE-3244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13416986#comment-13416986 ]

Namit Jain commented on HIVE-3244:
----------------------------------

Let us use strict mode for that.
Adding more and more properties may be more confusing.
                
> Add table property which constraints sorting/bucketing for data loading
> -----------------------------------------------------------------------
>
>                 Key: HIVE-3244
>                 URL: https://issues.apache.org/jira/browse/HIVE-3244
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 0.10.0
>         Environment: ubuntu 10.10
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>
> This ticket is intended to implement "INSERT INTO" for bucketed tables.
> With the hive.enforce.bucketing option, a user can append data to a bucketed table. But the current implementation relies on the lexical order of file names to determine the bucket number of each file, and that assumption does not always hold.
> So if the file name is suffixed with the bucket number when inserting (moving), the bucket number can be read back correctly when it is needed, such as in BucketMapJoinOptimizer.
> With simple prototype code, which will be attached after writing this, the test query
> {noformat}
> create table bucket_test (key int, value string) clustered by (key) sorted by (key) into 4 buckets
> TBLPROPERTIES ('FORCEDBUCKETING'='TRUE', 'FORCEDSORTING'='TRUE');
> set hive.optimize.bucketmapjoin = true;
> insert into table bucket_test select key, value from src1;
> explain extended select /*+MAPJOIN(b)*/ * from bucket_test a join bucket_test b on a.key=b.key;
> insert into table bucket_test select key, value from src1;
> explain extended select /*+MAPJOIN(b)*/ * from bucket_test a join bucket_test b on a.key=b.key;
> {noformat}
> produced the plans below
> {noformat}
> 1. first plan
>  b {000000_0_[0]=[000000_0_[0]], 000001_0_[1]=[000001_0_[1]], 000002_0_[2]=[000002_0_[2]], 000003_0_[3]=[000003_0_[3]]}
> 2. second plan
>  b {000000_0_[0]=[000000_0_[0], 000000_0_copy_1_[0]], 000000_0_copy_1_[0]=[000000_0_[0], 000000_0_copy_1_[0]], 000001_0_[1]=[000001_0_[1], 000001_0_copy_1_[1]], 000001_0_copy_1_[1]=[000001_0_[1], 000001_0_copy_1_[1]], 000002_0_[2]=[000002_0_[2], 000002_0_copy_1_[2]], 000002_0_copy_1_[2]=[000002_0_[2], 000002_0_copy_1_[2]], 000003_0_[3]=[000003_0_[3], 000003_0_copy_1_[3]], 000003_0_copy_1_[3]=[000003_0_[3], 000003_0_copy_1_[3]]}
> {noformat}
> Currently, I've prevented direct loading via 'LOAD DATA' for forced-bucketing tables. But with proper file name validation, that could be allowed.
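
To illustrate the file-naming point in the description above, here is a minimal sketch in Python (not Hive's actual code). It assumes the bracketed number shown in the plan listing, for example "000001_0_copy_1_[1]", is the bucket suffix; the real naming scheme used by the prototype patch is not shown in this message. The function names and the regular expression are hypothetical. The sketch contrasts deriving a bucket number from a file's position in lexical order with reading it from the suffix.

```python
import re

# Assumption: the bucket number is the bracketed suffix seen in the plan
# listing (e.g. "000001_0_copy_1_[1]"). This pattern is illustrative only.
BUCKET_SUFFIX = re.compile(r"_\[(\d+)\]$")


def bucket_by_lexical_order(file_names):
    """Assign bucket numbers by position in sorted order.

    This breaks once a second insert adds "_copy_N" files, because
    there are then more files than buckets.
    """
    return {name: index for index, name in enumerate(sorted(file_names))}


def bucket_by_suffix(file_names):
    """Read the bucket number from the bracketed suffix of each name."""
    buckets = {}
    for name in file_names:
        match = BUCKET_SUFFIX.search(name)
        if match is None:
            raise ValueError(f"no bucket suffix in file name: {name!r}")
        buckets[name] = int(match.group(1))
    return buckets


if __name__ == "__main__":
    # File names taken from the second plan in the description.
    names = [
        f"{bucket:06d}_0{copy}_[{bucket}]"
        for bucket in range(4)
        for copy in ("", "_copy_1")
    ]
    for name in sorted(names):
        print(name,
              bucket_by_lexical_order(names)[name],
              bucket_by_suffix(names)[name])
```

With the eight file names from the second plan, the positional approach numbers the files 0 through 7 even though the table has only 4 buckets, while the suffix-based approach maps each "_copy_1" file to the same bucket as its original.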

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
