hive-dev mailing list archives

From "Prasanth J (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-6872) Explore options of optimizing FileSinkOperator-->getDynOutPaths()
Date Tue, 15 Apr 2014 17:43:26 GMT

    [ https://issues.apache.org/jira/browse/HIVE-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13969797#comment-13969797 ]

Prasanth J commented on HIVE-6872:
----------------------------------

[~rajesh.balamohan] Could you please post the patch on Review Board? Here is the link: https://reviews.apache.org/r/new/

> Explore options of optimizing FileSinkOperator-->getDynOutPaths()
> -----------------------------------------------------------------
>
>                 Key: HIVE-6872
>                 URL: https://issues.apache.org/jira/browse/HIVE-6872
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>            Priority: Critical
>         Attachments: HIVE-6782-v3.patch, HIVE-6782-v4.patch
>
>
> 1. Download hive-testbench from https://github.com/cartershanklin/hive-testbench
> 2. Generate data using "./tpcds-setup.sh 10 /user/hive/external partitioned" 
> 3. Most of the data population for tables with "partition + bucket + sorted data" runs very slowly, even with a scale factor of 10 on a 20-node cluster.
> The bottleneck seems to be in FileSinkOperator-->getDynOutPaths(), where it tries to close the FSPath writers. Every call takes almost 150-200 ms.
> set hive.enforce.bucketing=true;
> set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.max.dynamic.partitions.pernode=4096;
> With the above settings, one of the data loads (for the web_sales table) spent almost 4096 * 150 ms, or roughly 600 seconds, just closing the writers sequentially.
> The purpose of this jira is to figure out options for optimizing the FileSinkOperator-->getDynOutPaths() code path. This will be especially beneficial for ETL-type workloads.
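
As a rough check on the figures quoted in the description, the following Python sketch works through the arithmetic for closing 4096 writers one after another at 150-200 ms each. It is only a back-of-the-envelope model, not Hive code: the function names are hypothetical, and the pooled variant is an idealized assumption (perfectly even split, no contention) included purely for comparison. It is not the approach taken in the attached patches.

```python
import math


def sequential_close_seconds(num_writers, close_ms):
    """Total time when writers are closed one at a time (the case in the report)."""
    return num_writers * close_ms / 1000.0


def pooled_close_seconds(num_writers, close_ms, workers):
    """Idealized time if closes were spread evenly over `workers` threads.

    Assumption for illustration only: every close costs the same and
    the threads do not contend with each other.
    """
    return math.ceil(num_writers / workers) * close_ms / 1000.0


if __name__ == "__main__":
    # hive.exec.max.dynamic.partitions.pernode=4096, 150-200 ms per close
    low = sequential_close_seconds(4096, 150)   # 614.4 s
    high = sequential_close_seconds(4096, 200)  # 819.2 s
    print(f"sequential: {low:.1f} s to {high:.1f} s")

    # Hypothetical 8-thread pool, same per-close cost
    print(f"pooled (8 workers): {pooled_close_seconds(4096, 150, 8):.1f} s")
```

At 150 ms per close the sequential total is 614.4 seconds, which matches the "almost 600 seconds" estimate in the description; at 200 ms it rises to 819.2 seconds.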



--
This message was sent by Atlassian JIRA
(v6.2#6252)
