hive-issues mailing list archives

From "Rajesh Balamohan (JIRA)" <>
Subject [jira] [Commented] (HIVE-14128) Parallelize jobClose phases
Date Mon, 01 Aug 2016 11:42:20 GMT


Rajesh Balamohan commented on HIVE-14128:

[~ashutoshc] - In the non-partitioned case, there can be multiple part files within the temp directory.
Moving this directory on HDFS is simple. But on some file systems, such as S3, it would
still turn out to be expensive.  E.g., lineitem is a non-partitioned dataset in TPC-H.  A simple
insert overwrite would have the following move at the end of the job.  Please note that this
directory internally has 300+ part files, so the rename would turn out to be expensive here.

2016-08-01T04:40:00,154  INFO [JobClose-Thread-0] exec.FileSinkOperator: Moving tmp dir: s3a://bucket/lineitem/.hive-staging_hive_2016-08-01_04-31-26_432_5317262787271448273-1/_tmp.-ext-10000
to: s3a://bucket/lineitem/.hive-staging_hive_2016-08-01_04-31-26_432_5317262787271448273-1/-ext-10000

Should we consider a file-by-file move in such cases?
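
To make the file-by-file idea concrete, here is a minimal sketch of moving each part file as its own task on a thread pool. It is not taken from the attached patches. It uses java.nio.file on a local file system as a stand-in, and the class name ParallelFileMoveSketch, the method moveFilesInParallel and the threads parameter are all hypothetical. In Hive the same pattern would presumably go through the Hadoop FileSystem API instead.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not Hive code: moves every regular file in srcDir
// into dstDir, one task per file, on a fixed-size thread pool.
class ParallelFileMoveSketch {

    static int moveFilesInParallel(Path srcDir, Path dstDir, int threads)
            throws IOException, InterruptedException, ExecutionException {
        Files.createDirectories(dstDir);

        // List the part files up front so the listing is complete before any move starts.
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(srcDir)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> pending = new ArrayList<>();
            for (Path src : files) {
                // Each move is independent of the others, so they can run concurrently.
                Callable<Path> task = () -> Files.move(src, dstDir.resolve(src.getFileName()));
                pending.add(pool.submit(task));
            }
            for (Future<Path> f : pending) {
                f.get(); // surfaces a per-file failure as ExecutionException
            }
        } finally {
            pool.shutdown();
        }
        return files.size();
    }
}
```

One trade-off to keep in mind: a single directory rename either happens or does not, whereas a file-by-file move can fail partway and leave some files in each location, so the caller would need to handle cleanup or retry.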

> Parallelize jobClose phases
> ---------------------------
>                 Key: HIVE-14128
>                 URL:
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor
>    Affects Versions: 1.2.0, 2.0.0, 2.1.0
>            Reporter: Ashutosh Chauhan
>            Assignee: Ashutosh Chauhan
>         Attachments: HIVE-14128.1.patch, HIVE-14128.patch

This message was sent by Atlassian JIRA
