hive-issues mailing list archives

From "mahesh kumar behera (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-20529) Statistics update in S3 is taking time at target side during REPL Load
Date Tue, 11 Sep 2018 04:27:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-20529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

mahesh kumar behera updated HIVE-20529:
---------------------------------------
    Description: The statistics operations access the file system to get the number of files
created by the operation. On S3 this adds a delay of 2-3 seconds. The file list can instead be
obtained from the event info in the replication directory and used to update the statistics.
 (was: Operations like insert and add partition create a staging directory to generate the
files and then move the created files to the actual location. In the replication flow, the
files are first copied to the staging directory and then moved (renamed) to the actual table
location. On S3, move is not an atomic operation: internally it does a copy and a delete, so
it cannot guarantee the required consistency. It is therefore better to copy the files
directly to the actual location. This avoids creating the staging directory (which takes
1-2 seconds on S3) and the move (which takes time proportional to file size).)

> Statistics update in S3 is taking time at target side during REPL Load
> ----------------------------------------------------------------------
>
>                 Key: HIVE-20529
>                 URL: https://issues.apache.org/jira/browse/HIVE-20529
>             Project: Hive
>          Issue Type: Sub-task
>          Components: repl
>    Affects Versions: 4.0.0
>            Reporter: mahesh kumar behera
>            Assignee: mahesh kumar behera
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> The statistics operations access the file system to get the number of files created by
the operation. On S3 this adds a delay of 2-3 seconds. The file list can instead be obtained
from the event info in the replication directory and used to update the statistics.
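
The idea in the description can be illustrated with a minimal sketch. This is not Hive's
implementation: the ReplicatedFile record, the stats_from_event_files helper, and the example
paths are hypothetical, and the sketch assumes that each file entry in the event info carries
a size. It shows only that the file count (and, under that assumption, the total size) can be
derived from an in-memory file list without listing the target directory on S3.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ReplicatedFile:
    """One file entry as it might appear in a replication event (hypothetical shape)."""
    path: str
    size_bytes: int  # assumption: the event info records a size for each file


def stats_from_event_files(files: Iterable[ReplicatedFile]) -> dict:
    """Derive basic statistics from the event's file list, with no file system calls."""
    files = list(files)
    return {
        "numFiles": len(files),
        "totalSize": sum(f.size_bytes for f in files),
    }


# Hypothetical file list taken from one event in the replication directory.
event_files = [
    ReplicatedFile("s3a://bucket/warehouse/t1/000000_0", 1024),
    ReplicatedFile("s3a://bucket/warehouse/t1/000001_0", 2048),
]
stats = stats_from_event_files(event_files)
print(stats)
```

Because the counts come from data already read during the load, the per-operation cost no
longer depends on the latency of an S3 listing call.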



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
