Mailing-List: contact hdfs-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hdfs-issues@hadoop.apache.org
Date: Fri, 5 Feb 2016 01:55:39 +0000 (UTC)
From: "Ashutosh Chauhan (JIRA)" <jira@apache.org>
To: hdfs-issues@hadoop.apache.org
Message-ID: <JIRA.12936934.1454637296000.307151.1454637339841@Atlassian.JIRA>
In-Reply-To: <JIRA.12936934.1454637296000@Atlassian.JIRA>
References: <JIRA.12936934.1454637296000@Atlassian.JIRA>
 <JIRA.12936934.1454637296724@arcas>
Subject: [jira] [Updated] (HDFS-9763) Add merge api
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HDFS-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated HDFS-9763:
-----------------------------------
    Description: 
It would be good to add a merge(Path dir1, Path dir2, ... ) API to HDFS. The semantics would be to move all files under dir1 to dir2, renaming files in case of name collisions.
In the absence of this API, Hive[1] has to check each file for a collision, come up with a unique name, try again, and so on. This is problematic in several ways:

1) It generates a huge number of calls to the NameNode (NN): at least 2 * the number of source files in dir1.
2) It suffers from a TOCTOU[2] bug: the name the client picks after a collision may itself be taken before the rename happens.
3) The whole operation is not atomic.

A merge API as outlined above would be immensely useful for Hive and potentially for other HDFS users.

[1] https://github.com/apache/hive/blob/release-2.0.0-rc1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2576
[2] https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use
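
For illustration, the per-file check-then-rename pattern described above can be sketched roughly as follows. This is not Hive's code and does not use the HDFS API: it is a local-filesystem analogue built on java.nio.file, and the class name, the method name, and the "_copy_N" naming scheme are all hypothetical. It counts only existence checks and moves, to show where the "at least 2 calls per file" figure and the TOCTOU window come from.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Local-filesystem analogue of a client-side merge; names are hypothetical. */
final class RenameOnCollisionSketch {

    /**
     * Moves every regular file in src into dst, picking a new name on collision.
     * Returns the number of existence checks plus moves that were issued
     * (the directory listing itself is not counted).
     */
    static int mergeInto(Path src, Path dst) throws IOException {
        // Snapshot the listing first so files are not moved while iterating.
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(src)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    files.add(p);
                }
            }
        }

        int calls = 0;
        for (Path file : files) {
            String name = file.getFileName().toString();
            Path target = dst.resolve(name);
            int suffix = 1;
            // One existence check per candidate name, so at least one per file.
            while (true) {
                calls++;
                if (!Files.exists(target)) {
                    break;
                }
                target = dst.resolve(name + "_copy_" + suffix++);
            }
            // TOCTOU window: another client could create 'target' right here.
            calls++;
            // Without REPLACE_EXISTING this throws FileAlreadyExistsException
            // if the race was lost; a real client would have to retry.
            Files.move(file, target);
        }
        return calls;
    }
}
```

With N source files and no collisions this issues exactly 2 * N calls, and each collision adds at least one more check. If the loop fails partway through, some files have already been moved and others have not, which is the non-atomicity in point 3.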

  was:
It will be good to add merge(Path dir1, Path dir2, ... ) api to HDFS. Semantics will be to move all files under dir1 to dir2 and doing a rename of files in case of collisions.
In absence of this api, Hive[1] has to check for collision for each file and then come up unique name and try again and so on. This is inefficient in multiple ways:

1) It generates huge number of calls on NN (atleast 2*number of source files in dir1)
2) It suffers from TOCTOU[2] bug for client picked up name in case of collision.
3) Whole operation is not atomic.

A merge api outlined as above will be immensely useful for Hive and potentially to other HDFS users.

[1] https://github.com/apache/hive/blob/release-2.0.0-rc1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2576
[2]https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use


> Add merge api
> -------------
>
>                 Key: HDFS-9763
>                 URL: https://issues.apache.org/jira/browse/HDFS-9763
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Ashutosh Chauhan
>
> It would be good to add a merge(Path dir1, Path dir2, ... ) API to HDFS. The semantics would be to move all files under dir1 to dir2, renaming files in case of name collisions.
> In the absence of this API, Hive[1] has to check each file for a collision, come up with a unique name, try again, and so on. This is problematic in several ways:
> 1) It generates a huge number of calls to the NameNode (NN): at least 2 * the number of source files in dir1.
> 2) It suffers from a TOCTOU[2] bug: the name the client picks after a collision may itself be taken before the rename happens.
> 3) The whole operation is not atomic.
> A merge API as outlined above would be immensely useful for Hive and potentially for other HDFS users.
> [1] https://github.com/apache/hive/blob/release-2.0.0-rc1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2576
> [2] https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use

