hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9763) Add merge api
Date Wed, 02 Mar 2016 19:55:18 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176355#comment-15176355 ]

Colin Patrick McCabe commented on HDFS-9763:
--------------------------------------------

-1 for the proposed merge API for the reasons [~wheat9] and I stated earlier.  It's complicated,
Hive-specific, locks us into the current Hive semantics, and isn't needed to address the TOCTOU bug.

If the goal is to reduce the number of RPCs that Hive makes to the NameNode, there are much
simpler ways to do that, such as allowing a single RPC to contain multiple HDFS requests.
We could just have a generic batch API that allows the client to send multiple requests as
part of a "batch".  Sending a bunch of renames in one RPC would be just one use for this API.
It would be useful for applications other than Hive, and would allow Hive to change its merge
semantics over time without modifying the source code of HDFS.
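
To make the batch idea concrete, here is a minimal sketch of what "several renames in one RPC" might look like. Everything in it is hypothetical: `BatchSketch`, `Rename`, `ToyNameNode`, and `executeBatch` are illustrative names, not existing HDFS classes or methods, and the NameNode is modelled as an in-memory set of paths. The sketch returns one result per operation and makes no claim about atomicity across the batch.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch only: none of these types exist in HDFS. */
public class BatchSketch {

    /** One queued request; only rename is modelled here. */
    public static class Rename {
        public final String src;
        public final String dst;

        public Rename(String src, String dst) {
            this.src = src;
            this.dst = dst;
        }
    }

    /** Toy stand-in for the NameNode namespace (assumption: a flat set of paths). */
    public static class ToyNameNode {
        public final Set<String> paths = new HashSet<>();
        public int rpcCount = 0;

        /** Handles the whole batch as one simulated RPC and reports success per request. */
        public List<Boolean> executeBatch(List<Rename> batch) {
            rpcCount++; // one round trip, however many requests are in the batch
            List<Boolean> results = new ArrayList<>();
            for (Rename r : batch) {
                // A rename succeeds only if the source exists and the target is free.
                boolean ok = paths.contains(r.src) && !paths.contains(r.dst);
                if (ok) {
                    paths.remove(r.src);
                    paths.add(r.dst);
                }
                results.add(ok);
            }
            return results;
        }
    }
}
```

In this model the client decides what to do with a failed rename (for example, pick a new name and resubmit), so the naming policy stays on the application side rather than inside HDFS.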

> Add merge api
> -------------
>
>                 Key: HDFS-9763
>                 URL: https://issues.apache.org/jira/browse/HDFS-9763
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Ashutosh Chauhan
>            Assignee: Xiaobing Zhou
>         Attachments: HDFS_Merge_API_Proposal.pdf
>
>
> It would be good to add a merge(Path dir1, Path dir2, ... ) API to HDFS. The semantics would
> be to move all files under dir1 to dir2, renaming files in case of collisions.
> In the absence of this API, Hive[1] has to check for a collision for each file, then come
> up with a unique name and try again, and so on. This is inefficient in multiple ways:
> 1) It generates a huge number of calls on the NN (at least 2 * the number of source files in dir1).
> 2) It suffers from a TOCTOU[2] bug for the client-picked name in case of a collision.
> 3) The whole operation is not atomic.
> A merge API outlined as above would be immensely useful for Hive and potentially for other
> HDFS users.
> [1] https://github.com/apache/hive/blob/release-2.0.0-rc1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2576
> [2] https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use
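
The client-side loop that the issue description refers to can be sketched roughly as follows. This is an illustration under stated assumptions, not Hive's actual code: `CollisionLoopSketch`, `ToyNamespace`, and `moveAll` are hypothetical names, the namespace is an in-memory set that counts simulated NameNode calls, and the `_copy_N` suffix is just one possible way to pick a unique name.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the check-then-rename pattern; not Hive or HDFS code. */
public class CollisionLoopSketch {

    /** Toy namespace that counts each simulated NameNode call. */
    public static class ToyNamespace {
        public final Set<String> paths = new HashSet<>();
        public int calls = 0;

        public boolean exists(String path) {
            calls++;
            return paths.contains(path);
        }

        public boolean rename(String src, String dst) {
            calls++;
            if (!paths.contains(src) || paths.contains(dst)) {
                return false;
            }
            paths.remove(src);
            paths.add(dst);
            return true;
        }
    }

    /** Moves each named file from srcDir to dstDir, choosing a new name on collision. */
    public static void moveAll(ToyNamespace ns, String srcDir, String dstDir, List<String> names) {
        for (String name : names) {
            String dst = dstDir + "/" + name;
            int n = 1;
            // Check, then act: in a real cluster another client could create dst
            // between exists() and rename(), which is the TOCTOU window.
            while (ns.exists(dst)) {
                dst = dstDir + "/" + name + "_copy_" + n++;
            }
            ns.rename(srcDir + "/" + name, dst);
        }
    }
}
```

Each file costs at least one `exists` call plus one `rename` call, which is where the "at least 2 * number of source files" figure comes from; every collision adds a further `exists` call.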



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
