hadoop-hdfs-issues mailing list archives

From "Haohui Mai (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-9763) Add merge api
Date Sat, 27 Feb 2016 07:25:18 GMT

    [ https://issues.apache.org/jira/browse/HDFS-9763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15170419#comment-15170419 ]

Haohui Mai commented on HDFS-9763:

I agree with [~cmccabe]. I don't think this is a good idea.

The concerns are definitely valid. Addressing them with arbitrary settings usually indicates
that the design is problematic.

If the whole point is to batch RPCs and avoid TOCTOU, maybe you want to adopt the design of
transactional file systems.


> Add merge api
> -------------
>                 Key: HDFS-9763
>                 URL: https://issues.apache.org/jira/browse/HDFS-9763
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Ashutosh Chauhan
>            Assignee: Xiaobing Zhou
>         Attachments: HDFS_Merge_API_Proposal.pdf
> It would be good to add a merge(Path dir1, Path dir2, ...) API to HDFS. The semantics would
> be to move all files under dir1 to dir2, renaming files in case of collisions.
> In the absence of this API, Hive[1] has to check for a collision for each file, then come
> up with a unique name and try again, and so on. This is inefficient in multiple ways:
> 1) It generates a huge number of calls to the NN (at least 2 * the number of source files in dir1).
> 2) It suffers from a TOCTOU[2] bug for the client-picked name in case of a collision.
> 3) The whole operation is not atomic.
> A merge API as outlined above would be immensely useful for Hive and potentially for other
> HDFS users.
> [1] https://github.com/apache/hive/blob/release-2.0.0-rc1/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L2576
> [2] https://en.wikipedia.org/wiki/Time_of_check_to_time_of_use
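
For readers unfamiliar with the pattern the issue describes, here is a minimal sketch of a client-side "check for a collision, pick a new name, then rename" loop. It is not Hive's code and does not use the HDFS API: java.nio.file on a local directory stands in for HDFS, each counted call stands in for one NameNode RPC, and the class name ClientSideMerge, the "_copy_N" naming scheme, and the call counter are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative only: a local directory stands in for HDFS, and every counted
// call stands in for one round trip to the NameNode.
class ClientSideMerge {
    // Number of RPC-like calls issued so far (existence checks plus moves).
    static int calls = 0;

    // Moves every regular file in src into dst, choosing a new name when the
    // original name is already taken. Returns the number of files moved.
    static int merge(Path src, Path dst) throws IOException {
        List<Path> sources;
        try (Stream<Path> listing = Files.list(src)) {
            sources = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        int moved = 0;
        for (Path file : sources) {
            String name = file.getFileName().toString();
            Path target = dst.resolve(name);
            int suffix = 1;
            // Time of check: ask whether the candidate name is free.
            while (exists(target)) {
                target = dst.resolve(name + "_copy_" + suffix++);
            }
            // Time of use: nothing stops another client from creating `target`
            // between the check above and the move below (the TOCTOU window).
            calls++;
            Files.move(file, target);
            moved++;
        }
        return moved;
    }

    // One existence check, counted as one call.
    static boolean exists(Path p) {
        calls++;
        return Files.exists(p);
    }
}
```

Even with no collisions, each file costs one check and one move, which matches the "at least 2 * number of source files" figure in the description. Every collision adds further checks, and the sequence as a whole is neither atomic nor safe against concurrent writers.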

This message was sent by Atlassian JIRA
