hadoop-common-issues mailing list archives

From "Sanjay Radia (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11452) Revisit FileSystem.rename(path, path, options)
Date Thu, 05 Jan 2017 23:41:58 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15802888#comment-15802888 ]

Sanjay Radia commented on HADOOP-11452:

Steve suggested:
bq. note that we could consider adding a new enum operation Rename.ATOMIC_REQUIRED which will
fail if atomicity is not supported

We have considered such things (including this specific one) multiple times in the past, in the
context of S3 and also the local file system, and not just for rename but for other methods
too. Neither the local FS nor S3 has exactly the same semantics as HDFS for each method. *Here
is the main issue:* file systems like LocalFileSystem are used for testing apps, and for a
long time S3 was used simply for testing or for non-critical usage on the cloud. Folks were
willing to live with the occasional inconsistency, or with the performance overhead of, say,
copy-delete for rename on S3. If applications like Hive or Spark used Rename.ATOMIC_REQUIRED,
then the app would just fail on S3, and those use cases (testing, non-critical, or willing to
live with the performance overhead) would not be supported and their users would be unhappy.

Now that users want to run production apps on cloud storage like S3, apps like Hive are being
modified to run well on S3 by changing how they commit (say via the metastore or a manifest
file instead of a rename).

So adding the Rename.ATOMIC_REQUIRED flag is easy. But is it going to be useful? Please articulate
how it will be used. For example, if we were to change Hive to use Rename.ATOMIC_REQUIRED, then
Hive would just fail on S3.
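To make the failure mode concrete, here is a small hypothetical sketch (none of these class or enum names are the actual Hadoop API) of what an ATOMIC_REQUIRED flag would do against a store that cannot do atomic rename:

```java
import java.util.EnumSet;

// Hypothetical sketch only: RenameOption and AtomicRenameDemo are made-up
// names to illustrate the argument; they are NOT the real Hadoop classes.
enum RenameOption { OVERWRITE, ATOMIC_REQUIRED }

class AtomicRenameDemo {
    // storeIsAtomic = true models HDFS; false models an S3-like store
    // where rename is implemented as copy + delete.
    static boolean rename(boolean storeIsAtomic, EnumSet<RenameOption> opts) {
        if (opts.contains(RenameOption.ATOMIC_REQUIRED) && !storeIsAtomic) {
            // This is the problem described above: the whole job fails
            // on S3 instead of falling back to the slower rename.
            throw new UnsupportedOperationException(
                "atomic rename not supported by this store");
        }
        return true; // rename proceeds (possibly as copy + delete)
    }
}
```

Any app that sets the flag unconditionally would therefore hard-fail on S3, exactly as described for Hive above.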

So I think we should continue to make progress on getting Hive, Spark and others to run first
class on S3. I don't think Rename.ATOMIC_REQUIRED helps. I believe it makes sense to have an
FS.whatFeaturesDoYouSupport() API so that an app like Hive could be implemented to run first
class on HDFS, S3, Azure Blob Storage etc. by querying the FS features and then using a different
implementation for, say, committing the output of a job. In some cases it may be better to use
a totally different approach that works on all FSs, such as a manifest file, or to depend on
the Hive Metastore to commit. (It turns out Hive needs to be able to commit multiple tables,
and hence even the rename-dir approach is not sufficient.)
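A rough sketch of the capability-query idea: the app asks the FS what it supports and selects a commit strategy accordingly, rather than demanding a feature and failing. All names here (FsFeature, chooseCommitter, the committer labels) are illustrative assumptions, not existing Hadoop API:

```java
import java.util.Set;

// Hypothetical illustration of the FS.whatFeaturesDoYouSupport() proposal;
// the feature enum and committer names are invented for this sketch.
class CommitterChooser {
    enum FsFeature { ATOMIC_RENAME, CONSISTENT_LISTING }

    // Pick a job-commit strategy based on the features the store reports,
    // instead of hard-failing when a feature is missing.
    static String chooseCommitter(Set<FsFeature> features) {
        if (features.contains(FsFeature.ATOMIC_RENAME)) {
            // HDFS-style: atomically rename the task/job dir into place.
            return "rename-committer";
        }
        // S3-style fallback: commit via a manifest file or the metastore.
        return "manifest-committer";
    }
}
```

The point of this design is that the same Hive or Spark build can run first class everywhere: it degrades to a different commit mechanism instead of refusing to run.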

> Revisit FileSystem.rename(path, path, options)
> ----------------------------------------------
>                 Key: HADOOP-11452
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11452
>             Project: Hadoop Common
>          Issue Type: Task
>          Components: fs
>    Affects Versions: 2.7.3
>            Reporter: Yi Liu
>            Assignee: Steve Loughran
>         Attachments: HADOOP-11452-001.patch, HADOOP-11452-002.patch
> Currently in {{FileSystem}}, {{rename}} with _Rename options_ is protected and carries a
_deprecated_ annotation, and the default implementation is not atomic.
> So this method cannot be used from outside. On the other hand, HDFS has a good, atomic
implementation. (Also, interestingly, in {{DFSClient}} the _deprecated_ annotations for these
two methods are the opposite way around.)
> It makes sense to make {{rename}} with _Rename options_ public, since it's atomic for
rename+overwrite, and it also saves RPC calls if the user wants rename+overwrite.

This message was sent by Atlassian JIRA
