hadoop-common-issues mailing list archives

From "Eli Collins (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8598) Server-side Trash
Date Sun, 15 Jul 2012 23:35:34 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13414808#comment-13414808 ]

Eli Collins commented on HADOOP-8598:
-------------------------------------

Forgot to mention that the pluggable trash policy makes less sense server-side; it should probably
be replaced with a delete hook in FsShell, since reasonable policies might want to do things
that can't run in the NN.
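
A rough sketch of what such a shell-side delete hook could look like, in plain Java rather than real FsShell code. The names ShellDeleteHookSketch, DeleteHook and beforeDelete are hypothetical, and the idea that a hook can veto a delete is an assumption made for illustration, not something the comment specifies:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy stand-in for a shell that runs client-side hooks before deleting (hypothetical). */
final class ShellDeleteHookSketch {

    /** Hypothetical hook run in the client process; returning false vetoes the delete. */
    interface DeleteHook {
        boolean beforeDelete(String path);
    }

    private final List<DeleteHook> hooks = new ArrayList<>();
    // Stands in for the call that asks the file system to delete the path.
    private final Consumer<String> fileSystemDelete;

    ShellDeleteHookSketch(Consumer<String> fileSystemDelete) {
        this.fileSystemDelete = fileSystemDelete;
    }

    void addHook(DeleteHook hook) {
        hooks.add(hook);
    }

    /** Runs every hook in order; deletes only if none of them vetoes. */
    boolean rm(String path) {
        for (DeleteHook hook : hooks) {
            if (!hook.beforeDelete(path)) {
                return false;
            }
        }
        fileSystemDelete.accept(path);
        return true;
    }
}
```

Because the hooks run in the client, a site policy could do work here (calling an external service, say) that would not be appropriate inside the NN.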
                
> Server-side Trash
> -----------------
>
>                 Key: HADOOP-8598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8598
>             Project: Hadoop Common
>          Issue Type: New Feature
>    Affects Versions: 2.0.0-alpha
>            Reporter: Eli Collins
>            Assignee: Eli Collins
>            Priority: Critical
>
> There are a number of problems with Trash that continue to result in permanent data loss
for users. The primary reasons trash is not used:
> - Trash is configured client-side and not enabled by default.
> - Trash is shell-only. FileSystem, WebHDFS, HttpFs, etc. never use trash.
> - If trash fails, for example, because we can't create the trash directory or the move
itself fails, trash is bypassed and the data is deleted.
> Trash was designed as a feature to help end users via the shell; however, in my experience
the primary use of trash is to help administrators implement data retention policies (this
was also the motivation for HADOOP-7460).  One could argue that (periodic read-only) snapshots
are a better solution to this problem; however, snapshots are not slated for Hadoop 2.x and
trash is complementary to snapshots (and backup) - e.g. you may create and delete data within
your snapshot or backup window - so it makes sense to revisit trash's design. I think it's
worth bringing trash's functionality in line with what users need.
> I propose we enable trash on a per-filesystem basis and implement it server-side, i.e.
trash becomes an HDFS feature enabled by administrators. Because the trash emptier lives in
HDFS and users already have a per-filesystem trash directory, we're mostly there already. The
design preference from HADOOP-2514 was for trash to be implemented in "user code"; however,
given (a) these problems, (b) that we have a lot more user-facing APIs than the shell and
(c) that clients increasingly span file systems (via federation and symlinks), this design choice
makes less sense. This is why we already use a per-filesystem trash/home directory instead
of the user's client-configured one - otherwise trash would not work because renames can't
span file systems.
> In short, HDFS trash would work similarly to how it does today; the difference is that
client delete APIs would result in a rename into trash (a la TrashPolicyDefault#moveToTrash)
if trash is enabled. Like today, it would be renamed to the trash directory on the file system
where the file being removed resides. The primary difference is that enablement and policy
are configured server-side by administrators and are used regardless of the API used to access
the file system. The one exception to this is that I think we should continue to support the
explicit skipTrash shell option. The rationale for skipTrash (HADOOP-6080) is that a move to
trash may fail in cases where an rm may not, for example if a user has a home directory quota
and does an rmr /tonsOfData. Without a way to bypass this, the user has no way (unless we
revisit quotas, permissions or trash paths) to remove a directory they have permissions to
remove without getting their quota adjusted by an admin. The skip trash API can be implemented
by adding an explicit FileSystem API that bypasses trash and modifying the shell to use it
when skipTrash is enabled. Given that users must explicitly specify skipTrash, the API is less
error prone. We could have the shell ask for confirmation and annotate the API private to FsShell
to discourage programmatic use. This is not ideal but can be done compatibly (unlike redefining
quotas, permissions or trash paths).
> In terms of compatibility, while this proposal is technically an incompatible change
(client-side configuration that disables trash, and use of skipTrash with a previous FsShell
release, will now both be ignored if server-side trash is enabled, and non-HDFS file systems
would need to make similar changes), I think it's worth targeting for Hadoop 2.x given that
the new semantics preserve the current semantics. In 2.x I think we should preserve FsShell
based trash and support both it and server-side trash (defaults to disabled). For trunk/3.x
I think we should remove the FsShell based trash entirely and enable server-side trash by
default.
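
The delete path proposed in the description above can be sketched on a local directory tree with java.nio.file. This is a toy model under stated assumptions, not HDFS code: the class ServerSideTrashSketch, its methods and the user/<name>/.Trash/Current layout are all illustrative choices, and the sketch does not handle name collisions inside trash or the trash emptier.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Toy model of server-side trash on a local directory tree (hypothetical, not HDFS). */
final class ServerSideTrashSketch {
    private final Path root;            // stands in for one file system
    private final boolean trashEnabled; // server-side setting chosen by the administrator

    ServerSideTrashSketch(Path root, boolean trashEnabled) {
        this.root = root;
        this.trashEnabled = trashEnabled;
    }

    /** Per-user trash directory on the same file system as the data (assumed layout). */
    Path trashDir(String user) {
        return root.resolve("user").resolve(user).resolve(".Trash").resolve("Current");
    }

    /** Every delete API funnels through here, so trash applies whatever the client is. */
    void delete(String user, Path target) throws IOException {
        if (!trashEnabled) {
            deleteSkipTrash(target);
            return;
        }
        // Mirror the original path underneath the user's trash directory.
        Path dest = trashDir(user).resolve(root.relativize(target).toString());
        Files.createDirectories(dest.getParent());
        // If the move fails, the exception propagates; there is no fallback to deletion.
        Files.move(target, dest);
    }

    /** Explicit bypass, analogous to the proposed skip-trash API. */
    void deleteSkipTrash(Path target) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(target)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Note that a failed move surfaces as an error instead of silently deleting the data, which is the opposite of the bypass behaviour listed among the problems above; deleteSkipTrash remains the one deliberate way around trash.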

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
