hadoop-hdfs-issues mailing list archives

From "Jesse Yates (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3370) HDFS hardlink
Date Wed, 27 Jun 2012 22:26:44 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402620#comment-13402620 ]

Jesse Yates commented on HDFS-3370:
-----------------------------------

I'd like to propose an alternative to 'real' hardlinks: "reference-counted soft links", or
all the hardness you really need in a distributed FS.

In this implementation of "hard" links, I would propose that the namenode (NN) where the file
is created is considered the "owner" of that file. Initially, when created, the file has a
reference count of 1 in the local namespace. If you want another hardlink to the file in the
same namespace, you talk to the NN and request another handle to that file, which implicitly
updates the reference count. The count could be stored in memory (and journaled) or written
as part of the file metadata (more on that later, but let's ignore that for the moment).
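
As a concrete illustration, here is a minimal sketch of the owner-side bookkeeping; the class
and method names are hypothetical, not existing HDFS code, and in practice the counts would
also have to be journaled as noted above:

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical owner-side reference counting: one counter per inode,
// seeded at 1 on create and bumped for each new hardlink handle.
public class HardlinkRefCounter {
    private final ConcurrentHashMap<Long, AtomicInteger> refCounts =
            new ConcurrentHashMap<>();

    // Called when the owner NN creates the file: the count starts at 1.
    public void onCreate(long inodeId) {
        refCounts.put(inodeId, new AtomicInteger(1));
    }

    // Called for each new hardlink handle granted in the same namespace;
    // returns the updated reference count.
    public int addReference(long inodeId) {
        return refCounts.get(inodeId).incrementAndGet();
    }
}
{code}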

Suppose instead that you are in a separate namespace and want a hardlink to a file in the
original namespace. Then you would make a request to your NN (NNa) for a hardlink. Since NNa
doesn't own the file you want to reference, it makes a hardlink request to the NN that originally
created the file, the file's 'owner' (NNb). NNb then says, "Cool, I've got your request and
incremented the ref-count for the file." NNa can then grant your request and give you a link
to that file. The failure case here is either (see the sketch after this list):
1) NNb goes down, in which case you can just keep the reference requests around and batch
them when NNb comes back up.
2) NNa goes down mid-request - if NNa doesn't receive an ACK back for the granted request,
it can then disregard that request and re-decrement the count for that hardlink.
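
A rough sketch of that flow from NNa's side, covering both failure cases (the OwnerClient
interface and all names here are assumptions for illustration, not existing HDFS interfaces):

{code:java}
import java.io.IOException;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical cross-namespace link flow as seen by the requesting NN (NNa).
public class CrossNamespaceLinker {

    // RPC facade for talking to the owner namenode (NNb).
    interface OwnerClient {
        void incrementRef(String path) throws IOException;  // bump the ref-count on NNb
        void decrementRef(String path) throws IOException;  // undo a granted increment
    }

    private final OwnerClient owner;
    // Failure case 1: NNb is down, so park the request and batch it later.
    private final Queue<String> pendingIncrements = new ConcurrentLinkedQueue<>();

    CrossNamespaceLinker(OwnerClient owner) {
        this.owner = owner;
    }

    // Returns true if the link was granted now, false if queued for retry.
    public boolean requestLink(String path) {
        try {
            owner.incrementRef(path);  // NNb records the new reference
            return true;               // NNa can now hand the link to the client
        } catch (IOException ownerDown) {
            pendingIncrements.add(path);  // replay when NNb comes back up
            return false;
        }
    }

    // Failure case 2: NNa never saw the client's ACK for a granted request,
    // so it disregards the request and re-decrements the count on NNb.
    public void rollbackUnacked(String path) throws IOException {
        owner.decrementRef(path);
    }
}
{code}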

Deleting a hardlink then follows a similar process. You issue a request to the owner NN,
either directly from the client if you are deleting a link in the current namespace, or through
a proxy NN to the original namenode. The owner NN then decrements the reference count on the
file and allows the deletion of the link. If the reference count ever hits 0, the NN also
deletes the file, since there are no valid references to it left.
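
The owner-side delete path could then look something like this (again hypothetical names,
extending the ref-counter sketch above):

{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical owner-side unlink: decrement the count and, at zero,
// reclaim the file itself since no namespace can reach it any more.
public class HardlinkDeleter {
    private final ConcurrentHashMap<Long, AtomicInteger> refCounts;

    HardlinkDeleter(ConcurrentHashMap<Long, AtomicInteger> refCounts) {
        this.refCounts = refCounts;
    }

    // Returns the remaining reference count after the unlink.
    public int removeReference(long inodeId) {
        int remaining = refCounts.get(inodeId).decrementAndGet();
        if (remaining == 0) {
            refCounts.remove(inodeId);
            reclaimFile(inodeId);  // no valid references remain anywhere
        }
        return remaining;
    }

    private void reclaimFile(long inodeId) {
        // Placeholder: the real NN would schedule block deletion here.
    }
}
{code}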

This has the implication, though, that the file will not be visible in the namespace
that created it if all the hardlinks to it there are removed. This means it essentially becomes
a 'hidden' inode. We could, in the future, also work out a mechanism to transfer the hidden
inode to a NN that has valid references to it (maybe via a gossip-style protocol), but that
would be out of the current scope.

There are some implications to this model. If the owner NN manages the ref-count in memory
and that NN goes down, its whole namespace becomes inaccessible, including _creating new
hardlinks_ to any of the files (inodes) that it owns. However, the owner NN going down doesn't
preclude the other NNs from serving the file from their own 'soft' inodes.

Alternatively, the NN could hold a lock on a hardlinked file, with the ref-counts and
ownership info in the file metadata. This might introduce some overhead when creating new
hardlinks (you need to reopen and modify the block, or periodically write a new block with
the new information - this latter actually opens a route to do ref-count management via appends
to a file-ref file), but it has the added advantage that if the owner NN crashed, an alternative
NN could come along and claim ownership of that file. This is similar to doing Paxos-style
leader election for a given hardlinked file, combined with leader leases. However, this is
very unlikely to see lots of fluctuation, as the leader can just reclaim the leader token via
appends to the file-owner file, with periodic rewrites to minimize file size.
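
The lease record appended to that file-owner file might look like the following (a minimal
sketch; the fields and the claim rule are assumptions mirroring the lease-plus-leader-election
idea above):

{code:java}
import java.time.Instant;

// Hypothetical owner-lease record, appended to the file-owner file.
// The current owner renews by appending a fresh record before expiry.
public final class OwnerLease {
    final String ownerNamenode;  // NN currently holding ownership
    final Instant expiresAt;     // lease deadline

    OwnerLease(String ownerNamenode, Instant expiresAt) {
        this.ownerNamenode = ownerNamenode;
        this.expiresAt = expiresAt;
    }

    // Another NN may claim ownership only once the lease has lapsed.
    boolean claimableBy(String candidate, Instant now) {
        return now.isAfter(expiresAt) && !candidate.equals(ownerNamenode);
    }
}
{code}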

The on-disk representation of the extreme version I'm proposing is then this: the full file
is actually composed of three pieces - (1) the actual data, and then two metadata files,
"extents" (to add a new word/definition); (2) an external-reference extent: each time a reference
is made to the file, a new count is appended, and it can periodically be re-compacted to a
single value; and (3) an owner extent with the current NN owner and the lease time on the file,
dictating who controls overall deletion of the file (since ref-counts are done via the
external-reference extent). This means (2) and (3) are hidden inodes, only accessible to the
namenode. We can minimize overhead on these file extents by ensuring a single writer via
messaging to the owner NN (as specified by the owner extent), though this is not strictly
necessary.
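
Putting the three pieces together, the layout could be modeled like this (all paths and
naming conventions here are purely illustrative):

{code:java}
// Hypothetical three-piece layout for a hardlinked file. Extents (2)
// and (3) are hidden inodes visible only to the namenode.
public final class HardlinkedFileLayout {
    final String dataPath;        // (1) the actual file data
    final String refExtentPath;   // (2) appended ref-count deltas, periodically compacted
    final String ownerExtentPath; // (3) current owner NN and its lease

    HardlinkedFileLayout(String base) {
        this.dataPath = base;
        this.refExtentPath = base + ".refs";     // illustrative naming only
        this.ownerExtentPath = base + ".owner";  // illustrative naming only
    }
}
{code}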

Further, (1) could itself become a hidden inode if all the local namespace references are
removed, and it could eventually be transferred over to another NN shard (namespace) to keep
overhead at a minimum, though (again) this is not a strict necessity.

The design retains the NN view of files as directory entries, just entries with a little bit
of metadata. The metadata could be in memory or part of the file and periodically modified,
but that’s more implementation detail than anything (as mentioned above).
                
> HDFS hardlink
> -------------
>
>                 Key: HDFS-3370
>                 URL: https://issues.apache.org/jira/browse/HDFS-3370
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Hairong Kuang
>            Assignee: Liyin Tang
>         Attachments: HDFS-HardLink.pdf
>
>
> We'd like to add a new feature, hardlink, to HDFS that allows hardlinked files to share
data without copying. Currently we will support hardlinking only closed files, but it could
be extended to unclosed files as well.
> Among many potential use cases of the feature, the following two are primarily used at
Facebook:
> 1. This provides a lightweight way for applications like HBase to create a snapshot;
> 2. This also allows an application like Hive to move a table to a different directory
without breaking currently running Hive queries.


       
