hadoop-hdfs-issues mailing list archives

From "Colin Patrick McCabe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-3107) HDFS truncate
Date Mon, 29 Sep 2014 19:11:38 GMT

    [ https://issues.apache.org/jira/browse/HDFS-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152094#comment-14152094 ]

Colin Patrick McCabe commented on HDFS-3107:

Thanks for waiting.  I'm checking out the design doc.

In proposed approach truncate is performed only on a closed file. If the file is opened for
write an 
attempt to truncate fails.

Just a style change, but maybe "Truncate cannot be performed on a file which is currently
open for writing" would be clearer.

Conceptually, truncate removes all full blocks of the file and then starts a recovery process
for the 
last block if it is not fully truncated. The truncate recovery is similar to standard HDFS
lease recovery
procedure. That is, NameNode sends a DatanodeCommand to one of the DataNodes containing block

replicas. The primary DataNode synchronizes the new length among the replicas, and then confirms
it to 
the NameNode by sending commitBlockSynchronization() message, which completes the 
truncate. Until the truncate recovery is complete the file is assigned a lease, which revokes
the ability for 
other clients to modify that file.
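The recovery flow quoted above can be sketched as a toy simulation. This is only an illustration of the described handshake; the class, field, and method names below (other than commitBlockSynchronization, which the doc mentions) are invented, not the real HDFS types:

```java
import java.util.Arrays;

public class TruncateRecoverySketch {
    long[] replicaLengths;    // last-block replica lengths on the DataNodes (made up)
    boolean leaseHeld = true; // NN holds a lease on the file during recovery

    TruncateRecoverySketch(long[] replicaLengths) {
        this.replicaLengths = replicaLengths;
    }

    // 1. NN sends a DatanodeCommand to a primary DN with the target length.
    // 2. The primary synchronizes that length across all replicas.
    void primarySynchronize(long newLastBlockLength) {
        Arrays.fill(replicaLengths, newLastBlockLength);
        commitBlockSynchronization();
    }

    // 3. The primary confirms to the NN, which completes the truncate and
    //    releases the lease so other clients may modify the file again.
    void commitBlockSynchronization() {
        leaseHeld = false;
    }

    public static void main(String[] args) {
        TruncateRecoverySketch r = new TruncateRecoverySketch(new long[]{128, 128, 128});
        r.primarySynchronize(40); // truncate the last block's tail to 40 bytes
        System.out.println(Arrays.toString(r.replicaLengths)); // [40, 40, 40]
        System.out.println(r.leaseHeld);                       // false
    }
}
```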

I think a diagram might help here.  The impression I'm getting is that we have some "truncation
point" like this:
             truncation point
                    v
| A   | B   | C   | D   | E   | F   |
In this case, blocks E and F would be invalidated by the NameNode, and block recovery would
begin on block D?
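To make that arithmetic concrete, here is a small sketch of how a truncation offset maps onto kept, recovered, and invalidated blocks. The block size and offsets are invented for illustration and don't reflect real HDFS defaults:

```java
public class TruncateBlockMath {
    // Full blocks strictly before the truncation point are kept untouched.
    static long fullBlocksKept(long newLength, long blockSize) {
        return newLength / blockSize;
    }

    // A partial tail means the last kept block needs block recovery.
    static boolean needsBlockRecovery(long newLength, long blockSize) {
        return newLength % blockSize != 0;
    }

    // Everything past the truncation point is invalidated by the NameNode.
    static long blocksInvalidated(long fileLength, long newLength, long blockSize) {
        long totalBlocks = (fileLength + blockSize - 1) / blockSize;
        return totalBlocks - fullBlocksKept(newLength, blockSize)
                - (needsBlockRecovery(newLength, blockSize) ? 1 : 0);
    }

    public static void main(String[] args) {
        long blockSize = 128;                 // hypothetical block size
        long fileLength = 6 * blockSize;      // six full blocks: A..F
        long newLength = 3 * blockSize + 40;  // truncation point inside block D

        System.out.println(fullBlocksKept(newLength, blockSize));               // 3 (A, B, C)
        System.out.println(needsBlockRecovery(newLength, blockSize));           // true (D recovered)
        System.out.println(blocksInvalidated(fileLength, newLength, blockSize)); // 2 (E, F)
    }
}
```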

"Conceptually, truncate removes all full blocks of the file" seems to suggest we're removing
all blocks, so it might be nice to rewrite this as "Truncate removes all full blocks after
the truncation point."

 Full blocks if any are deleted instantaneously. And if there is nothing more to truncate
NameNode returns success to the client.

They're invalidated instantly, but not deleted instantly, right?  Clients may still be reading
from them on the various datanodes.

public boolean truncate(Path src, long newLength)
throws IOException;
Truncate file src to the specified newLength.
- true if the file has been truncated to the desired newLength and is immediately available
to be reused for write operations such as append, or
- false if a background process of adjusting the length of the last block has been started,
and clients should wait for it to complete before they can proceed with further file updates.

Hmm, do we really need the boolean here?  It seems like the client could simply try to reopen
the file until it no longer got a {{RecoveryInProgressException}} (or lease exception, as
the case may be).  The client will have to do this anyway most of the time, since most truncates
don't fall on even block boundaries.
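The wait loop a client would need either way can be sketched like this. TruncatingFs and the nested RecoveryInProgressException are stand-ins for the real HDFS types so the sketch is self-contained; a real client would back off between attempts:

```java
import java.io.IOException;

public class TruncateRetrySketch {
    static class RecoveryInProgressException extends IOException {}

    // Stand-in for the truncate/append surface of the file system.
    interface TruncatingFs {
        boolean truncate(String src, long newLength) throws IOException;
        void append(String src) throws IOException; // fails while recovery runs
    }

    // Truncate, then poll append() until lease/block recovery has finished.
    // Returns the number of append attempts it took (0 if truncate was immediate).
    static int truncateAndWait(TruncatingFs fs, String src, long newLength,
                               int maxAttempts) throws IOException {
        boolean done = fs.truncate(src, newLength);
        int attempts = 0;
        while (!done && attempts < maxAttempts) {
            attempts++;
            try {
                fs.append(src); // succeeds only once the last block is recovered
                done = true;
            } catch (RecoveryInProgressException e) {
                // still recovering; a real client would sleep/back off here
            }
        }
        if (!done) throw new IOException("recovery did not finish in time");
        return attempts;
    }

    public static void main(String[] args) throws IOException {
        // Simulated fs whose recovery completes after two failed appends.
        TruncatingFs fake = new TruncatingFs() {
            int calls = 0;
            public boolean truncate(String src, long newLength) { return false; }
            public void append(String src) throws IOException {
                if (++calls < 3) throw new RecoveryInProgressException();
            }
        };
        System.out.println(truncateAndWait(fake, "/f", 424, 10)); // 3
    }
}
```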

 It should be noted that applications that cache data may still see old bytes of the file
in the cache. It is advised for such applications to incorporate techniques which would retire
cached data when the file is truncated.

One issue that I see here is that {{DFSInputStream}} users will continue to see the old, longer
length for a long time potentially.  {{DFSInputStream#locatedBlocks}} will continue to have
the block information it had prior to truncation.  And eventually, whenever they try to read
from that longer length, they'll get read failures since the blocks will actually be unlinked.
 These will look like IOExceptions to the user.  I don't know if there's a good way around
this problem with the design proposed here.

bq. \[truncate with snapshots\]

I don't think we should commit anything to trunk until we figure out how this integrates with
snapshots.  It just impacts the design too much.  When you start seriously thinking about
snapshots, integrating this with block recovery (by adding {{BEING_TRUNCATED}}, etc.) does
not look like a very good option.  A better option would be simply to copy the partial block
and have the snapshotted version reference the old block, and the new version reference the
(shorter) copy.  That corresponds to your approach #3, right?  truncate is presumably a rare
operation and doing the truncation in-place for non-snapshotted files is an optimization we
could do later.

The copy approach is also nice for {{DFSInputStream}}, since readers can continue reading
from the old (longer) copy until the readers close.  If we truncated that copy directly, this
would not work.
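A toy model of the copy-based approach, just to illustrate the reference structure (the Block record, block IDs, and lengths are invented; a real block would also carry a generation stamp):

```java
import java.util.ArrayList;
import java.util.List;

public class CopyTruncateSketch {
    // Toy model: a block is an id plus a length; a copied block gets a fresh id.
    record Block(long id, long length) {}

    // Copy-based truncate: full blocks before the truncation point are kept by
    // reference; the partial last block is *copied* into a fresh, shorter block,
    // so a snapshot that still references the old block list is untouched and
    // existing readers can keep reading the old (longer) block.
    static List<Block> truncateWithCopy(List<Block> blocks, long newLength,
                                        long blockSize, long freshBlockId) {
        int fullBlocks = (int) (newLength / blockSize);
        long tail = newLength % blockSize;
        List<Block> result = new ArrayList<>(blocks.subList(0, fullBlocks));
        if (tail > 0) {
            result.add(new Block(freshBlockId, tail)); // shorter copy, new id
        }
        return result;
    }

    public static void main(String[] args) {
        long blockSize = 128;
        List<Block> live = List.of(new Block(1, blockSize),
                                   new Block(2, blockSize),
                                   new Block(3, 90));
        List<Block> snapshot = live; // a snapshot just references the old list

        List<Block> truncated = truncateWithCopy(live, blockSize + 40, blockSize, 4);

        System.out.println(snapshot.size());           // 3: snapshot unchanged
        System.out.println(truncated.size());          // 2: block 1 + the copy
        System.out.println(truncated.get(1).id());     // 4: fresh block id
        System.out.println(truncated.get(1).length()); // 40
    }
}
```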

We could commit this to a branch, but I think we should hold off on committing to trunk until
we figure out the snapshot story.

> HDFS truncate
> -------------
>                 Key: HDFS-3107
>                 URL: https://issues.apache.org/jira/browse/HDFS-3107
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>          Components: datanode, namenode
>            Reporter: Lei Chang
>            Assignee: Plamen Jeliazkov
>         Attachments: HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch, HDFS-3107.patch,
> HDFS-3107.patch, HDFS_truncate.pdf, HDFS_truncate_semantics_Mar15.pdf, HDFS_truncate_semantics_Mar21.pdf,
>   Original Estimate: 1,344h
>  Remaining Estimate: 1,344h
> Systems with transaction support often need to undo changes made to the underlying storage
> when a transaction is aborted. Currently HDFS does not support truncate (a standard Posix
> operation) which is a reverse operation of append, which makes upper layer applications use
> ugly workarounds (such as keeping track of the discarded byte range per file in a separate
> metadata store, and periodically running a vacuum process to rewrite compacted files) to overcome
> this limitation of HDFS.

This message was sent by Atlassian JIRA
