hadoop-hdfs-issues mailing list archives

From "Uma Maheswara Rao G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4504) DFSOutputStream#close doesn't always release resources (such as leases)
Date Thu, 22 Aug 2013 06:30:58 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747288#comment-13747288 ]

Uma Maheswara Rao G commented on HDFS-4504:

Thanks for thinking of this. Let me see if I can summarize the issue. If there is a streamer
failure, and the DFSClient calls completeFile, the last block in the file will transition
from state UNDER_CONSTRUCTION to state COMMITTED. This, in turn, will prevent later calls
made by the client to recoverLease from working, since we only do block recovery on blocks
in state UNDER_CONSTRUCTION or UNDER_RECOVERY. The ZombieStreamCloser will not be able to
run block recovery either, for the same reason. Is that a fair summary?
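To make the summary concrete, here is a minimal sketch of the gating behavior described above. The enum values mirror the HDFS block states named in the comment, but the gating method itself is an illustration of the rule, not actual NameNode code.

```java
// Sketch of why a COMMITTED last block prevents later lease recovery.
// Enum values mirror HDFS block states; the check is illustrative only.
enum BlockUCState { UNDER_CONSTRUCTION, UNDER_RECOVERY, COMMITTED, COMPLETE }

class LeaseRecoverySketch {
    // Block recovery is only started for blocks still under construction
    // or already under recovery. A block that completeFile moved to
    // COMMITTED no longer qualifies, so recoverLease cannot help.
    static boolean canStartBlockRecovery(BlockUCState state) {
        return state == BlockUCState.UNDER_CONSTRUCTION
            || state == BlockUCState.UNDER_RECOVERY;
    }
}
```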

Really, the question is what is the right behavior in DFSOutputStream#close after a streamer
failure? Calling completeFile(force=false) seems wrong. We need to perform block recovery
in this scenario, as you said. Calling completeFile(force=true) will start block recovery
(it calls {{FSNamesystem#internalReleaseLease}}). That seems like the right thing to do.
Where we are not sure that the client received the last packet's ack, we should not call completeFile.
Here, completeFile commits the block on the assumption that the client already got the ack for
the last packet, which means the DN would also have finalized the block and will report it in
some time. So in such cases we should not go with completeFile; instead we should somehow recover
the file lease in a way that also initiates finalization at the DN. Please note that we are tweaking
things here to fall into the case Todd pointed out earlier. Maybe the better thing is to check the
holder name: only if the file's holder is the current holder should we recover the file lease with
the new API.
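The decision being argued for can be sketched as follows. The field and method names here are illustrative, not the real DFSClient API; this just captures the rule "commit only if the last packet was acked, otherwise recover the lease."

```java
// Hypothetical decision logic for DFSOutputStream#close after a streamer
// failure; names are illustrative, not the actual HDFS client internals.
class CloseAfterFailureSketch {
    boolean lastPacketAcked;   // did the client see the ack for the last packet?

    // Returns the action close() should take, per the reasoning above.
    String actionOnClose() {
        if (lastPacketAcked) {
            // Safe to commit: the DN has finalized the block (or will
            // report it shortly), so completeFile's COMMITTED transition
            // matches reality.
            return "completeFile";
        }
        // Not safe to commit the last block; ask the NN to recover the
        // lease instead, which also initiates finalization at the DN.
        return "recoverLease";
    }
}
```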

It might make sense to create a new RPC with a different name than {{completeFile}}, to avoid
confusion with the other function of completeFile. But fundamentally, starting block recovery
is what we need to do here, and we might as well do it from DFSOutputStream#close. I think
this will solve the problem.

I think it may solve it, but IMO a simpler approach would be to just reassign the lease holder at
the NN, under some special name, for these zombie streams. In that case, the NN will take care of
recovering them correctly, and the current client's renewLease will not renew the lease for these
files. I feel we can consider this option once more for its simplicity and lower risk.
The ZombieStreamCloser should then just ensure it has successfully informed the NN about the zombie
stream, instead of calling complete; other things can stay the same. We can think more about whether
this has any other impacts.
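A toy model of this proposal: reassign the lease of a zombie stream to a synthetic holder name so the original client's renewLease no longer keeps it alive, and NN-driven recovery takes over. The "ZOMBIE_" prefix and all method names are assumptions for illustration, not anything in the NameNode.

```java
import java.util.HashMap;
import java.util.Map;

// Toy lease table modeling the proposal: hand the zombie stream's lease
// to a synthetic holder so the real client's renewLease stops covering it.
// Naming ("ZOMBIE_" prefix, method names) is assumed, not real NN code.
class LeaseTableSketch {
    private final Map<String, String> holderByPath = new HashMap<>();

    void open(String path, String holder) { holderByPath.put(path, holder); }

    // Mark a stream as zombie: reassign its lease to a synthetic holder.
    void markZombie(String path) {
        holderByPath.computeIfPresent(path, (p, h) -> "ZOMBIE_" + h);
    }

    // renewLease only renews files the client still holds under its own
    // name, so the zombie file's lease expires and the NN recovers it.
    boolean renews(String path, String clientHolder) {
        return clientHolder.equals(holderByPath.get(path));
    }
}
```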

Currently I think (a guess; I need to look at this once more) that if the same holder tries
recoverLease from the client, it may not be allowed, since the same client that holds the file
would be the one trying to recover it. If so, with the above proposal we would need to allow
this via some indication.

> DFSOutputStream#close doesn't always release resources (such as leases)
> -----------------------------------------------------------------------
>                 Key: HDFS-4504
>                 URL: https://issues.apache.org/jira/browse/HDFS-4504
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-4504.001.patch, HDFS-4504.002.patch, HDFS-4504.007.patch, HDFS-4504.008.patch,
HDFS-4504.009.patch, HDFS-4504.010.patch, HDFS-4504.011.patch, HDFS-4504.014.patch, HDFS-4504.015.patch,
> {{DFSOutputStream#close}} can throw an {{IOException}} in some cases.  One example is
if there is a pipeline error and then pipeline recovery fails.  Unfortunately, in this case,
some of the resources used by the {{DFSOutputStream}} are leaked.  One particularly important
resource is file leases.
> So it's possible for a long-lived HDFS client, such as Flume, to write many blocks to
a file, but then fail to close it.  Unfortunately, the {{LeaseRenewerThread}} inside the client
will continue to renew the lease for the "undead" file.  Future attempts to close the file
will just rethrow the previous exception, and no progress can be made by the client.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
