Date: Thu, 22 Aug 2013 06:30:58 +0000 (UTC)
From: "Uma Maheswara Rao G (JIRA)"
To: hdfs-issues@hadoop.apache.org
Subject: [jira] [Commented] (HDFS-4504) DFSOutputStream#close doesn't always release resources (such as leases)

    [ https://issues.apache.org/jira/browse/HDFS-4504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747288#comment-13747288 ]

Uma Maheswara Rao G commented on HDFS-4504:
-------------------------------------------

{quote}
Thanks for thinking of this. Let me see if I can summarize the issue. If there is a streamer failure, and the DFSClient calls completeFile, the last block in the file will transition from state UNDER_CONSTRUCTION to state COMMITTED. This, in turn, will prevent later calls made by the client to recoverLease from working, since we only do block recovery on blocks in state UNDER_CONSTRUCTION or UNDER_RECOVERY. The ZombieStreamCloser will not be able to run block recovery either, for the same reason. Is that a fair summary?
{quote}

Yes.

{quote}
Really, the question is what is the right behavior in DFSOutputStream#close after a streamer failure? Calling completeFile(force=false) seems wrong. We need to perform block recovery in this scenario, as you said. Calling completeFile(force=true) will start block recovery (it calls {{FSNamesystem#internalReleaseLease}}). That seems like the right thing to do.
{quote}

Where we are not sure that the client received the last packet ack, we should not call completeFile. Here completeFile commits the block on the assumption that the client already got the ack for the last packet, which would mean the DN has also finalized the block and will for sure report it in some time. So in such cases we should not go with completeFile; we should instead recover the file lease somehow, which should also initiate finalization at the DN. Please note that the tweak here is for the case Todd pointed out earlier. Maybe the better thing is to check the holder name: only if the file's holder is the current holder should we recover the file lease with the new API.
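A minimal sketch of that holder-check-then-recover idea. Both {{getLeaseHolder}} and {{recoverZombieFile}} below are hypothetical stand-ins for the proposed new API, not existing {{ClientProtocol}} methods:

{code:java}
import java.io.IOException;

// Hypothetical protocol: stands in for the proposed "new API"; neither
// method exists in ClientProtocol today.
interface ZombieRecoveryProtocol {
  /** Returns the current lease holder of the file, or null if none. */
  String getLeaseHolder(String src) throws IOException;

  /** Releases the lease and starts block recovery on the last block. */
  void recoverZombieFile(String src, String clientName) throws IOException;
}

class ZombieCloseSketch {
  private final ZombieRecoveryProtocol namenode;
  private final String clientName;

  ZombieCloseSketch(ZombieRecoveryProtocol namenode, String clientName) {
    this.namenode = namenode;
    this.clientName = clientName;
  }

  /**
   * Called instead of completeFile() when we cannot be sure the client got
   * the ack for the last packet. completeFile() would move the last block
   * to COMMITTED, after which block recovery (and hence recoverLease) can
   * no longer run, since recovery only applies to blocks in state
   * UNDER_CONSTRUCTION or UNDER_RECOVERY.
   */
  void closeAfterStreamerFailure(String src) throws IOException {
    // Holder check from the comment above: only recover if this client
    // still holds the lease on the file.
    if (clientName.equals(namenode.getLeaseHolder(src))) {
      namenode.recoverZombieFile(src, clientName);
    }
  }
}
{code}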
{quote}
It might make sense to create a new RPC with a different name than {{completeFile}}, to avoid confusion with the other function of completeFile. But fundamentally, starting block recovery is what we need to do here, and we might as well do it from DFSOutputStream#close. I think this will solve the problem.
{quote}

I think it may solve it, but IMO a simpler thing would be to just reassign the lease holder at the NN with some name for these zombie streams. In that case the NN will take care of recovering them correctly, and the current client's renewLease will not renew the lease for these files. We can think once about this option; it gives more simplicity and less risk, I feel. ZombieStreamManager should just ensure it has successfully informed the NN about the zombie stream, instead of calling complete; other things can stay the same. We can think more about whether this has any other impacts. Currently I think (a guess, but I need to look once at this) that if the same holder tries recoverLease from the client, it may not be allowed, since the same client that is the holder for that file is trying to recover it. If yes, we need to allow this under the above proposal by some indication.

> DFSOutputStream#close doesn't always release resources (such as leases)
> -----------------------------------------------------------------------
>
>                 Key: HDFS-4504
>                 URL: https://issues.apache.org/jira/browse/HDFS-4504
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-4504.001.patch, HDFS-4504.002.patch, HDFS-4504.007.patch, HDFS-4504.008.patch, HDFS-4504.009.patch, HDFS-4504.010.patch, HDFS-4504.011.patch, HDFS-4504.014.patch, HDFS-4504.015.patch, HDFS-4504.016.patch
>
>
> {{DFSOutputStream#close}} can throw an {{IOException}} in some cases. One example is if there is a pipeline error and then pipeline recovery fails. Unfortunately, in this case, some of the resources used by the {{DFSOutputStream}} are leaked. One particularly important resource is file leases.
> So it's possible for a long-lived HDFS client, such as Flume, to write many blocks to a file, but then fail to close it. Unfortunately, the {{LeaseRenewerThread}} inside the client will continue to renew the lease for the "undead" file. Future attempts to close the file will just rethrow the previous exception, and no progress can be made by the client.
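To make the failure mode in the description concrete, here is a sketch of how a long-lived writer ends up with an "undead" file. {{DistributedFileSystem#recoverLease}} is a real API, but per the discussion above it cannot help once completeFile has moved the last block to COMMITTED; treat this as an illustration of the symptom, not a fix:

{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

class UndeadFileSketch {
  static void writeAndClose(DistributedFileSystem fs, Path path, byte[] data)
      throws IOException {
    FSDataOutputStream out = fs.create(path);
    out.write(data);
    try {
      out.close();  // pipeline error + failed pipeline recovery -> IOException
    } catch (IOException e) {
      // The file is now "undead": the client's lease renewer keeps renewing
      // the lease, and calling close() again just rethrows the previous
      // exception, so the file can never be completed from this client.
      // recoverLease asks the NameNode to release the lease and recover the
      // last block, but recovery only applies to UNDER_CONSTRUCTION or
      // UNDER_RECOVERY blocks -- not to a block already COMMITTED.
      boolean closed = fs.recoverLease(path);
      if (!closed) {
        // Recovery is still in progress (or blocked); the caller has to
        // retry later or give up on the file.
      }
    }
  }
}
{code}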