hadoop-hdfs-issues mailing list archives

From "Ravi Prakash (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-4882) Namenode LeaseManager checkLeases() runs into infinite loop
Date Tue, 11 Nov 2014 00:32:35 GMT

    [ https://issues.apache.org/jira/browse/HDFS-4882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205669#comment-14205669 ]

Ravi Prakash commented on HDFS-4882:
------------------------------------

Thanks for your review, Colin! Your understanding is correct. In this case, for a very strange
reason which I have not yet been able to uncover, the FSNamesystem wasn't able to recover
the lease. I am investigating this root issue in HDFS-7342. In the meantime, however, I'd argue
that the Namenode should never enter an infinite loop, whatever the reason. Instead of
assuming that we have fixed every possible reason why a lease couldn't be recovered, we should
relinquish the lock regularly. We should also display on the web UI how many files are open for
writing and allow ops to forcibly close open files (HDFS-7307). The way in which this error
manifests (the NN suddenly stops working) is egregious.
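
To make the "relinquish the lock regularly" idea concrete, here is a minimal sketch. It is not the actual LeaseManager code or any attached patch: the class BoundedLeaseChecker, its Lease stand-in, the recoverable flag and the maxHoldMillis budget are all hypothetical names invented for illustration. Each pass visits every expired lease at most once and stops when the time budget is used up, so a lease that can never be recovered cannot keep the lock held forever.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative sketch only; none of these names are real HDFS classes. */
class BoundedLeaseChecker {
    /** Hypothetical stand-in for a lease. */
    static final class Lease {
        final String holder;
        final boolean recoverable; // false models a lease whose recovery keeps failing

        Lease(String holder, boolean recoverable) {
            this.holder = holder;
            this.recoverable = recoverable;
        }
    }

    // Stands in for the namesystem write lock (assumption).
    private final ReentrantLock lock = new ReentrantLock();
    private final List<Lease> expired = new ArrayList<>();
    private final long maxHoldNanos;

    BoundedLeaseChecker(long maxHoldMillis) {
        this.maxHoldNanos = TimeUnit.MILLISECONDS.toNanos(maxHoldMillis);
    }

    void addExpired(String holder, boolean recoverable) {
        lock.lock();
        try {
            expired.add(new Lease(holder, recoverable));
        } finally {
            lock.unlock();
        }
    }

    int pending() {
        lock.lock();
        try {
            return expired.size();
        } finally {
            lock.unlock();
        }
    }

    /** Runs one bounded pass and returns how many leases were released. */
    int checkLeases() {
        int released = 0;
        lock.lock();
        try {
            long start = System.nanoTime();
            // Iterate over a snapshot: each lease is tried at most once per pass,
            // and removing from the live list does not disturb the iteration.
            for (Lease lease : new ArrayList<>(expired)) {
                if (System.nanoTime() - start >= maxHoldNanos) {
                    break; // budget spent: give up the lock, resume on the next pass
                }
                if (lease.recoverable) {
                    expired.remove(lease);
                    released++;
                }
                // An unrecoverable lease is left in place for a later pass
                // instead of being retried in a tight loop.
            }
        } finally {
            lock.unlock();
        }
        return released;
    }

    public static void main(String[] args) {
        BoundedLeaseChecker checker = new BoundedLeaseChecker(25);
        checker.addExpired("client-stuck", false);
        checker.addExpired("client-ok", true);
        System.out.println("released=" + checker.checkLeases()
                + " pending=" + checker.pending());
    }
}
```

The stuck lease stays pending after the pass, which is exactly the state that the web UI count and a forced close (HDFS-7307) would let ops see and act on.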

sortedLeases is also used externally, in FSNamesystem.getCompleteBlocksTotal(). We
were also actively modifying it in checkLeases. I'm sure we can move things around to keep
using SortedSets, but I don't know whether this Collection will ever become big enough for
the performance difference to matter. What do you think?

> Namenode LeaseManager checkLeases() runs into infinite loop
> -----------------------------------------------------------
>
>                 Key: HDFS-4882
>                 URL: https://issues.apache.org/jira/browse/HDFS-4882
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client, namenode
>    Affects Versions: 2.0.0-alpha, 2.5.1
>            Reporter: Zesheng Wu
>            Assignee: Ravi Prakash
>            Priority: Critical
>         Attachments: 4882.1.patch, 4882.patch, 4882.patch, HDFS-4882.1.patch, HDFS-4882.patch
>
>
> Scenario:
> 1. cluster with 4 DNs
> 2. the size of the file to be written is a little more than one block
> 3. write the first block to 3 DNs, DN1->DN2->DN3
> 4. all the data packets of the first block are successfully acked and the client sets the pipeline stage to PIPELINE_CLOSE, but the last packet isn't sent out
> 5. DN2 and DN3 go down
> 6. the client recovers the pipeline, but no new DN is added to the pipeline because the current pipeline stage is PIPELINE_CLOSE
> 7. the client continues writing the last block, and tries to close the file after writing all the data
> 8. the NN finds that the penultimate block doesn't have enough replicas (our dfs.namenode.replication.min=2), so the client's close runs into an indefinite loop (HDFS-2936); at the same time, the NN sets the last block's state to COMPLETE
> 9. shut down the client
> 10. the file's lease exceeds the hard limit
> 11. LeaseManager notices this and begins lease recovery by calling fsnamesystem.internalReleaseLease()
> 12. but the last block's state is COMPLETE, and this triggers the lease manager's infinite loop and prints massive logs like this:
> {noformat}
> 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease.  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1] has expired hard limit
> 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease.  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1], src=/user/h_wuzesheng/test.dat
> 2013-06-05,17:42:25,695 WARN org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File = /user/h_wuzesheng/test.dat, block blk_-7028017402720175688_1202597, lastBLockState=COMPLETE
> 2013-06-05,17:42:25,695 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Started block recovery for file /user/h_wuzesheng/test.dat lease [Lease.  Holder: DFSClient_NONMAPREDUCE_-1252656407_1, pendingcreates: 1]
> {noformat}
> (the 3rd log line is a debug log added by us)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
