hadoop-user mailing list archives

From Wellington Chevreuil <wellington.chevre...@gmail.com>
Subject Re: HDFS HA(Based on QJM) Failover Frequently with Large FSimage and Busy Requests
Date Wed, 03 May 2017 11:58:18 GMT
Hi Yizhou,

Yes, this might be causing the failovers. I've seen situations where the download of a large
fsimage from the SBNN, combined with additional requests to the ANN, led to higher disk latency,
which caused any service RPC request that requires the HDFS write lock to take longer to
process. This can cause a failover if all service RPC handlers stay busy for longer than the
45-second FC timeout, so that the FC request sits in the queue for that whole time.

You may be able to confirm this by collecting jstacks of the ANN (you would need a few
jstacks covering the failover period). The pattern in the jstacks would be that all but one
service RPC handler thread are waiting on the same lock, while only one is runnable.
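
A minimal sketch of how that pattern could be checked across several saved jstack files. It
assumes handler threads are named like "IPC Server handler N on PORT" (the exact name can
differ between Hadoop versions), and the port 8022 used in the example is only a hypothetical
service RPC port; use whatever dfs.namenode.servicerpc-address is set to on your cluster.

```python
import re
from collections import Counter

# Assumed thread-name format for NameNode RPC handlers; verify against your own jstack.
HANDLER_RE = re.compile(r'^"IPC Server handler \d+ on (?:default port )?(\d+)"')
STATE_RE = re.compile(r'^\s*java\.lang\.Thread\.State: (\w+)')


def handler_states(jstack_text, port):
    """Count Thread.State values for the RPC handler threads bound to `port`."""
    counts = Counter()
    in_handler = False
    for line in jstack_text.splitlines():
        header = HANDLER_RE.match(line)
        if header:
            # Start of a handler thread; only track it if it serves the given port.
            in_handler = int(header.group(1)) == port
            continue
        if line.startswith('"'):
            # Start of some other thread's stack.
            in_handler = False
            continue
        state = STATE_RE.match(line)
        if state and in_handler:
            counts[state.group(1)] += 1
            in_handler = False
    return counts


def looks_lock_bound(counts):
    """True when exactly one handler is RUNNABLE and at least one other is not."""
    return sum(counts.values()) > 1 and counts.get("RUNNABLE", 0) == 1
```

Running handler_states() on each jstack taken around the failover and seeing looks_lock_bound()
return True for most of them would be consistent with the pattern above. The stacks of the
waiting threads still need to be read by hand to confirm they are all waiting on the same lock.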

You might also want to check for "blocked" process messages in the dmesg output. If there are
no such messages, set hung_task_timeout_secs to 40 seconds until the next failover, so that
you can catch a potential OS pause causing the failover. Such a pause may be an indication of
file system cache flushes, as described below:



> On 26 Apr 2017, at 23:41, Anu Engineer <aengineer@hortonworks.com> wrote:
> 1.ANN(active namenode) downloading fsimage.ckpt_* from SNN(standby namenode) leads to
> very high disk io, at the same time, zkfc fails to monitor the health of ann due to timeout.
> Is there any relationship between high disk io and zkfc monitor request timeout? Every failover
> happened when ckpt download, but not every ckpt download leads to failover.
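
For the dmesg check suggested in the reply above, here is a rough sketch of scanning saved
dmesg output for the kernel's hung-task warnings, which normally look like "INFO: task
java:12345 blocked for more than 120 seconds." The function names are hypothetical helpers,
and the sketch assumes the command name contains no spaces. The threshold itself is read from
/proc/sys/kernel/hung_task_timeout_secs (sysctl kernel.hung_task_timeout_secs) and changing it
requires root.

```python
import re

# Kernel hung-task warning, e.g.
# "INFO: task java:12345 blocked for more than 120 seconds."
HUNG_RE = re.compile(r'INFO: task (\S+):(\d+) blocked for more than (\d+) seconds')


def find_hung_tasks(dmesg_text):
    """Return a (command, pid, seconds) tuple for each hung-task warning found."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in HUNG_RE.finditer(dmesg_text)
    ]


def read_hung_task_timeout(path="/proc/sys/kernel/hung_task_timeout_secs"):
    """Read the current hung-task threshold in seconds (Linux only)."""
    with open(path) as f:
        return int(f.read().strip())
```

If find_hung_tasks() returns entries for the NameNode's java process with timestamps close to
a failover, that would point towards an OS-level pause rather than HDFS lock contention alone.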
