hadoop-common-issues mailing list archives

From Íñigo Goiri (JIRA) <j...@apache.org>
Subject [jira] [Updated] (HADOOP-15684) triggerActiveLogRoll stuck on dead name node, when ConnectTimeoutException happens.
Date Fri, 31 Aug 2018 17:33:00 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-15684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Íñigo Goiri updated HADOOP-15684:
---------------------------------
    Status: Open  (was: Patch Available)

> triggerActiveLogRoll stuck on dead name node, when ConnectTimeoutException happens. 
> ------------------------------------------------------------------------------------
>
>                 Key: HADOOP-15684
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15684
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Rong Tang
>            Assignee: Rong Tang
>            Priority: Critical
>         Attachments: 0001-RollEditLog-try-next-NN-when-exception-happens.patch, HADOOP-15684.000.patch,
HADOOP-15684.001.patch, HADOOP-15684.002.patch, hadoop--rollingUpgrade-SourceMachine001.log
>
>
> When a NameNode calls triggerActiveLogRoll and the cachedActiveProxy points to a dead NameNode,
the call throws a ConnectTimeoutException. The expected behavior is to try the next NN, but the
current logic does not do so; instead, it keeps retrying the dead NameNode, mistakenly treating it as the active one.
>  
> 2018-08-17 10:02:12,001 WARN [Edit log tailer] org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer:
Unable to trigger a roll of the active NN
> org.apache.hadoop.net.ConnectTimeoutException: Call From SourceMachine001/SourceIP001 to
TargetMachine001.ap.gbl:8020 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException:
20000 millis timeout org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$2.doWork(EditLogTailer.java:298)
>  
> C:\Users\rotang>ping TargetMachine001
> Pinging TargetMachine001[TargetIP001] with 32 bytes of data:
>  Request timed out.
>  Request timed out.
>  Request timed out.
>  Request timed out.
>  The attachments are a log file showing how the dead NameNode is retried repeatedly, and a
patch with the fix.
>  I replaced the actual machine names/IPs with SourceMachine001/SourceIP001 and TargetMachine001/TargetIP001.
>  
> How to repro:
> With a set of healthy running NNs, take down the active NN (do not let it come back during the test).
The standby NNs will then keep trying the dead (old active) NN, because it is the cached
one.
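
The expected behavior described in the issue (move on to the next NN when the cached proxy fails, instead of retrying the dead one) could look roughly like the sketch below. This is an illustration only, not the EditLogTailer code and not the attached patch: the class LogRollFailoverSketch, the NameNodeProxy interface, and the methods rollEditLog(), address() and cachedAddress() are hypothetical names introduced here, and a plain IOException stands in for Hadoop's ConnectTimeoutException.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical sketch of "try the next NN on failure"; not Hadoop's actual code. */
class LogRollFailoverSketch {

    /** Stand-in for an RPC proxy to one NameNode (assumed interface, not a Hadoop API). */
    interface NameNodeProxy {
        String address();
        void rollEditLog() throws IOException;
    }

    private final List<NameNodeProxy> nameNodes;
    // Index of the proxy currently believed to be the active NN (the "cached" one).
    private int cachedIndex = 0;

    LogRollFailoverSketch(List<NameNodeProxy> nameNodes) {
        if (nameNodes.isEmpty()) {
            throw new IllegalArgumentException("at least one NameNode is required");
        }
        this.nameNodes = nameNodes;
    }

    /**
     * Asks the cached NN to roll its edit log. On any IOException (for example a
     * connect timeout), advances the cached index to the next NN and tries again,
     * visiting each NN at most once per call.
     *
     * @return the address of the NN that accepted the roll request
     */
    String triggerActiveLogRoll() throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt < nameNodes.size(); attempt++) {
            NameNodeProxy proxy = nameNodes.get(cachedIndex);
            try {
                proxy.rollEditLog();
                return proxy.address(); // success: this proxy stays cached
            } catch (IOException e) {
                last = e;
                // Drop the failed proxy from the cache by moving to the next NN.
                cachedIndex = (cachedIndex + 1) % nameNodes.size();
            }
        }
        throw new IOException("No NameNode accepted the log roll request", last);
    }

    /** Address of the NN the next call will try first. */
    String cachedAddress() {
        return nameNodes.get(cachedIndex).address();
    }
}
```

The point of the sketch is that a failure changes which proxy is cached, so a later call does not start from the dead NN again. A real fix would also need to decide which exceptions should trigger the switch and how this interacts with the existing retry and timeout settings, which the sketch does not attempt to model.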



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org

