hadoop-hdfs-dev mailing list archives

From Vinayakumar B <vinayakuma...@huawei.com>
Subject RE: confirm expected HA behavior
Date Thu, 06 Feb 2014 04:35:27 GMT
Hi Arpit,

In your case, you blocked requests only on port 8020, but ssh was still reachable, right?

Have you configured a fencing method, such as "sshfence"?

If you have, then the previous active NN should be killed before the next one is made active;
otherwise the shared storage needs to enforce a single-writer mechanism.

(If I am not wrong) QuorumJournalManager supports only one writer at a time, and whenever another
NN becomes active it fences the old writer. Hence, in your case, the requests from the old active
NN (NN1) were rejected by the JournalNodes and it shut down.
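
To make the single-writer fencing concrete, here is a minimal toy model in Python. It is only
a sketch of the epoch idea, not the actual QJM code: the class and method names are made up
for illustration, txid 567 is an invented earlier transaction, and a single journal node stands
in for the quorum (matching the 1/1 quorum in the log below).

```python
class EpochRejected(Exception):
    """Raised when a writer presents an epoch older than the promised one."""


class ToyJournalNode:
    """Hypothetical stand-in for a JournalNode; tracks only the promised epoch."""

    def __init__(self):
        self.last_promised_epoch = 0
        self.edits = []  # (epoch, txid) pairs accepted so far

    def new_epoch(self, epoch):
        # A NameNode becoming active must claim a strictly higher epoch.
        if epoch <= self.last_promised_epoch:
            raise EpochRejected(
                f"proposed epoch {epoch} is not newer than {self.last_promised_epoch}")
        self.last_promised_epoch = epoch

    def journal(self, epoch, txid):
        # Writes from any writer holding an older epoch are refused.
        if epoch < self.last_promised_epoch:
            raise EpochRejected(
                f"IPC's epoch {epoch} is less than the last promised epoch "
                f"{self.last_promised_epoch}")
        self.edits.append((epoch, txid))


jn = ToyJournalNode()
jn.new_epoch(1)        # NN1 becomes active with epoch 1
jn.journal(1, 567)     # accepted
jn.new_epoch(2)        # NN2 becomes active; this fences NN1

try:
    jn.journal(1, 568)  # NN1 still believes it is active
    nn1_error = None
except EpochRejected as e:
    nn1_error = str(e)

jn.journal(2, 568)     # the new writer is accepted
print(nn1_error)
```

In this model the old writer is never contacted; it is fenced simply because the journal
refuses its stale epoch on its next write.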

I think this behavior is appropriate.

Is there any problem you are seeing here?

Regards,
Vinayakumar B

-----Original Message-----
From: Arpit Gupta [mailto:arpit@hortonworks.com] 
Sent: 06 February 2014 03:58
To: hdfs-dev@hadoop.apache.org
Subject: confirm expected HA behavior

Hi

I have a scenario where I am trying to test how HDFS HA works in case of network issues. I
used iptables to block requests to the RPC port 8020 in order to simulate that. Below is
some info on what I did.


NN1 - Active
NN2 - Standby

Using iptables, block port 8020 on NN1 (http://stackoverflow.com/questions/7423309/iptables-block-access-to-port-8000-except-from-ip-address)
iptables -A INPUT -p tcp --dport 8020 -j DROP

NN2 transitions to active.

Run the following command to allow requests to port 8020 (http://stackoverflow.com/questions/10197405/iptables-remove-specific-rules)
iptables -D INPUT -p tcp --dport 8020 -j DROP

After this, NN1 shut itself down with:

2014-02-05 01:00:38,030 FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(354))
- Error: flush failed for required journal (JournalAndStream(mgr=QJM to [IP:8485], stream=QuorumOutputStream
starting at txid 568))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve
quorum size 1/1. 1 exceptions thrown:
68.142.244.23:8485: IPC's epoch 1 is less than the last promised epoch 2
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkRequest(Journal.java:410)


NN1 in this case shuts down with the above exception because it still believes it is active,
so its requests to the JNs fail. The operators would then have to restart NN1, which
could take a while depending on the image size. Hence I was wondering if there is a better way
to handle this case, where we could perhaps transition to standby when exceptions like the one
above are seen.
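
To illustrate the difference between the two behaviors, here is a small toy sketch in Python.
It is not Hadoop code: the class, the flag, and the state names are all hypothetical, and it
only contrasts "shut down on a stale epoch" with "transition to standby on a stale epoch".

```python
class StaleEpochError(Exception):
    """Hypothetical error for a write rejected because the epoch is too old."""


class ToyNameNode:
    def __init__(self, name, demote_on_stale_epoch):
        self.name = name
        self.state = "active"
        # Hypothetical switch: False models what I observed, True models the idea above.
        self.demote_on_stale_epoch = demote_on_stale_epoch

    def flush_edits(self, my_epoch, promised_epoch):
        """Try to flush edits; react to a stale-epoch rejection."""
        try:
            if my_epoch < promised_epoch:
                raise StaleEpochError(
                    f"epoch {my_epoch} is less than promised epoch {promised_epoch}")
            return "flushed"
        except StaleEpochError:
            # Another writer has taken over, so this node must stop writing.
            self.state = "standby" if self.demote_on_stale_epoch else "shutdown"
            return self.state


observed = ToyNameNode("NN1", demote_on_stale_epoch=False)
proposed = ToyNameNode("NN1", demote_on_stale_epoch=True)

print(observed.flush_edits(my_epoch=1, promised_epoch=2))
print(proposed.flush_edits(my_epoch=1, promised_epoch=2))
```

In both cases the node stops writing; the only difference is whether it stays up as a standby
or has to be restarted by an operator.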


Wanted to get the thoughts of others before I open an enhancement request.

Thanks
--
Arpit Gupta
Hortonworks Inc.
http://hortonworks.com/

