hadoop-common-user mailing list archives

From Alexandru Pacurar <Alexandru.Pacu...@PropertyShark.com>
Subject Question about namenode HA
Date Fri, 05 Dec 2014 09:24:41 GMT
Hello,

I'm trying to configure HA for the HDFS namenode with QJM, following the instructions from
here: http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html

My setup is the following: Ubuntu 12.04.5 LTS on all the nodes, Hadoop 2.4.1 installed, two
namenodes (each also runs a QJM process), and one machine for a third QJM.

Initially we didn't have HA, so this is a migration from a non-HA cluster to an HA-enabled
one.

For the migration I:

* added all the necessary configuration specified in the link above
* stopped the non-HA cluster
* started the three QJMs
* started my first namenode (the one that was the only namenode in the non-HA setup)
  with the new configs
* on my second namenode, ran hdfs namenode -bootstrapStandby, which copied the fsimage
  and went OK
* also on my second namenode, ran hdfs namenode -initializeSharedEdits, which initialized
  all three of my QJMs
* then started the second namenode

After this I started to have some problems. Both namenodes were in standby, logging the
following WARN:
"2014-12-04 13:35:56,074 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unable
to trigger a roll of the active NN
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category
JOURNAL is not supported in state standby"

After half an hour of this, I thought I could just transition one of them to active, since
the warning suggested that would solve the problem. So I ran hdfs haadmin -transitionToActive
node1, but this gave me the following fatal error, which I haven't been able to figure out:

2014-12-04 14:16:55,835 FATAL org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Unknown
error encountered while tailing edits. Shutting down standby NN.
java.io.IOException: There appears to be a gap in the edit log.  We expected txid 1, but got
txid 1542903

Now if I try to restart the second namenode, it just gives me the same error, and if I try
to restart my other node, which is still running, I get the same.

The thing is that before configuring HA, my dfs.data.dir had only this edits file:
edits_inprogress_0000000000000000001, so it should start at txid 1. After I initialized the
shared edits, it jumps to edits_0000000000001542903-0000000000001542904.
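
To make the mismatch concrete, here is a minimal Python sketch (not Hadoop's own code) that reads the first and last txid out of edits file names like the two above and reports where a sequence of segments stops being contiguous. The helper names parse_segment and find_gap are hypothetical, and the sketch assumes the only naming patterns are edits_<first>-<last> and edits_inprogress_<first>, as seen in this message.

```python
import re

# Finalized segments look like edits_<first>-<last>; the open segment
# looks like edits_inprogress_<first>. Both use zero-padded txids.
_FINALIZED = re.compile(r"^edits_(\d+)-(\d+)$")
_IN_PROGRESS = re.compile(r"^edits_inprogress_(\d+)$")


def parse_segment(name):
    """Return (first_txid, last_txid or None) for an edits file name, else None."""
    m = _FINALIZED.match(name)
    if m:
        return int(m.group(1)), int(m.group(2))
    m = _IN_PROGRESS.match(name)
    if m:
        return int(m.group(1)), None
    return None


def find_gap(names, expected_txid=1):
    """Return (expected, got) at the first gap, or None if no gap is found."""
    segments = [s for s in map(parse_segment, names) if s is not None]
    segments.sort(key=lambda s: s[0])  # order by first txid only
    for first, last in segments:
        if first != expected_txid:
            return expected_txid, first
        if last is None:
            # An in-progress segment has no known end, so stop checking here.
            return None
        expected_txid = last + 1
    return None


# The segment found in the shared edits after -initializeSharedEdits:
print(find_gap(["edits_0000000000001542903-0000000000001542904"]))
# prints (1, 1542903)
```

With only the original edits_inprogress_0000000000000000001 file the same check finds no gap, which matches the expectation that tailing should begin at txid 1.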

Could anyone shed some light on this issue for me?

Thank you,
Alex
