hadoop-mapreduce-user mailing list archives

From sam liu <samliuhad...@gmail.com>
Subject Re: Questions on rollback/upgrade HDFS with QJM HA enabled
Date Mon, 26 Jan 2015 06:27:55 GMT
For an HDFS rollback with QJM HA enabled, I tried the following steps, but they failed:

0. Stop the whole Hadoop cluster
1. Update env parameters to use old Hadoop binaries
2. Start JNs:
sudo -u hdfs $HADOOP_HOME/sbin/hadoop-daemon.sh --config "$HADOOP_CONF_DIR" start journalnode
3. Start the active NN with the '-rollback' flag:
sudo -u hdfs $HADOOP_HOME/bin/hadoop namenode -rollback

Note:
- This step completed, but the active NN then shut down on its own. The log shows:
15/01/25 21:57:48 INFO namenode.FSImage: Rolling back storage directory /hadoop/hdfs/name.
   new LV = -56; new CTime = 0
15/01/25 21:57:48 INFO namenode.NNUpgradeUtil: Rollback of /hadoop/hdfs/name is complete.
15/01/25 21:57:48 INFO util.ExitUtil: Exiting with status 0
15/01/25 21:57:48 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at bdvs1194.vmware.com/9.30.249.194
************************************************************/
4. Start the active NN again:
sudo -u hdfs $HADOOP_HOME/sbin/hadoop-daemon.sh --config "$HADOOP_CONF_DIR" start namenode
Note:
- The NN failed to start again. The error message is:
2015-01-25 21:59:06,745 ERROR org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: caught exception initializing http://hostname:8480/getJournal?jid=BICluster&segmentTxId=4924&storageInfo=-56%3A304881993%3A0%3ACID-652befc6-4b10-4b89-a6bc-a411af6ca4c8
org.apache.hadoop.hdfs.server.namenode.TransferFsImage$HttpGetFailedException: Fetch of http://hostname:8480/getJournal?jid=BICluster&segmentTxId=4924&storageInfo=-56%3A304881993%3A0%3ACID-652befc6-4b10-4b89-a6bc-a411af6ca4c8 failed with status code 403
Response message:
This node has namespaceId '0 and clusterId '' but the requesting node expected '304881993' and 'CID-652befc6-4b10-4b89-a6bc-a411af6ca4c8'
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$URLLog$1.run(EditLogFileInputStream.java:472)
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$URLLog$1.run(EditLogFileInputStream.java:460)
        at java.security.AccessController.doPrivileged(AccessController.java:369)
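
The 403 response says the JournalNode reports namespaceId 0 and an empty clusterId, while the NN expects 304881993 and CID-652befc6-4b10-4b89-a6bc-a411af6ca4c8. One way to confirm the mismatch on disk is to compare the VERSION files of the two storage directories. The following is only a sketch: the helper names version_field and compare_storage_ids are made up for illustration, and it assumes both VERSION files use plain key=value lines with namespaceID and clusterID keys.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Hadoop) for comparing storage IDs.

# Print the value of one key from a VERSION file made of key=value lines.
version_field() {
  local file="$1" key="$2"
  sed -n "s/^${key}=//p" "$file"
}

# Compare namespaceID and clusterID between two VERSION files.
# Returns 0 when both keys match, 1 otherwise.
compare_storage_ids() {
  local nn_version="$1" jn_version="$2"
  local key nn_val jn_val status=0
  for key in namespaceID clusterID; do
    nn_val="$(version_field "$nn_version" "$key")"
    jn_val="$(version_field "$jn_version" "$key")"
    if [ "$nn_val" = "$jn_val" ]; then
      echo "$key matches: $nn_val"
    else
      echo "$key differs: NN='$nn_val' JN='$jn_val'"
      status=1
    fi
  done
  return "$status"
}
```

It could be called as compare_storage_ids /hadoop/hdfs/name/current/VERSION JN_DIR/BICluster/current/VERSION, where JN_DIR stands for whatever dfs.journalnode.edits.dir is set to on the JournalNode host. That JN path layout is an assumption and should be checked against the actual directory.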


2015-01-26 10:26 GMT+08:00 sam liu <samliuhadoop@gmail.com>:

> Could any expert please help answer the questions?
>
> Thanks in advance!
>
> 2015-01-24 21:31 GMT+08:00 sam liu <samliuhadoop@gmail.com>:
>
>> Hi Experts,
>>
>> I have questions on rollback/upgrade HDFS with QJM HA enabled.
>>
>> On the website
>> http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#HDFS_UpgradeFinalizationRollback_with_HA_Enabled,
>> it says:
>> 'To perform a rollback of an upgrade, both NNs should first be shut down.
>> The operator should run the roll back command on the NN where they
>> initiated the upgrade procedure, which will perform the rollback on the
>> local dirs there, as well as on the shared log, either NFS or on the JNs.
>> Afterward, this NN should be started and the operator should run
>> `-bootstrapStandby' on the other NN to bring the two NNs in sync with this
>> rolled-back file system state.'
>>
>> Currently I expect the steps to be (please correct me if I am wrong):
>> NN1 -> hadoop namenode -rollback
>> NN1 -> hadoop namenode // In our env, this rolled-back namenode shuts down
>> right after it finishes -rollback, so it needs to be started again.
>> NN2 -> hadoop namenode -bootstrapStandby
>> hadoop datanode -rollback // on all datanodes
>>
>> [Question 1]:
>> One thing I don't know is when the JournalNodes should be started and/or
>> stopped. It seems like they should be started for the hadoop namenode
>> -rollback. Should they be restarted sometime?
>>
>> [Question 2]:
>> Another issue happens after the upgrade and before the rollback starts:
>> the standby NN process uses a lot of CPU and somehow eats up disk space
>> (without the disk space actually being used). This was causing "No space
>> left on device" errors during the rollback process. As soon as I killed
>> the namenode process, the disk space immediately returned to a reasonable
>> amount.
>> What might cause the NN process to consume so much disk space in such a
>> hidden way?
>>
>> Thanks!
>>
>
>
