hadoop-common-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-hadoop Wiki] Update of "Hadoop 0.14 Upgrade" by RaghuAngadi
Date Tue, 21 Aug 2007 21:46:29 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by RaghuAngadi:
http://wiki.apache.org/lucene-hadoop/Hadoop_0%2e14_Upgrade

The comment on the change is:
First Version of upgrade guide for 0.14

------------------------------------------------------------------------------
  = Upgrade Guide for Hadoop-0.14 =
- 
- '''XXX This document is still under development'''. Should be complete by end of Aug 21st.
  
  This page describes upgrade information that is specific to Hadoop-0.14. The usual upgrade
procedure described in [:Hadoop_Upgrade: Hadoop Upgrade page] still applies to Hadoop-0.14. 
  
  == Brief Upgrade Procedure ==
  
- In most cases, upgrade to Hadoop-0.14 completes without any problems. In these case, administrators
do not need to rest of the sections in this document. The simple upgrade steps are same as
listed in [:Hadoop_Upgrade:Hadoop Upgrade]:
+ In most cases, an upgrade to Hadoop-0.14 completes without any problems. In these cases,
administrators do not need to be familiar with the rest of the sections in this document. The
simple upgrade steps are the same as those listed in [:Hadoop_Upgrade:Hadoop Upgrade]; a shell
sketch of these steps follows the list:
   
     1. If you are running Hadoop-0.13.x, make sure the cluster is finalized.   
     1. Stop map-reduce cluster(s) and all client applications running on the DFS cluster.
@@ -16, +14 @@

    1. Install the new version of the Hadoop software.
    1. Start the DFS cluster with the {{{-upgrade}}} option.
    1. Wait for the cluster upgrade to complete.
-    6. Start map-reduce cluster.
+    1. Start map-reduce cluster.
-    7. Verify the components run properly and finalize the upgrade when convinced.
+    1. Verify the components run properly and finalize the upgrade when convinced.
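+ 
+ As a concrete illustration, these steps map roughly to the commands below. This is only a
sketch: it assumes a standard tarball installation with the usual start/stop scripts, and that
all client applications are stopped before DFS is: {{{
+ $ bin/stop-mapred.sh                            # stop the map-reduce cluster
+ $ bin/stop-dfs.sh                               # stop DFS once clients are done
+ # ... install the new Hadoop-0.14 software on all nodes ...
+ $ bin/start-dfs.sh -upgrade                     # start DFS with the -upgrade option
+ $ bin/hadoop dfsadmin -upgradeProgress status   # repeat until the upgrade completes
+ $ bin/start-mapred.sh                           # start the map-reduce cluster
+ $ bin/hadoop dfsadmin -finalizeUpgrade          # finalize when convinced all is well
+ }}}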
  
  The rest of the document describes what happens once the cluster is started with the {{{-upgrade}}}
option.
  
@@ -28, +26 @@

  Depending on the number of blocks and files in HDFS, the upgrade can take anywhere from
a few minutes to a few hours.
  
  There are three stages in this upgrade:
-  1. '''SafeMode''' : Similar to normal restart of the cluster, namenode waits for datanodes
in the cluster to report their blocks. The cluster may wait in the state for a long time if
some of the datanodes do not report their blocks. 
+  1. '''Safe Mode''' : Similar to a normal restart of the cluster, the namenode waits for the
datanodes in the cluster to report their blocks. The cluster may wait in this state for a long
time if some of the datanodes do not report their blocks. 
   1. '''Datanode Upgrade''' : Once most of the blocks are reported, the namenode asks the
registered datanodes to start their local upgrade. The namenode waits for ''all'' the datanodes
to complete their upgrade.
   1. '''Deleting {{{.crc}}} files''' : The namenode deletes the {{{.crc}}} files that were previously
used for storing checksums.
  
- === Monitoring the Upgrade ===
+ == Monitoring the Upgrade ==
  
  The cluster stays in ''safeMode'' until the upgrade is complete. The HDFS webui is a good place
to check whether safeMode is on or off. As always, log files from the ''namenode'' and ''datanodes''
are useful when nothing else helps.
  
- Once the cluster is started with {{{-upgrade}}} option, the simplest way to monitor the
upgrade is with '{{{dfsadmin -upgradeProgress status}}}' command. A typical output from this
command looks like this: {{{
+ Once the cluster is started with the {{{-upgrade}}} option, the simplest way to monitor the
upgrade is with the '{{{dfsadmin -upgradeProgress status}}}' command. 
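+ 
+ To keep an eye on the progress, the status command can simply be re-run periodically. A minimal
sketch (a plain shell loop, not a Hadoop feature; adjust the interval to taste): {{{
+ $ while true; do bin/hadoop dfsadmin -upgradeProgress status; sleep 60; done
+ }}}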
+ 
+ === First Stage : Safe Mode ===
+ 
+ The actual Block CRC upgrade starts after all or most of the datanodes have reported their
blocks. {{{ 
+ $ bin/hadoop dfsadmin -upgradeProgress status
+ Distributed upgrade for version -6 is in progress. Status = 0%
+ 
+         Upgrade has not been started yet.
+         Last Block Level Stats updated at : Thu Jan 01 00:00:00 UTC 1970
+         ....
+ }}}
+ The message {{{Upgrade has not been started yet}}} indicates that the namenode is in the first
stage; a ''status'' of 0% usually means it is still in this stage. If some datanodes do not start,
check the HDFS webui to see which datanodes are listed in the ''Dead Nodes'' table.
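+ 
+ The safe mode state can also be checked from the command line. A small sketch, assuming the
{{{dfsadmin -safemode}}} option is available in your build (the exact output wording may differ): {{{
+ $ bin/hadoop dfsadmin -safemode get
+ Safe mode is ON
+ }}}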
+  
+ === Second Stage : Datanode Upgrade ===
+ 
+ During this stage, typical output from the {{{upgradeProgress}}} command looks like this: {{{
  $ bin/hadoop dfsadmin -upgradeProgress status
  Distributed upgrade for version -6 is in progress. Status = 78%
  
@@ -59, +73 @@

     * {{{Un-upgraded}}} : blocks with zero upgraded replicas.
  * {{{Brief Datanode Status}}} : Each datanode reports its progress to the namenode during
the upgrade. This shows the average percent completion across all the datanodes, and also how
many datanodes have completed their upgrade. For the upgrade to proceed to the next stage,
all the datanodes should report completion of their local upgrade.
  
+ Note that in some cases a few blocks might be ''over-replicated''; in such cases, the upgrade
might proceed to the next stage even if some of the datanodes have not completed their upgrade.
If {{{Fully Upgraded}}} is calculated to be 100%, the namenode will proceed to the next stage.
+ 
+ ==== Potential Problems during Second Stage ====
+  * ''The upgrade might seem to be stuck'' : Each datanode reports its progress once every
minute. If the percent completion does not change even after a few minutes, some datanodes
might have unexpected problems. Use the {{{details}}} option of the {{{-upgradeProgress}}}
command to check which datanodes seem stagnant. {{{
+ $ bin/hadoop dfsadmin -upgradeProgress details
+ Distributed upgrade for version -6 is in progress. Status = 72%
+ 
+         Last Block Level Stats updated at : Thu Jan 01 00:00:00 UTC 1970
+         Last Block Level Stats : Total Blocks : 0
+                                  Fully Upgragraded : 0.00%
+                                  Minimally Upgraded : 0.00%
+                                  Under Upgraded : 0.00% (includes Un-upgraded blocks)
+                                  Un-upgraded : 0.00%
+                                  Errors : 0
+         Brief Datanode Status  : Avg completion of all Datanodes: 81.90% with 0 errors.
+                                  352 out of 893 nodes are not done.
+ 
+         Datanode Stats (total: 893): pct Completion(%) blocks upgraded (u) blocks remaining
(r) errors (e)
+ 
+                 192.168.0.31:50010        : 54 %         2136 u  1804 r  0 e
+                 192.168.0.136:50010       : 73 %         3074 u  1085 r  0 e
+                 192.168.0.24:50010        : 50 %         2044 u  1999 r  0 e
+                 192.168.0.214:50010       : 100 %        4678 u  0 r     0 e
+                 ...
+ }}} You can run this command through '{{{grep -v "100 %"}}}' to find the nodes that have
not completed their upgrade (see the sketch after the next listing). If the problem nodes
cannot be corrected, as a last resort you can check ''Block Level Stats'' to see whether the
upgrade can be ''forced'' to the next stage. E.g. if 98% of the blocks are fully upgraded and
2% are minimally upgraded, then you can be reasonably sure that at least one copy of every
block is upgraded. You can force the next stage with the {{{force}}} option: {{{
+ $ bin/hadoop dfsadmin -upgradeProgress force
+ Distributed upgrade for version -6 is in progress. Status = 90%
+ 
+         Force Proceed is ON
+         Last Block Level Stats updated at : Mon Aug 13 22:43:31 UTC 2007
+         Last Block Level Stats : Total Blocks : 1054713
+                                  Fully Upgragraded : 99.40%
+                                  Minimally Upgraded : 0.60%
+                                  Under Upgraded : 0.00% (includes Un-upgraded blocks)
+                                  Un-upgraded : 0.00%
+                                  Errors : 0
+         Brief Datanode Status  : Avg completion of all Datanodes: 99.89% with 0 errors.
+                                  1 out of 893 nodes are not done.
+         NOTE: Upgrade at the Datanodes has finished. Deleteing ".crc" files
+         can take longer than status implies.   
+ }}} Note {{{Force Proceed is ON}}} in the status message.
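+ 
+ As mentioned above, filtering the {{{details}}} output is a quick way to list only the
stragglers. A minimal sketch; the first pattern below merely guesses at the layout of the
per-datanode lines shown earlier, so adjust it to match your actual output: {{{
+ $ bin/hadoop dfsadmin -upgradeProgress details | grep " r " | grep -v "100 %"
+ }}}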
+ 
+ === Third Stage : Deleting {{{.crc}}} files ===
+ Once the second stage is complete, the Namenode reports 90% completion. It does not have a
very good way of estimating the time required for deleting the files, so ''status'' reports
90% completion all through this stage. Later tests with a larger number of files indicate that
it takes about one hour to delete 2 million files on a rack server. The upgrade status report
looks like the following: {{{
+ $ bin/hadoop dfsadmin -upgradeProgress status
+ Distributed upgrade for version -6 is in progress. Status = 90%
+ 
+         Last Block Level Stats updated at : Mon Aug 20 20:24:56 UTC 2007
+         Last Block Level Stats : Total Blocks : 11604180
+                                  Fully Upgragraded : 100.00%
+                                  Minimally Upgraded : 0.00%
+                                  Under Upgraded : 0.00% (includes Un-upgraded blocks)
+                                  Un-upgraded : 0.00%
+                                  Errors : 0
+         Brief Datanode Status  : Avg completion of all Datanodes: 100.00% with 0 errors.
+         NOTE: Upgrade at the Datanodes has finished. Deleteing ".crc" files
+         can take longer than status implies.
+ }}} Note the last two lines, which indicate that the Namenode is currently deleting {{{.crc}}} files.
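+ 
+ If you want a rough sense of how far the deletion has progressed, one possible trick is to
count the remaining {{{.crc}}} files, assuming they appear in a recursive listing. This is
only an illustration; on a namespace with millions of files the listing itself is expensive: {{{
+ $ bin/hadoop dfs -lsr / | grep -c "\.crc"
+ }}}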
+ 
+ === Upgrade is Finally Complete ===
+ Once the upgrade is complete, ''safeMode'' will be turned off and HDFS runs normally. There
is no need to restart the cluster. Now enjoy the new and shiny Hadoop with a leaner Namenode.
{{{
+ $ bin/hadoop dfsadmin -upgradeProgress status
+ There are no distributed upgrades in progress.
+ }}}
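+ 
+ Once you are convinced that everything runs properly, finalize the upgrade as described in
[:Hadoop_Upgrade:Hadoop Upgrade]. Note that finalizing cannot be undone, so verify first: {{{
+ $ bin/hadoop dfsadmin -finalizeUpgrade
+ }}}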
+ 
+ === Memory requirements ===
+ 
+ HDFS nodes do not require more memory during the upgrade than for normal operation. We
observed that the Namenode might use 5-10% more memory (or more GC in the JVM) during the
upgrade. If the namenode was already operating at the edge of its memory limits, it could
potentially run into problems during the upgrade. At any time, the cluster can be restarted
and HDFS will resume the upgrade.
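+ 
+ If the namenode heap is a concern, it can be raised before the upgrade through the standard
{{{conf/hadoop-env.sh}}} mechanism. A sketch; the value below is only an example, in MB: {{{
+ # in conf/hadoop-env.sh on the namenode
+ export HADOOP_HEAPSIZE=2000
+ }}}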
+ 
+ === Restarting a cluster ===
+ 
+ The cluster can be restarted during any stage of the upgrade and it will resume the upgrade.
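+ 
+ For example, a plain restart of the DFS cluster is enough; since the upgrade resumes on
restart, no extra flag is shown here (a sketch, assuming the standard start/stop scripts): {{{
+ $ bin/stop-dfs.sh
+ $ bin/start-dfs.sh
+ $ bin/hadoop dfsadmin -upgradeProgress status
+ }}}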
+ 
+ === Analyzing Log Files ===
+ 
+ As a last resort while diagnosing problems, administrators can look at the logs on the
Namenode and Datanodes. Listing all the relevant log messages here would be information
overload. Of course, developers would most appreciate it if the relevant logs are attached
when reporting problems with the upgrade, along with the output from the {{{-upgradeProgress}}}
command.
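+ 
+ A simple way to pull out the upgrade-related messages is to grep the daemon logs. A sketch,
assuming the default log location and naming under {{{logs/}}}; your installation may differ: {{{
+ $ grep -i upgrade logs/hadoop-*-namenode-*.log | tail
+ $ grep -i upgrade logs/hadoop-*-datanode-*.log | tail
+ }}}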
+ 
+ Some of the warnings in the log files are expected during the upgrade. For example, during
the upgrade, datanodes fetch checksum data located on their peers. These data transfers use the
new protocols in Hadoop-0.14, which require checksum data to be present along with the block
data. Since the checksum data is not yet located next to the block, you will see the following
warning in the datanode logs: {{{
+ 2007-08-18 07:17:38,698 WARN org.apache.hadoop.dfs.DataNode: Could not find metadata file
for blk_2214836660875523305
+ }}}
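+ 
+ Such warnings are harmless by themselves. To confirm that a flood of warnings is only of this
expected kind, a quick count can help (a sketch, with the same assumptions about log location
as above): {{{
+ $ grep -c "Could not find metadata file" logs/hadoop-*-datanode-*.log
+ }}}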
+ 
