hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: what will happen if a backup name node folder becomes unaccessible?
Date Fri, 27 Aug 2010 23:57:55 GMT
On Tue, Aug 24, 2010 at 7:59 PM, Sudhir Vallamkondu
<Sudhir.Vallamkondu@icrossing.com> wrote:
> The cloudera distribution seems to keep working fine when a dfs.name.dir
> directory becomes inaccessible while the namenode is running.
>
> See below
>
> hadoop@training-vm:~$ hadoop version
> Hadoop 0.20.1+152
> Subversion  -r c15291d10caa19c2355f437936c7678d537adf94
> Compiled by root on Mon Nov  2 05:15:37 UTC 2009
>
> hadoop@training-vm:~$ jps
> 8923 Jps
> 8548 JobTracker
> 8467 SecondaryNameNode
> 8250 NameNode
> 8357 DataNode
> 8642 TaskTracker
>
> hadoop@training-vm:~$ /usr/lib/hadoop/bin/stop-all.sh
> stopping jobtracker
> localhost: stopping tasktracker
> stopping namenode
> localhost: stopping datanode
> localhost: stopping secondarynamenode
>
> hadoop@training-vm:~$ mkdir edit_log_dir1
>
> hadoop@training-vm:~$ mkdir edit_log_dir2
>
> hadoop@training-vm:~$ ls
> edit_log_dir1  edit_log_dir2
>
> hadoop@training-vm:~$ ls -ltr /var/lib/hadoop-0.20/cache/hadoop/dfs/name
> total 8
> drwxr-xr-x 2 hadoop hadoop 4096 2009-10-15 16:17 image
> drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 15:56 current
>
> hadoop@training-vm:~$ cp -r /var/lib/hadoop-0.20/cache/hadoop/dfs/name
> edit_log_dir1
>
> hadoop@training-vm:~$ cp -r /var/lib/hadoop-0.20/cache/hadoop/dfs/name
> edit_log_dir2
>
> ------ hdfs-site.xml added new dirs
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <configuration>
>  <property>
>    <name>dfs.replication</name>
>    <value>1</value>
>  </property>
>  <property>
>     <name>dfs.permissions</name>
>     <value>false</value>
>  </property>
>  <property>
>     <!-- specify this so that running 'hadoop namenode -format' formats the
> right dir -->
>     <name>dfs.name.dir</name>
> <value>/var/lib/hadoop-0.20/cache/hadoop/dfs/name,/home/hadoop/edit_log_dir1,/home/hadoop/edit_log_dir2</value>
>  </property>
>   <property>
>     <name>fs.checkpoint.period</name>
>     <value>600</value>
>  </property>
>  <property>
>    <name>dfs.namenode.plugins</name>
>    <value>org.apache.hadoop.thriftfs.NamenodePlugin</value>
>  </property>
>  <property>
>    <name>dfs.datanode.plugins</name>
>    <value>org.apache.hadoop.thriftfs.DatanodePlugin</value>
>  </property>
>  <property>
>    <name>dfs.thrift.address</name>
>    <value>0.0.0.0:9090</value>
>  </property>
> </configuration>
>
> ---- start all daemons
>
> hadoop@training-vm:~$ /usr/lib/hadoop/bin/start-all.sh
> starting namenode, logging to
> /usr/lib/hadoop/bin/../logs/hadoop-hadoop-namenode-training-vm.out
> localhost: starting datanode, logging to
> /usr/lib/hadoop/bin/../logs/hadoop-hadoop-datanode-training-vm.out
> localhost: starting secondarynamenode, logging to
> /usr/lib/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-training-vm.out
> starting jobtracker, logging to
> /usr/lib/hadoop/bin/../logs/hadoop-hadoop-jobtracker-training-vm.out
> localhost: starting tasktracker, logging to
> /usr/lib/hadoop/bin/../logs/hadoop-hadoop-tasktracker-training-vm.out
>
>
> -------- namenode log confirms all dirs are in use
>
> 2010-08-24 16:20:48,718 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG:   host = training-vm/127.0.0.1
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.20.1+152
> STARTUP_MSG:   build =  -r c15291d10caa19c2355f437936c7678d537adf94;
> compiled by 'root' on Mon Nov  2 05:15:37 UTC 2009
> ************************************************************/
> 2010-08-24 16:20:48,815 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
> Initializing RPC Metrics with hostName=NameNode, port=8022
> 2010-08-24 16:20:48,819 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
> localhost/127.0.0.1:8022
> 2010-08-24 16:20:48,821 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=NameNode, sessionId=null
> 2010-08-24 16:20:48,822 INFO
> org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing
> NameNodeMeterics using context
> object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
> 2010-08-24 16:20:48,894 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
> 2010-08-24 16:20:48,894 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
> 2010-08-24 16:20:48,894 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> isPermissionEnabled=false
> 2010-08-24 16:20:48,903 INFO
> org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
> Initializing FSNamesystemMetrics using context
> object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
> 2010-08-24 16:20:48,905 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
> FSNamesystemStatusMBean
> 2010-08-24 16:20:48,937 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Storage directory /home/hadoop/edit_log_dir1 is not formatted.
> 2010-08-24 16:20:48,937 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Formatting ...
> 2010-08-24 16:20:48,937 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Storage directory /home/hadoop/edit_log_dir2 is not formatted.
> 2010-08-24 16:20:48,937 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Formatting ...
> 2010-08-24 16:20:48,938 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Number of files = 41
> 2010-08-24 16:20:48,947 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Number of files under construction = 0
> 2010-08-24 16:20:48,947 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 4357 loaded in 0 seconds.
>
> ---- directories confirmed in use
>
> hadoop@training-vm:~$ ls -ltr edit_log_dir1
> total 12
> drwxr-xr-x 4 hadoop hadoop 4096 2010-08-24 16:01 name
> -rw-r--r-- 1 hadoop hadoop    0 2010-08-24 16:20 in_use.lock
> drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 16:20 image
> drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 16:20 current
>
> hadoop@training-vm:~$ ls -ltr edit_log_dir2
> total 12
> drwxr-xr-x 4 hadoop hadoop 4096 2010-08-24 16:01 name
> -rw-r--r-- 1 hadoop hadoop    0 2010-08-24 16:20 in_use.lock
> drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 16:20 image
> drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 16:20 current
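>
> All three copies should stay byte-identical. A quick sanity check (my own
> illustration here, not part of the run above) is to checksum the image
> files and confirm the sums match:
>
> hadoop@training-vm:~$ md5sum \
>     /var/lib/hadoop-0.20/cache/hadoop/dfs/name/current/fsimage \
>     /home/hadoop/edit_log_dir1/current/fsimage \
>     /home/hadoop/edit_log_dir2/current/fsimage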
>
> ----- secondary namenode checkpoint worked fine
>
> 2010-08-24 16:27:10,555 INFO
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file
> fsimage size 4357 bytes.
> 2010-08-24 16:27:10,557 INFO
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file
> edits size 895 bytes.
> 2010-08-24 16:27:10,603 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
> 2010-08-24 16:27:10,603 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
> 2010-08-24 16:27:10,603 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> isPermissionEnabled=false
> 2010-08-24 16:27:10,622 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Number of files = 41
> 2010-08-24 16:27:10,629 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Number of files under construction = 0
> 2010-08-24 16:27:10,635 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Edits file /var/lib/hadoop-0.20/cache/hadoop/dfs/namesecondary/current/edits
> of size 895 edits # 10 loaded in 0 seconds.
> 2010-08-24 16:27:10,658 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 4461 saved in 0 seconds.
> 2010-08-24 16:27:10,745 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions:
> 0 Total time for transactions(ms): 0Number of transactions batched in Syncs:
> 0 Number of syncs: 0 SyncTimes(ms): 0
> 2010-08-24 16:27:10,756 INFO
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
> localhost:50070putimage=1&port=50090&machine=127.0.0.1&token=-18:1431678956:
> 1255648991179:1282692430000:1282692049090
> 2010-08-24 16:27:11,008 WARN
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done.
> New Image Size: 4461
>
> --- directory put works fine
>
> hadoop@training-vm:~$ hadoop fs -ls /user/training
> Found 3 items
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:18
> /user/training/grep_output
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:14
> /user/training/input
> drwxr-xr-x   - training supergroup          0 2010-06-30 15:30
> /user/training/output
>
> hadoop@training-vm:~$ hadoop fs -put
> /etc/hadoop/conf.with-desktop/hdfs-site.xml /user/training
>
> hadoop@training-vm:~$ hadoop fs -ls /user/training
> Found 4 items
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:18
> /user/training/grep_output
> -rw-r--r--   1 hadoop   supergroup        987 2010-08-24 16:25
> /user/training/hdfs-site.xml
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:14
> /user/training/input
> drwxr-xr-x   - training supergroup          0 2010-06-30 15:30
> /user/training/output
>
>
> ------ delete one of the directories
> hadoop@training-vm:~$ rm -rf edit_log_dir2
>
> hadoop@training-vm:~$ ls -ltr
> total 4
> drwxr-xr-x 5 hadoop hadoop 4096 2010-08-24 16:20 edit_log_dir1
>
> -- namenode logs
>
> No errors or warnings in the logs.
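>
> (Checked with something along these lines -- my own command here, assuming
> the default .log file name under the log dir from the start-up output:
>
> hadoop@training-vm:~$ grep -iE 'warn|error|fatal' \
>     /usr/lib/hadoop/logs/hadoop-hadoop-namenode-training-vm.log
>
> which came back empty.)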
>
> -------- namenode still running
>
> hadoop@training-vm:~$ jps
> 12426 NameNode
> 12647 SecondaryNameNode
> 12730 JobTracker
> 14090 Jps
> 12535 DataNode
> 12826 TaskTracker
>
> ----  puts and ls work fine
>
> hadoop@training-vm:~$ hadoop fs -ls /user/training
> Found 4 items
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:18
> /user/training/grep_output
> -rw-r--r--   1 hadoop   supergroup        987 2010-08-24 16:25
> /user/training/hdfs-site.xml
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:14
> /user/training/input
> drwxr-xr-x   - training supergroup          0 2010-06-30 15:30
> /user/training/output
>
> hadoop@training-vm:~$ hadoop fs -put
> /etc/hadoop/conf.with-desktop/core-site.xml /user/training
>
> hadoop@training-vm:~$ hadoop fs -put
> /etc/hadoop/conf.with-desktop/mapred-site.xml /user/training
>
> hadoop@training-vm:~$ hadoop fs -ls /user/training
> Found 6 items
> -rw-r--r--   1 hadoop   supergroup        338 2010-08-24 16:28
> /user/training/core-site.xml
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:18
> /user/training/grep_output
> -rw-r--r--   1 hadoop   supergroup        987 2010-08-24 16:25
> /user/training/hdfs-site.xml
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:14
> /user/training/input
> -rw-r--r--   1 hadoop   supergroup        454 2010-08-24 16:29
> /user/training/mapred-site.xml
> drwxr-xr-x   - training supergroup          0 2010-06-30 15:30
> /user/training/output
>
> ------- secondary namenode checkpoint is successful
>
> 2010-08-24 16:37:11,455 WARN
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done.
> New Image Size: 4671
> ....
> 2010-08-24 16:47:11,884 WARN
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done.
> New Image Size: 4671
> ...
> 2010-08-24 16:57:12,264 WARN
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done.
> New Image Size: 4671
>
> ------- after 30 mins
>
> hadoop@training-vm:~$ jps
> 12426 NameNode
> 12647 SecondaryNameNode
> 12730 JobTracker
> 16256 Jps
> 12535 DataNode
> 12826 TaskTracker
>
>
>
>
>
>
>
>
> On Aug/24/ 11:05 AM, "common-user-digest-help@hadoop.apache.org"
> <common-user-digest-help@hadoop.apache.org> wrote:
>
>> From: jiang licht <licht_jiang@yahoo.com>
>> Date: Tue, 24 Aug 2010 10:38:32 -0700 (PDT)
>> To: <common-user@hadoop.apache.org>
>> Subject: Re: what will happen if a backup name node folder becomes
>> unaccessible?
>>
>> Sudhir,
>>
>> Look forward to your results, if possible with different CDH releases.
>>
>> Thanks,
>>
>> Michael
>>
>> --- On Tue, 8/24/10, Sudhir Vallamkondu <Sudhir.Vallamkondu@icrossing.com>
>> wrote:
>>
>> From: Sudhir Vallamkondu <Sudhir.Vallamkondu@icrossing.com>
>> Subject: Re: what will happen if a backup name node folder becomes
>> unaccessible?
>> To: common-user@hadoop.apache.org
>> Date: Tuesday, August 24, 2010, 10:47 AM
>>
>> Harsh
>>
>> You seem to be getting an "all storage directories inaccessible" error.
>> Strange, because per the code that error is thrown only when all dirs are
>> inaccessible. In any case, I will test it on the cloudera distribution
>> today and publish results.
>>
>> - Sudhir
>>
>>
>> On Aug/24/ 5:08 AM, "common-user-digest-help@hadoop.apache.org"
>> <common-user-digest-help@hadoop.apache.org> wrote:
>>
>>> From: Harsh J <qwertymaniac@gmail.com>
>>> Date: Tue, 24 Aug 2010 11:41:48 +0530
>>> To: <common-user@hadoop.apache.org>
>>> Subject: Re: common-user Digest 23 Aug 2010 21:21:26 -0000 Issue 1518
>>>
>>> Hello Sudhir,
>>>
>>> You're right about this, but I don't seem to be getting the warning for the
>>> edit log IOException at all in the first place. Here are my steps to get to
>>> what I described earlier (note that I am just using two directories on the
>>> same disk, not two different devices, nfs, etc.). It's my personal computer,
>>> so I don't mind doing this again for now (as the other directory remains
>>> untouched).
>>>
>>> hadoop 11:13:00 ~/.hadoop $ jps
>>> 4954 SecondaryNameNode
>>> 5911 Jps
>>> 5158 TaskTracker
>>> 4592 NameNode
>>> 5650 JobTracker
>>> 4768 DataNode
>>>
>>> hadoop 11:13:02 ~/.hadoop $ hadoop dfs -ls
>>> Found 2 items
>>> -rw-r--r--   1 hadoop supergroup     411536 2010-08-18 15:50 /user/hadoop/data
>>> drwxr-xr-x   - hadoop supergroup          0 2010-08-18 16:02 /user/hadoop/dataout
>>> hadoop 11:13:07 ~/.hadoop $ tail -n 10 conf/hdfs-site.xml
>>>   <property>
>>>     <name>dfs.name.dir</name>
>>>     <value>/home/hadoop/.dfs/name,/home/hadoop/.dfs/testdir</value>
>>>     <final>true</final>
>>>   </property>
>>>   <property>
>>>     <name>dfs.datanode.max.xcievers</name>
>>>     <value>2047</value>
>>>   </property>
>>> </configuration>
>>>
>>> hadoop 11:13:25 ~/.hadoop $ ls ~/.dfs/
>>> data  name  testdir
>>>
>>> hadoop 11:13:36 ~/.hadoop $ rm -r ~/.dfs/testdir
>>>
>>> hadoop 11:13:49 ~/.hadoop $ jps
>>> 6135 Jps
>>> 4954 SecondaryNameNode
>>> 5158 TaskTracker
>>> 4592 NameNode
>>> 5650 JobTracker
>>> 4768 DataNode
>>>
>>> hadoop 11:13:56 ~/.hadoop $ hadoop dfs -put /etc/profile profile1
>>> hadoop 11:14:10 ~/.hadoop $ hadoop dfs -put /etc/profile profile2
>>> hadoop 11:14:12 ~/.hadoop $ hadoop dfs -put /etc/profile profile3
>>> hadoop 11:14:15 ~/.hadoop $ hadoop dfs -put /etc/profile profile4
>>>
>>> hadoop 11:17:21 ~/.hadoop $ jps
>>> 4954 SecondaryNameNode
>>> 5158 TaskTracker
>>> 4592 NameNode
>>> 5650 JobTracker
>>> 4768 DataNode
>>> 6954 Jps
>>>
>>> hadoop 11:17:23 ~/.hadoop $ tail -f hadoop-0.20.2/logs/hadoop-hadoop-namenode-hadoop.log
>>> 2010-08-24 11:14:17,632 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
>>> NameSystem.allocateBlock: /user/hadoop/profile4. blk_28644972299224370_1019
>>>
>>> 2010-08-24 11:14:17,709 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
>>> NameSystem.addStoredBlock: blockMap updated: 192.168.1.8:50010 is added to
>>> blk_28644972299224370_1019 size 497
>>> 2010-08-24 11:14:17,713 INFO org.apache.hadoop.hdfs.StateChange: DIR*
>>> NameSystem.completeFile: file /user/hadoop/profile4 is closed by
>>> DFSClient_-2054565417
>>> 2010-08-24 11:17:31,187 INFO
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from
>>> 192.168.1.8
>>>
>>> 2010-08-24 11:17:31,187 INFO
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions:
>>> 19 Total time for transactions(ms): 4Number of transactions batched in
>>> Syncs: 0 Number of syncs: 14 SyncTimes(ms): 183 174
>>>
>>> 2010-08-24 11:17:31,281 FATAL
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All
>>> storage directories are inaccessible.
>>>
>>> 2010-08-24 11:17:31,283 INFO
>>> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>>> /************************************************************
>>>
>>> SHUTDOWN_MSG: Shutting down NameNode at hadoop.cf.net/127.0.0.1
>>>
>>> ************************************************************/
>>>
>>> ^C
>>> hadoop 11:17:51 ~/.hadoop $ ls /home/hadoop/.dfs/
>>> data  name
>>>
>>> hadoop 11:21:14 ~/.hadoop $ jps
>>> 8259 Jps
>>> 4954 SecondaryNameNode
>>> 5158 TaskTracker
>>> 5650 JobTracker
>>> 4768 DataNode
>>>
>>> hadoop 11:36:03 ~/.hadoop $ mkdir ~/.dfs/testdir
>>> hadoop 11:36:04 ~/.hadoop $ stop-all.sh
>>> stopping jobtracker
>>> localhost: stopping tasktracker
>>> no namenode to stop
>>> localhost: stopping datanode
>>> localhost: stopping secondarynamenode
>>>
>>> hadoop 11:37:01 ~/.hadoop $ start-all.sh
>>> starting namenode, logging to /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-namenode-hadoop.out
>>> localhost: starting datanode, logging to /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-datanode-hadoop.out
>>> localhost: starting secondarynamenode, logging to /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-secondarynamenode-hadoop.out
>>> starting jobtracker, logging to /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-jobtracker-hadoop.out
>>> localhost: starting tasktracker, logging to /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-tasktracker-hadoop.out
>>> hadoop 11:39:30 ~/.hadoop $ hadoop dfs -ls
>>> Found 6 items
>>> -rw-r--r--   1 hadoop supergroup     411536 2010-08-18 15:50 /user/hadoop/data
>>> drwxr-xr-x   - hadoop supergroup          0 2010-08-18 16:02 /user/hadoop/dataout
>>> -rw-r--r--   1 hadoop supergroup        497 2010-08-24 11:14 /user/hadoop/profile1
>>> -rw-r--r--   1 hadoop supergroup        497 2010-08-24 11:14 /user/hadoop/profile2
>>> -rw-r--r--   1 hadoop supergroup        497 2010-08-24 11:14 /user/hadoop/profile3
>>> -rw-r--r--   1 hadoop supergroup        497 2010-08-24 11:14 /user/hadoop/profile4
>>>
>>>
>>>
>>> On Tue, Aug 24, 2010 at 10:49 AM, Sudhir Vallamkondu <
>>> Sudhir.Vallamkondu@icrossing.com> wrote:
>>>>> Looking at the codebase, it seems that it ignores an editlog storage
>>>>> directory if it encounters an error:
>>>>>
>>>>>
>>>>> http://www.google.com/codesearch/p?hl=en#GLh8vwsjDqs/trunk/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java&q=namenode%20editlog&sa=N&cd=20&ct=rc
>>>>>
>>>>> Check lines:
>>>>> Code in line 334
>>>>> Comment: 387 - 390
>>>>> Comment: 411 - 414
>>>>> Comment: 433 - 436
>>>>>
>>>>> The processIOError method is called throughout the code if it encounters
>>>>> an IOException.
>>>>>
>>>>> A fatal error is only thrown if none of the storage directories is
>>>>> accessible (lines 394, 420).
>>>>>
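>>>>> In rough outline the logic reads like this (my paraphrase for
>>>>> illustration, not a verbatim copy of the FSEditLog source):
>>>>>
>>>>> // FSEditLog.processIOError(int index), paraphrased:
>>>>> if (editStreams == null || editStreams.size() <= 1) {
>>>>>     FSNamesystem.LOG.fatal(
>>>>>         "Fatal Error : All storage directories are inaccessible.");
>>>>>     Runtime.getRuntime().exit(-1);  // no dirs left: namenode dies
>>>>> }
>>>>> editStreams.remove(index);          // otherwise just drop the bad dir
>>>>>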
>>>>> - Sudhir
>>>>>
>>>>>
>>>>>
>>>>> On Aug/23/ 2:21 PM, "common-user-digest-help@hadoop.apache.org"
>>>>> <common-user-digest-help@hadoop.apache.org> wrote:
>>>>>
>>>>>>> From: Michael Segel <michael_segel@hotmail.com>
>>>>>>> Date: Mon, 23 Aug 2010 14:05:05 -0500
>>>>>>> To: <common-user@hadoop.apache.org>
>>>>>>> Subject: RE: what will happen if a backup name node folder becomes
>>>>>>> unaccessible?
>>>>>>>
>>>>>>>
>>>>>>> Ok...
>>>>>>>
>>>>>>> Now you have me confused.
>>>>>>> Everything we've seen says that writing to both a local disk and to an
>>>>>>> NFS mounted disk would be the best way to prevent a problem.
>>>>>>>
>>>>>>> Now you and Harsh J say that this could actually be problematic.
>>>>>>>
>>>>>>> Which is it?
>>>>>>> Is this now a defect that should be addressed, or should we just not
>>>>>>> use an NFS mounted drive?
>>>>>>>
>>>>>>> Thx
>>>>>>>
>>>>>>> -Mike
>>>>>>>
>>>>>>>
>>>>>>>>> Date: Mon, 23 Aug 2010 11:42:59 -0700
>>>>>>>>> From: licht_jiang@yahoo.com
>>>>>>>>> Subject: Re: what will happen if a backup name node folder becomes
>>>>>>>>> unaccessible?
>>>>>>>>> To: common-user@hadoop.apache.org
>>>>>>>>>
>>>>>>>>> This makes a good argument. Actually, after seeing the previous
>>>>>>>>> reply, I am kind of convinced that I should go back to "syncing" the
>>>>>>>>> metadata to a backup location instead of using this feature, which,
>>>>>>>>> as David mentioned, introduces a 2nd single point of failure to
>>>>>>>>> hadoop and degrades its availability. BTW, we are using cloudera
>>>>>>>>> package hadoop-0.20.2+228. Can someone confirm whether a name node
>>>>>>>>> will shut down given that a backup folder listed in "dfs.name.dir"
>>>>>>>>> becomes unavailable in this version?
>>>>>>>>>
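>>>>>>>>> (By "syncing" I mean something along these lines, run from cron on
>>>>>>>>> the master -- an illustrative sketch only, paths as in the
>>>>>>>>> dfs.name.dir config quoted below:
>>>>>>>>>
>>>>>>>>> # every 10 minutes, mirror the live name dir to the backup location
>>>>>>>>> */10 * * * * rsync -a --delete /hadoop/dfs/name/ /hadoop-backup/dfs/name/
>>>>>>>>> )
>>>>>>>>>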
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>> --- On Sun, 8/22/10, David B. Ritch <david.ritch@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> From: David B. Ritch <david.ritch@gmail.com>
>>>>>>>>> Subject: Re: what will happen if a backup name node folder becomes
>>>>>>>>> unaccessible?
>>>>>>>>> To: common-user@hadoop.apache.org
>>>>>>>>> Date: Sunday, August 22, 2010, 11:34 PM
>>>>>>>>>
>>>>>>>>>   Which version of Hadoop was this?  The folks at Cloudera have
>>>>>>>>> assured me that the namenode in CDH2 will continue as long as one of
>>>>>>>>> the directories is still writable.
>>>>>>>>>
>>>>>>>>> It *does* seem a bit of a waste if an availability feature - the
>>>>>>>>> ability to write to multiple directories - actually reduces
>>>>>>>>> availability by providing an additional single point of failure.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> dbr
>>>>>>>>>
>>>>>>>>> On 8/20/2010 5:27 PM, Harsh J wrote:
>>>>>>>>>>> Whee, let's try it out:
>>>>>>>>>>>
>>>>>>>>>>> Start with both paths available. ... Starts fine.
>>>>>>>>>>> Store some files. ... Works.
>>>>>>>>>>> rm -r the second path. ... Ouch.
>>>>>>>>>>> Store some more files. ... Still works. [Because the SNN hasn't
>>>>>>>>>>> sent us stuff back yet]
>>>>>>>>>>> Wait for checkpoint to hit.
>>>>>>>>>>> And ...
>>>>>>>>>>> Boom!
>>>>>>>>>>>
>>>>>>>>>>> 2010-08-21 02:42:00,385 INFO
>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log
>>>>>>>>>>> from 127.0.0.1
>>>>>>>>>>> 2010-08-21 02:42:00,385 INFO
>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
>>>>>>>>>>> transactions: 37 Total time for transactions(ms): 6Number of
>>>>>>>>>>> transactions batched in Syncs: 0 Number of syncs: 26 SyncTimes(ms):
>>>>>>>>>>> 307 277
>>>>>>>>>>> 2010-08-21 02:42:00,439 FATAL
>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error :
>>>>>>>>>>> All storage directories are inaccessible.
>>>>>>>>>>> 2010-08-21 02:42:00,440 INFO
>>>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>>>>>>>>>>> /************************************************************
>>>>>>>>>>> SHUTDOWN_MSG: Shutting down NameNode at localhost/127.0.0.1
>>>>>>>>>>> ************************************************************/
>>>>>>>>>>>
>>>>>>>>>>> So yes, as Edward says - never let this happen!
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Aug 21, 2010 at 2:26 AM, jiang licht <licht_jiang@yahoo.com> wrote:
>>>>>>>>>>>>> Using nfs folder to back up dfs meta information as follows,
>>>>>>>>>>>>>
>>>>>>>>>>>>> <property>
>>>>>>>>>>>>>          <name>dfs.name.dir</name>
>>>>>>>>>>>>>          <value>/hadoop/dfs/name,/hadoop-backup/dfs/name</value>
>>>>>>>>>>>>>      </property>
>>>>>>>>>>>>>
>>>>>>>>>>>>> where /hadoop-backup is on a backup machine and mounted on the
>>>>>>>>>>>>> master node.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a question: if somehow the backup folder becomes
>>>>>>>>>>>>> unavailable, will it freeze the master node? That is, will a
>>>>>>>>>>>>> write operation simply hang on this condition on the master
>>>>>>>>>>>>> node? Or will the master node log the problem and continue to
>>>>>>>>>>>>> work?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Michael
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>> The above steps were performed using Apache Hadoop 0.20.2, not cloudera's
>>> version of it, if that helps.
>>>
>>> --
>>> Harsh J
>>> www.harshj.com
>>
>
>



You can keep running... but you can't restart: while it is up, the namenode
tolerates losing one of its dfs.name.dir copies, but on startup it insists
that every configured directory exist, as the run below shows.


[edward@ec hadoop-0.20.2]$ /home/edward/hadoop/hadoop-0.20.2/bin/hadoop namenode -format
10/08/27 19:51:03 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ec.media6.deg/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/08/27 19:51:04 INFO namenode.FSNamesystem: fsOwner=edward,edward
10/08/27 19:51:04 INFO namenode.FSNamesystem: supergroup=supergroup
10/08/27 19:51:04 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/08/27 19:51:04 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/08/27 19:51:04 INFO common.Storage: Storage directory /tmp/two has
been successfully formatted.
10/08/27 19:51:04 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/08/27 19:51:04 INFO common.Storage: Storage directory /tmp/one has
been successfully formatted.
10/08/27 19:51:04 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ec.media6.deg/127.0.0.1
************************************************************/
[edward@ec hadoop-0.20.2]$ /home/edward/hadoop/hadoop-0.20.2/bin/hadoop namenode
10/08/27 19:51:13 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ec.media6.deg/127.0.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/08/27 19:51:14 INFO metrics.RpcMetrics: Initializing RPC Metrics
with hostName=NameNode, port=9000
10/08/27 19:51:14 INFO namenode.NameNode: Namenode up at:
localhost/127.0.0.1:9000
10/08/27 19:51:14 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=NameNode, sessionId=null
10/08/27 19:51:14 INFO metrics.NameNodeMetrics: Initializing
NameNodeMeterics using context
object:org.apache.hadoop.metrics.spi.NullContext
10/08/27 19:51:14 INFO namenode.FSNamesystem: fsOwner=edward,edward
10/08/27 19:51:14 INFO namenode.FSNamesystem: supergroup=supergroup
10/08/27 19:51:14 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/08/27 19:51:14 INFO metrics.FSNamesystemMetrics: Initializing
FSNamesystemMetrics using context
object:org.apache.hadoop.metrics.spi.NullContext
10/08/27 19:51:14 INFO namenode.FSNamesystem: Registered FSNamesystemStatusMBean
10/08/27 19:51:14 INFO common.Storage: Number of files = 1
10/08/27 19:51:14 INFO common.Storage: Number of files under construction = 0
10/08/27 19:51:14 INFO common.Storage: Image file of size 96 loaded in
0 seconds.
10/08/27 19:51:14 INFO common.Storage: Edits file
/tmp/two/current/edits of size 4 edits # 0 loaded in 0 seconds.
10/08/27 19:51:14 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/08/27 19:51:14 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/08/27 19:51:14 INFO namenode.FSNamesystem: Finished loading FSImage
in 799 msecs
10/08/27 19:51:14 INFO namenode.FSNamesystem: Total number of blocks = 0
10/08/27 19:51:14 INFO namenode.FSNamesystem: Number of invalid blocks = 0
10/08/27 19:51:14 INFO namenode.FSNamesystem: Number of
under-replicated blocks = 0
10/08/27 19:51:14 INFO namenode.FSNamesystem: Number of
over-replicated blocks = 0
10/08/27 19:51:14 INFO hdfs.StateChange: STATE* Leaving safe mode after 0 secs.
10/08/27 19:51:14 INFO hdfs.StateChange: STATE* Network topology has 0
racks and 0 datanodes
10/08/27 19:51:14 INFO hdfs.StateChange: STATE* UnderReplicatedBlocks
has 0 blocks
10/08/27 19:51:15 INFO mortbay.log: Logging to
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
org.mortbay.log.Slf4jLog
10/08/27 19:51:15 INFO http.HttpServer: Port returned by
webServer.getConnectors()[0].getLocalPort() before open() is -1.
Opening the listener on 50070
10/08/27 19:51:15 INFO http.HttpServer: listener.getLocalPort()
returned 50070 webServer.getConnectors()[0].getLocalPort() returned
50070
10/08/27 19:51:15 INFO http.HttpServer: Jetty bound to port 50070
10/08/27 19:51:15 INFO mortbay.log: jetty-6.1.14
10/08/27 19:51:15 INFO mortbay.log: Started SelectChannelConnector@0.0.0.0:50070
10/08/27 19:51:15 INFO namenode.NameNode: Web-server up at: 0.0.0.0:50070
10/08/27 19:51:15 INFO ipc.Server: IPC Server Responder: starting



[edward@ec tmp]$ /home/edward/hadoop/hadoop-0.20.2/bin/hadoop dfs -mkdir /one
[edward@ec tmp]$ /home/edward/hadoop/hadoop-0.20.2/bin/hadoop dfs -mkdir /two

[edward@ec tmp]$ ls /tmp/one/
current  image  in_use.lock
[edward@ec tmp]$ ls /tmp/two/
current  image  in_use.lock

[edward@ec tmp]$ rm -rf /tmp/two
[edward@ec tmp]$ /home/edward/hadoop/hadoop-0.20.2/bin/hadoop dfs -mkdir /three


[edward@ec tmp]$ /home/edward/hadoop/hadoop-0.20.2/bin/hadoop dfs -ls /
Found 4 items
drwxr-xr-x   - edward supergroup          0 2010-08-27 19:54 /four
drwxr-xr-x   - edward supergroup          0 2010-08-27 19:51 /one
drwxr-xr-x   - edward supergroup          0 2010-08-27 19:53 /three
drwxr-xr-x   - edward supergroup          0 2010-08-27 19:51 /two

^C10/08/27 19:55:31 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ec.media6.deg/127.0.0.1
************************************************************/


[edward@ec hadoop-0.20.2]$ /home/edward/hadoop/hadoop-0.20.2/bin/hadoop namenode
10/08/27 19:56:10 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = ec.media6.deg/127.0.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
10/08/27 19:56:11 INFO metrics.RpcMetrics: Initializing RPC Metrics
with hostName=NameNode, port=9000
10/08/27 19:56:11 INFO namenode.NameNode: Namenode up at:
localhost/127.0.0.1:9000
10/08/27 19:56:11 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=NameNode, sessionId=null
10/08/27 19:56:11 INFO metrics.NameNodeMetrics: Initializing
NameNodeMeterics using context
object:org.apache.hadoop.metrics.spi.NullContext
10/08/27 19:56:11 INFO namenode.FSNamesystem: fsOwner=edward,edward
10/08/27 19:56:11 INFO namenode.FSNamesystem: supergroup=supergroup
10/08/27 19:56:11 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/08/27 19:56:11 INFO metrics.FSNamesystemMetrics: Initializing
FSNamesystemMetrics using context
object:org.apache.hadoop.metrics.spi.NullContext
10/08/27 19:56:11 INFO namenode.FSNamesystem: Registered FSNamesystemStatusMBean
10/08/27 19:56:11 INFO common.Storage: Storage directory /tmp/two does
not exist.
10/08/27 19:56:11 ERROR namenode.FSNamesystem: FSNamesystem
initialization failed.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException:
Directory /tmp/two is in an inconsistent state: storage directory does
not exist or is not accessible.
	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:290)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
10/08/27 19:56:11 INFO ipc.Server: Stopping server on 9000
10/08/27 19:56:11 ERROR namenode.NameNode:
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException:
Directory /tmp/two is in an inconsistent state: storage directory does
not exist or is not accessible.
	at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:290)
	at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

10/08/27 19:56:11 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ec.media6.deg/127.0.0.1
************************************************************/
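
One way out (a suggestion on my part, not something from the run above): as
long as one good copy survives, either drop /tmp/two from dfs.name.dir, or
rebuild it from the surviving directory before starting the namenode again:

[edward@ec tmp]$ cp -rp /tmp/one /tmp/two     # clone the surviving name dir
[edward@ec tmp]$ rm -f /tmp/two/in_use.lock   # discard the copied lock file
[edward@ec tmp]$ /home/edward/hadoop/hadoop-0.20.2/bin/hadoop namenode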
