hadoop-common-user mailing list archives

From Nathan Harkenrider <nathan.harkenri...@gmail.com>
Subject Re: what will happen if a backup name node folder becomes unaccessible?
Date Fri, 27 Aug 2010 21:36:48 GMT
I ran into issues testing on a standalone Hadoop install (Cloudera
Distribution 0.20.1+169.89) where dfs.name.dir was configured to write to a
local disk as well as an NFS mount. If the NFS mount becomes unavailable
while the namenode is running, it logs the following errors and then becomes
unresponsive (the last error was my attempt to write a file). Ideally, I'd
like to write nameinfo to an NFS mount in addition to local disks, but based
on these results, I'm hesitant to, given the likelihood of introducing a
new point of failure into the system.
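
For reference, the shape of the configuration in question (directory names
here are illustrative, not my exact ones):

    <property>
      <name>dfs.name.dir</name>
      <value>/hadoop/nameinfo,/mnt/nfs/nameinfo</value>
    </property>

One mitigation I've seen suggested, though I haven't verified that it avoids
the hang below, is a soft NFS mount, so that a dead server eventually returns
an I/O error to the namenode instead of blocking it indefinitely, e.g.

    mount -t nfs -o tcp,soft,intr,timeo=10,retrans=10 backuphost:/export/nameinfo /mnt/nfs/nameinfo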

2010-08-27 13:09:44,660 ERROR
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync edit
log. Fatal Error.
2010-08-27 13:09:44,661 INFO org.apache.hadoop.hdfs.server.common.Storage:
 removing /hadoop/nameinfo
2010-08-27 13:09:44,662 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 8020, call create(/user/hadoop/global-groovy.log, rwxr-xr-x,
DFSClient_202163196, false, 1, 67108864) from 127.0.0.1:44182: error:
java.io.IOException: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
java.io.IOException: java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at
org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:906)
        at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:984)
        at
org.apache.hadoop.hdfs.server.namenode.NameNode.create(NameNode.java:407)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)

It also appears that the namenode will fail to start when the NFS mount is
unavailable.

2010-08-27 13:58:02,329 ERROR
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
initialization failed.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory
/hadoop/nameinfo is in an inconsistent state: storage directory does not
exist or is not accessible.
        at
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:290)
        at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
        at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:312)
        at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:293)
        at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:224)
        at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:306)
        at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1006)
        at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1015)
2010-08-27 13:58:02,331 INFO org.apache.hadoop.ipc.Server: Stopping server
on 8020
2010-08-27 13:58:02,331 ERROR
org.apache.hadoop.hdfs.server.namenode.NameNode:
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory
/hadoop/nameinfo is in an inconsistent state: storage directory does not
exist or is not accessible.
        at
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:290)
        at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
        at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:312)
        at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:293)
        at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:224)
        at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:306)
        at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1006)
        at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1015)

2010-08-27 13:58:02,332 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
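
In case it helps anyone else hitting this, the recovery sketch I'd try (only
lightly tested here; paths are from my setup above) is either to restore the
mount or to drop it from dfs.name.dir before restarting:

    # restore the missing mount (assuming an fstab entry), then restart
    sudo mount /hadoop/nameinfo
    /usr/lib/hadoop/bin/hadoop-daemon.sh start namenode

    # or: edit hdfs-site.xml to remove the NFS path from dfs.name.dir and
    # start the namenode against the remaining local directory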



On Tue, Aug 24, 2010 at 5:09 PM, jiang licht <licht_jiang@yahoo.com> wrote:

> Thanks for sharing your experiments, Sudhir. I guess this applies to an NFS
> mounted folder as well?
>
> Will someone from Cloudera confirm the same behavior for other releases?
>
> Thanks,
>
> Michael
>
> --- On Tue, 8/24/10, Sudhir Vallamkondu <Sudhir.Vallamkondu@icrossing.com>
> wrote:
>
> From: Sudhir Vallamkondu <Sudhir.Vallamkondu@icrossing.com>
> Subject: Re: what will happen if a backup name node folder becomes
> unaccessible?
> To: common-user@hadoop.apache.org
> Date: Tuesday, August 24, 2010, 6:59 PM
>
> The Cloudera distribution seems to be working fine when a dfs.name.dir
> directory becomes inaccessible while the namenode is running.
>
> See below
>
> hadoop@training-vm:~$ hadoop version
> Hadoop 0.20.1+152
> Subversion  -r c15291d10caa19c2355f437936c7678d537adf94
> Compiled by root on Mon Nov  2 05:15:37 UTC 2009
>
> hadoop@training-vm:~$ jps
> 8923 Jps
> 8548 JobTracker
> 8467 SecondaryNameNode
> 8250 NameNode
> 8357 DataNode
> 8642 TaskTracker
>
> hadoop@training-vm:~$ /usr/lib/hadoop/bin/stop-all.sh
> stopping jobtracker
> localhost: stopping tasktracker
> stopping namenode
> localhost: stopping datanode
> localhost: stopping secondarynamenode
>
> hadoop@training-vm:~$ mkdir edit_log_dir1
>
> hadoop@training-vm:~$ mkdir edit_log_dir2
>
> hadoop@training-vm:~$ ls
> edit_log_dir1  edit_log_dir2
>
> hadoop@training-vm:~$ ls -ltr /var/lib/hadoop-0.20/cache/hadoop/dfs/name
> total 8
> drwxr-xr-x 2 hadoop hadoop 4096 2009-10-15 16:17 image
> drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 15:56 current
>
> hadoop@training-vm:~$ cp -r /var/lib/hadoop-0.20/cache/hadoop/dfs/name
> edit_log_dir1
>
> hadoop@training-vm:~$ cp -r /var/lib/hadoop-0.20/cache/hadoop/dfs/name
> edit_log_dir2
>
> ------ hdfs-site.xml added new dirs
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <configuration>
>   <property>
>     <name>dfs.replication</name>
>     <value>1</value>
>   </property>
>   <property>
>      <name>dfs.permissions</name>
>      <value>false</value>
>   </property>
>   <property>
>      <!-- specify this so that running 'hadoop namenode -format' formats
>           the right dir -->
>      <name>dfs.name.dir</name>
>
> <value>/var/lib/hadoop-0.20/cache/hadoop/dfs/name,/home/hadoop/edit_log_dir1,/home/hadoop/edit_log_dir2</value>
>   </property>
>    <property>
>      <name>fs.checkpoint.period</name>
>      <value>600</value>
>   </property>
>   <property>
>     <name>dfs.namenode.plugins</name>
>     <value>org.apache.hadoop.thriftfs.NamenodePlugin</value>
>   </property>
>   <property>
>     <name>dfs.datanode.plugins</name>
>     <value>org.apache.hadoop.thriftfs.DatanodePlugin</value>
>   </property>
>   <property>
>     <name>dfs.thrift.address</name>
>     <value>0.0.0.0:9090</value>
>   </property>
> </configuration>
>
> ---- start all daemons
>
> hadoop@training-vm:~$ /usr/lib/hadoop/bin/start-all.sh
> starting namenode, logging to
> /usr/lib/hadoop/bin/../logs/hadoop-hadoop-namenode-training-vm.out
> localhost: starting datanode, logging to
> /usr/lib/hadoop/bin/../logs/hadoop-hadoop-datanode-training-vm.out
> localhost: starting secondarynamenode, logging to
> /usr/lib/hadoop/bin/../logs/hadoop-hadoop-secondarynamenode-training-vm.out
> starting jobtracker, logging to
> /usr/lib/hadoop/bin/../logs/hadoop-hadoop-jobtracker-training-vm.out
> localhost: starting tasktracker, logging to
> /usr/lib/hadoop/bin/../logs/hadoop-hadoop-tasktracker-training-vm.out
>
>
> -------- namenode log confirms all dirs in use
>
> 2010-08-24 16:20:48,718 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
> /************************************************************
> STARTUP_MSG: Starting NameNode
> STARTUP_MSG:   host = training-vm/127.0.0.1
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.20.1+152
> STARTUP_MSG:   build =  -r c15291d10caa19c2355f437936c7678d537adf94;
> compiled by 'root' on Mon Nov  2 05:15:37 UTC 2009
> ************************************************************/
> 2010-08-24 16:20:48,815 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
> Initializing RPC Metrics with hostName=NameNode, port=8022
> 2010-08-24 16:20:48,819 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
> localhost/127.0.0.1:8022
> 2010-08-24 16:20:48,821 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
> Initializing JVM Metrics with processName=NameNode, sessionId=null
> 2010-08-24 16:20:48,822 INFO
> org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics:
> Initializing
> NameNodeMeterics using context
> object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
> 2010-08-24 16:20:48,894 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
> 2010-08-24 16:20:48,894 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
> 2010-08-24 16:20:48,894 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> isPermissionEnabled=false
> 2010-08-24 16:20:48,903 INFO
> org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
> Initializing FSNamesystemMetrics using context
> object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
> 2010-08-24 16:20:48,905 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
> FSNamesystemStatusMBean
> 2010-08-24 16:20:48,937 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Storage directory /home/hadoop/edit_log_dir1 is not formatted.
> 2010-08-24 16:20:48,937 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Formatting ...
> 2010-08-24 16:20:48,937 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Storage directory /home/hadoop/edit_log_dir2 is not formatted.
> 2010-08-24 16:20:48,937 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Formatting ...
> 2010-08-24 16:20:48,938 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Number of files = 41
> 2010-08-24 16:20:48,947 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Number of files under construction = 0
> 2010-08-24 16:20:48,947 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 4357 loaded in 0 seconds.
>
> ---- directories confirmed in use
>
> hadoop@training-vm:~$ ls -ltr edit_log_dir1
> total 12
> drwxr-xr-x 4 hadoop hadoop 4096 2010-08-24 16:01 name
> -rw-r--r-- 1 hadoop hadoop    0 2010-08-24 16:20 in_use.lock
> drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 16:20 image
> drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 16:20 current
>
> hadoop@training-vm:~$ ls -ltr edit_log_dir2
> total 12
> drwxr-xr-x 4 hadoop hadoop 4096 2010-08-24 16:01 name
> -rw-r--r-- 1 hadoop hadoop    0 2010-08-24 16:20 in_use.lock
> drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 16:20 image
> drwxr-xr-x 2 hadoop hadoop 4096 2010-08-24 16:20 current
>
> ----- secondary name node checkpoint worked fine
>
> 2010-08-24 16:27:10,555 INFO
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file
> fsimage size 4357 bytes.
> 2010-08-24 16:27:10,557 INFO
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Downloaded file
> edits size 895 bytes.
> 2010-08-24 16:27:10,603 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop
> 2010-08-24 16:27:10,603 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
> 2010-08-24 16:27:10,603 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> isPermissionEnabled=false
> 2010-08-24 16:27:10,622 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Number of files = 41
> 2010-08-24 16:27:10,629 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Number of files under construction = 0
> 2010-08-24 16:27:10,635 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Edits file /var/lib/hadoop-0.20/cache/hadoop/dfs/namesecondary/current/edits
> of size 895 edits # 10 loaded in 0 seconds.
> 2010-08-24 16:27:10,658 INFO org.apache.hadoop.hdfs.server.common.Storage:
> Image file of size 4461 saved in 0 seconds.
> 2010-08-24 16:27:10,745 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions:
> 0 Total time for transactions(ms): 0Number of transactions batched in Syncs:
> 0 Number of syncs: 0 SyncTimes(ms): 0
> 2010-08-24 16:27:10,756 INFO
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Posted URL
> localhost:50070putimage=1&port=50090&machine=127.0.0.1&token=-18:1431678956:1255648991179:1282692430000:1282692049090
> 2010-08-24 16:27:11,008 WARN
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done.
> New Image Size: 4461
>
> --- directory put works fine
>
> hadoop@training-vm:~$ hadoop fs -ls /user/training
> Found 3 items
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:18
> /user/training/grep_output
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:14
> /user/training/input
> drwxr-xr-x   - training supergroup          0 2010-06-30 15:30
> /user/training/output
>
> hadoop@training-vm:~$ hadoop fs -put
> /etc/hadoop/conf.with-desktop/hdfs-site.xml /user/training
>
> hadoop@training-vm:~$ hadoop fs -ls /user/training
> Found 4 items
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:18
> /user/training/grep_output
> -rw-r--r--   1 hadoop   supergroup        987 2010-08-24 16:25
> /user/training/hdfs-site.xml
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:14
> /user/training/input
> drwxr-xr-x   - training supergroup          0 2010-06-30 15:30
> /user/training/output
>
>
> ------ delete one of the directories
> hadoop@training-vm:~$ rm -rf edit_log_dir2
>
> hadoop@training-vm:~$ ls -ltr
> total 4
> drwxr-xr-x 5 hadoop hadoop 4096 2010-08-24 16:20 edit_log_dir1
>
> -- namenode logs
>
> No errors/warns in logs
>
> -------- namenode still running
>
> hadoop@training-vm:~$ jps
> 12426 NameNode
> 12647 SecondaryNameNode
> 12730 JobTracker
> 14090 Jps
> 12535 DataNode
> 12826 TaskTracker
>
> ----  puts and ls work fine
>
> hadoop@training-vm:~$ hadoop fs -ls /user/training
> Found 4 items
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:18
> /user/training/grep_output
> -rw-r--r--   1 hadoop   supergroup        987 2010-08-24 16:25
> /user/training/hdfs-site.xml
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:14
> /user/training/input
> drwxr-xr-x   - training supergroup          0 2010-06-30 15:30
> /user/training/output
>
> hadoop@training-vm:~$ hadoop fs -put
> /etc/hadoop/conf.with-desktop/core-site.xml /user/training
>
> hadoop@training-vm:~$ hadoop fs -put
> /etc/hadoop/conf.with-desktop/mapred-site.xml /user/training
>
> hadoop@training-vm:~$ hadoop fs -ls /user/training
> Found 6 items
> -rw-r--r--   1 hadoop   supergroup        338 2010-08-24 16:28
> /user/training/core-site.xml
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:18
> /user/training/grep_output
> -rw-r--r--   1 hadoop   supergroup        987 2010-08-24 16:25
> /user/training/hdfs-site.xml
> drwxr-xr-x   - training supergroup          0 2010-06-30 13:14
> /user/training/input
> -rw-r--r--   1 hadoop   supergroup        454 2010-08-24 16:29
> /user/training/mapred-site.xml
> drwxr-xr-x   - training supergroup          0 2010-06-30 15:30
> /user/training/output
>
> ------- secondary namenode checkpoint is successful
>
> 2010-08-24 16:37:11,455 WARN
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done.
> New Image Size: 4671
> ....
> 2010-08-24 16:47:11,884 WARN
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done.
> New Image Size: 4671
> ...
> 2010-08-24 16:57:12,264 WARN
> org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done.
> New Image Size: 4671
>
> ------- after 30 mins
>
> hadoop@training-vm:~$ jps
> 12426 NameNode
> 12647 SecondaryNameNode
> 12730 JobTracker
> 16256 Jps
> 12535 DataNode
> 12826 TaskTracker
>
>
>
>
>
>
>
>
> On Aug/24/ 11:05 AM, "common-user-digest-help@hadoop.apache.org"
> <common-user-digest-help@hadoop.apache.org> wrote:
>
> > From: jiang licht <licht_jiang@yahoo.com>
> > Date: Tue, 24 Aug 2010 10:38:32 -0700 (PDT)
> > To: <common-user@hadoop.apache.org>
> > Subject: Re: what will happen if a backup name node folder becomes
> > unaccessible?
> >
> > Sudhir,
> >
> > Look forward to your results, if possible with different CDH releases.
> >
> > Thanks,
> >
> > Michael
> >
> > --- On Tue, 8/24/10, Sudhir Vallamkondu <
> Sudhir.Vallamkondu@icrossing.com>
> > wrote:
> >
> > From: Sudhir Vallamkondu <Sudhir.Vallamkondu@icrossing.com>
> > Subject: Re: what will happen if a backup name node folder becomes
> > unaccessible?
> > To: common-user@hadoop.apache.org
> > Date: Tuesday, August 24, 2010, 10:47 AM
> >
> > Harsh
> >
> > You seem to be getting an "all storage directories inaccessible" error.
> > Strange, because per the code that only gets thrown when all dirs are
> > inaccessible. In any case, I will test it on the Cloudera distribution
> > today and publish results.
> >
> > - Sudhir
> >
> >
> > On Aug/24/ 5:08 AM, "common-user-digest-help@hadoop.apache.org"
> > <common-user-digest-help@hadoop.apache.org> wrote:
> >
> >> From: Harsh J <qwertymaniac@gmail.com>
> >> Date: Tue, 24 Aug 2010 11:41:48 +0530
> >> To: <common-user@hadoop.apache.org>
> >> Subject: Re: common-user Digest 23 Aug 2010 21:21:26 -0000 Issue 1518
> >>
> >> Hello Sudhir,
> >>
> >> You're right about this, but I don't seem to be getting the warning for the
> >> edit log IOException at all in the first place. Here are my steps to get to
> >> what I described earlier (note that I am just using two directories on the
> >> same disk, not two different devices or NFS, etc.). It's my personal computer,
> >> so I don't mind doing this again for now (as the other directory remains
> >> untouched).
> >>
> >> *hadoop 11:13:00 ~/.hadoop $* jps
> >>
> >> 4954 SecondaryNameNode
> >>
> >> 5911 Jps
> >>
> >> 5158 TaskTracker
> >>
> >> 4592 NameNode
> >>
> >> 5650 JobTracker
> >>
> >> 4768 DataNode
> >>
> >> *hadoop 11:13:02 ~/.hadoop $* hadoop dfs -ls
> >>
> >> Found 2 items
> >>
> >> -rw-r--r--   1 hadoop supergroup     411536 2010-08-18 15:50
> >> /user/hadoop/data
> >> drwxr-xr-x   - hadoop supergroup          0 2010-08-18 16:02
> >> /user/hadoop/dataout
> >> hadoop 11:13:07 ~/.hadoop $ tail -n 10 conf/hdfs-site.xml
> >>
> >>   <property>
> >>
> >>     <name>*dfs.name.dir*</name>
> >>
> >>     <value>/home/hadoop/.dfs/name,*/home/hadoop/.dfs/testdir*</value>
> >>
> >>     <final>true</final>
> >>
> >>   </property>
> >>
> >>   <property>
> >>
> >>     <name>dfs.datanode.max.xcievers</name>
> >>
> >>     <value>2047</value>
> >>
> >>   </property>
> >>
> >> </configuration>
> >>
> >> *hadoop 11:13:25 ~/.hadoop $* ls ~/.dfs/
> >>
> >> data  name  testdir
> >>
> >> *hadoop 11:13:36 ~/.hadoop $ rm -r ~/.dfs/testdir  *
> >>
> >> *hadoop 11:13:49 ~/.hadoop $* jps
> >>
> >> 6135 Jps
> >>
> >> 4954 SecondaryNameNode
> >>
> >> 5158 TaskTracker
> >>
> >> 4592 NameNode
> >>
> >> 5650 JobTracker
> >>
> >> 4768 DataNode
> >>
> >> *hadoop 11:13:56 ~/.hadoop $* hadoop dfs -put /etc/profile profile1
> >>
> >> *hadoop 11:14:10 ~/.hadoop $* hadoop dfs -put /etc/profile profile2
> >>
> >> *hadoop 11:14:12 ~/.hadoop $* hadoop dfs -put /etc/profile profile3
> >>
> >> *hadoop 11:14:15 ~/.hadoop $* hadoop dfs -put /etc/profile profile4
> >>
> >>
> >> *hadoop 11:17:21 ~/.hadoop $* jps
> >> 4954 SecondaryNameNode
> >>
> >> 5158 TaskTracker
> >>
> >> 4592 NameNode
> >>
> >> 5650 JobTracker
> >>
> >> 4768 DataNode
> >>
> >> 6954 Jps
> >>
> >> *hadoop 11:17:23 ~/.hadoop $* tail -f
> >> hadoop-0.20.2/logs/hadoop-hadoop-namenode-hadoop.log
> >> 2010-08-24 11:14:17,632 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> >> NameSystem.allocateBlock: /user/hadoop/profile4. blk_28644972299224370_1019
> >>
> >> 2010-08-24 11:14:17,709 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> >> NameSystem.addStoredBlock: blockMap updated: 192.168.1.8:50010 is added to
> >> blk_28644972299224370_1019 size 497
> >> 2010-08-24 11:14:17,713 INFO org.apache.hadoop.hdfs.StateChange: DIR*
> >> NameSystem.completeFile: file /user/hadoop/profile4 is closed by
> >> DFSClient_-2054565417
> >> 2010-08-24 11:17:31,187 INFO
> >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from
> >> 192.168.1.8
> >>
> >> 2010-08-24 11:17:31,187 INFO
> >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
> >> transactions: 19 Total time for transactions(ms): 4Number of transactions
> >> batched in Syncs: 0 Number of syncs: 14 SyncTimes(ms): 183 174
> >>
> >> 2010-08-24 11:17:31,281 FATAL
> >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All
> >> storage directories are inaccessible.
> >>
> >> 2010-08-24 11:17:31,283 INFO
> >> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> >> /************************************************************
> >>
> >> SHUTDOWN_MSG: Shutting down NameNode at hadoop.cf.net/127.0.0.1
> >>
> >> ************************************************************/
> >>
> >> ^C
> >> *hadoop 11:17:51 ~/.hadoop $* ls /home/hadoop/.dfs/
> >>
> >> data  name
> >> *hadoop 11:21:14 ~/.hadoop $* jps
> >> 8259 Jps
> >>
> >> 4954 SecondaryNameNode
> >>
> >> 5158 TaskTracker
> >>
> >> 5650 JobTracker
> >>
> >> 4768 DataNode
> >> *hadoop 11:36:03 ~/.hadoop $* mkdir ~/.dfs/testdir
> >> *hadoop 11:36:04 ~/.hadoop $ *stop-all.sh
> >> stopping jobtracker
> >>
> >> localhost: stopping tasktracker
> >>
> >> no namenode to stop
> >>
> >> localhost: stopping datanode
> >>
> >> localhost: stopping secondarynamenode
> >> *hadoop 11:37:01 ~/.hadoop $ *start-all.sh
> >> starting namenode, logging to
> >>
>
> /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-namenode-hadoop.>>
> o
> >> ut
> >>
> >>
> >> localhost: starting datanode, logging to
> >>
>
> /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-datanode-hadoop.>>
> o
> >> ut
> >>
> >> localhost: starting secondarynamenode, logging to
> >>
>
> /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-secondarynamenod>>
> e
> >> -hadoop.out
> >>
> >> starting jobtracker, logging to
> >>
>
> /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-jobtracker-hadoo>>
> p
> >> .out
> >>
> >>
> >> localhost: starting tasktracker, logging to
> >>
>
> /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-tasktracker-hado>>
> o
> >> p.out
> >> *hadoop 11:39:30 ~/.hadoop $* hadoop dfs -ls
> >> Found 6 items
> >>
> >> -rw-r--r--   1 hadoop supergroup     411536 2010-08-18 15:50
> >> /user/hadoop/data
> >> drwxr-xr-x   - hadoop supergroup          0 2010-08-18 16:02
> >> /user/hadoop/dataout
> >> -rw-r--r--   1 hadoop supergroup        497 2010-08-24 11:14
> >> /user/hadoop/profile1
> >> -rw-r--r--   1 hadoop supergroup        497 2010-08-24 11:14
> >> /user/hadoop/profile2
> >> -rw-r--r--   1 hadoop supergroup        497 2010-08-24 11:14
> >> /user/hadoop/profile3
> >> -rw-r--r--   1 hadoop supergroup        497 2010-08-24 11:14
> >> /user/hadoop/profile4
> >>
> >>
> >>
> >> On Tue, Aug 24, 2010 at 10:49 AM, Sudhir Vallamkondu <
> >> Sudhir.Vallamkondu@icrossing.com> wrote:
> >>>> Looking at the codebase, it seems to suggest that it ignores an editlog
> >>>> storage directory if it encounters an error:
> >>>>
> >>>> http://www.google.com/codesearch/p?hl=en#GLh8vwsjDqs/trunk/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java&q=namenode%20editlog&sa=N&cd=20&ct=rc
> >>>>
> >>>> Check lines:
> >>>> Code in line 334
> >>>> comment: 387 - 390
> >>>> comment: 411 - 414
> >>>> comment: 433 - 436
> >>>>
> >>>> The processIOError method is called throughout the code if it encounters
> >>>> an IOException.
> >>>>
> >>>> A fatal error is only thrown if none of the storage directories is
> >>>> accessible. Lines 394, 420
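> >>>>
> >>>> In rough pseudo-Java, that behavior (my paraphrase, not the actual
> >>>> FSEditLog source) looks like:
> >>>>
> >>>>   // on an IOException against one edits directory: drop that directory
> >>>>   // and carry on; only die when no usable directory remains
> >>>>   void processIOError(int index) {
> >>>>     editStreams.remove(index);
> >>>>     if (editStreams.isEmpty()) {
> >>>>       throw new RuntimeException(
> >>>>           "Fatal Error : All storage directories are inaccessible.");
> >>>>     }
> >>>>   }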
> >>>>
> >>>> - Sudhir
> >>>>
> >>>>
> >>>>
> >>>> On Aug/23/ 2:21 PM, "common-user-digest-help@hadoop.apache.org"
> >>>> <common-user-digest-help@hadoop.apache.org> wrote:
> >>>>
> >>>>>> From: Michael Segel <michael_segel@hotmail.com>
> >>>>>> Date: Mon, 23 Aug 2010 14:05:05 -0500
> >>>>>> To: <common-user@hadoop.apache.org>
> >>>>>> Subject: RE: what will happen if a backup name node folder becomes
> >>>>>> unaccessible?
> >>>>>>
> >>>>>>
> >>>>>> Ok...
> >>>>>>
> >>>>>> Now you have me confused.
> >>>>>> Everything we've seen says that writing to both a local disk and to an
> >>>>>> NFS mounted disk would be the best way to prevent a problem.
> >>>>>>
> >>>>>> Now you and Harsh J say that this could actually be problematic.
> >>>>>>
> >>>>>> Which is it?
> >>>>>> Is this now a defect that should be addressed, or should we just not
> >>>>>> use an NFS mounted drive?
> >>>>>>
> >>>>>> Thx
> >>>>>>
> >>>>>> -Mike
> >>>>>>
> >>>>>>
> >>>>>>>> Date: Mon, 23 Aug 2010 11:42:59 -0700
> >>>>>>>> From: licht_jiang@yahoo.com
> >>>>>>>> Subject: Re: what will happen if a backup name node folder becomes
> >>>>>>>> unaccessible?
> >>>>>>>> To: common-user@hadoop.apache.org
> >>>>>>>>
> >>>>>>>> This makes a good argument. Actually, after seeing the previous reply,
> >>>>>>>> I am kind of convinced that I should go back to "sync"ing the meta data
> >>>>>>>> to a backup location instead of using this feature, which, as David
> >>>>>>>> mentioned, introduces a 2nd single point of failure to hadoop and
> >>>>>>>> degrades the availability of hadoop. BTW, we are using cloudera package
> >>>>>>>> hadoop-0.20.2+228. Can someone confirm whether a name node will shut
> >>>>>>>> down given that a backup folder listed in "dfs.name.dir" becomes
> >>>>>>>> unavailable in this version?
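> >>>>>>>>
> >>>>>>>> (By "sync" I mean something along these lines -- the host name and
> >>>>>>>> schedule are only an example, and a live copy can catch the edit log
> >>>>>>>> mid-write, so it is safest right after a checkpoint:)
> >>>>>>>>
> >>>>>>>>   # cron: copy the name node meta data to a backup host every 10 minutes
> >>>>>>>>   */10 * * * * rsync -a --delete /hadoop/dfs/name/ backuphost:/hadoop-backup/dfs/name/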
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> Michael
> >>>>>>>>
> >>>>>>>> --- On Sun, 8/22/10, David B. Ritch <david.ritch@gmail.com>
> wrote:
> >>>>>>>>
> >>>>>>>> From: David B. Ritch <david.ritch@gmail.com>
> >>>>>>>> Subject: Re: what will happen if a backup name node
folder becomes
> >>>>>>>> unaccessible?
> >>>>>>>> To: common-user@hadoop.apache.org
> >>>>>>>> Date: Sunday, August 22, 2010, 11:34 PM
> >>>>>>>>
> >>>>>>>>   Which version of Hadoop was this?  The folks at Cloudera have
> >>>>>>>> assured me that the namenode in CDH2 will continue as long as one of
> >>>>>>>> the directories is still writable.
> >>>>>>>>
> >>>>>>>> It *does* seem a bit of a waste if an availability feature - the
> >>>>>>>> ability to write to multiple directories - actually reduces
> >>>>>>>> availability by providing an additional single point of failure.
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> dbr
> >>>>>>>>
> >>>>>>>> On 8/20/2010 5:27 PM, Harsh J wrote:
> >>>>>>>>>> Whee, let's try it out:
> >>>>>>>>>>
> >>>>>>>>>> Start with both paths available. ... Starts fine.
> >>>>>>>>>> Store some files. ... Works.
> >>>>>>>>>> rm -r the second path. ... Ouch.
> >>>>>>>>>> Store some more files. ... Still Works. [Cuz the SNN hasn't sent us
> >>>>>>>>>> stuff back yet]
> >>>>>>>>>> Wait for checkpoint to hit.
> >>>>>>>>>> And ...
> >>>>>>>>>> Boom!
> >>>>>>>>>>
> >>>>>>>>>> 2010-08-21 02:42:00,385 INFO
> >>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log
> >>>>>>>>>> from 127.0.0.1
> >>>>>>>>>> 2010-08-21 02:42:00,385 INFO
> >>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
> >>>>>>>>>> transactions: 37 Total time for transactions(ms): 6Number of
> >>>>>>>>>> transactions batched in Syncs: 0 Number of syncs: 26 SyncTimes(ms):
> >>>>>>>>>> 307 277
> >>>>>>>>>> 2010-08-21 02:42:00,439 FATAL
> >>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All
> >>>>>>>>>> storage directories are inaccessible.
> >>>>>>>>>> 2010-08-21 02:42:00,440 INFO
> >>>>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> >>>>>>>>>> /************************************************************
> >>>>>>>>>> SHUTDOWN_MSG: Shutting down NameNode at localhost/127.0.0.1
> >>>>>>>>>> ************************************************************/
> >>>>>>>>>>
> >>>>>>>>>> So yes, as Edward says - never let this happen!
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Aug 21, 2010 at 2:26 AM, jiang licht <licht_jiang@yahoo.com> wrote:
> >>>>>>>>>>>> Using nfs folder to back up dfs meta information as follows,
> >>>>>>>>>>>>
> >>>>>>>>>>>> <property>
> >>>>>>>>>>>>     <name>dfs.name.dir</name>
> >>>>>>>>>>>>     <value>/hadoop/dfs/name,/hadoop-backup/dfs/name</value>
> >>>>>>>>>>>> </property>
> >>>>>>>>>>>>
> >>>>>>>>>>>> where /hadoop-backup is on a backup machine and mounted on the
> >>>>>>>>>>>> master node.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I have a question: if somehow, the backup folder becomes unavailable,
> >>>>>>>>>>>> will it freeze the master node? That is, will write operations simply
> >>>>>>>>>>>> hang up on this condition on the master node? Or will the master node
> >>>>>>>>>>>> log the problem and continue to work?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Michael
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>
> >>>>
> >>
> >> Above steps were performed using Apache Hadoop 0.20.2. Not Cloudera's
> >> version of it, if that helps.
> >>
> >> --
> >> Harsh J
> >> www.harshj.com
> >
>
>
>
>
>
>
