hadoop-common-user mailing list archives

From Sudhir Vallamkondu <Sudhir.Vallamko...@icrossing.com>
Subject Re: what will happen if a backup name node folder becomes unaccessible?
Date Tue, 24 Aug 2010 15:47:13 GMT
Harsh

You seem to be getting an "all storage directories inaccessible" error.
Strange, because per the code that error is only thrown when all directories
are inaccessible. In any case, I will test it on the Cloudera distribution
today and publish the results.
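
For reference, the logic being discussed boils down to the pattern below.
This is a simplified sketch with illustrative names, not the actual
FSEditLog code: a directory that throws an IOException is dropped from the
list, and only when no directory is left does the NameNode abort with the
"All storage directories are inaccessible" fatal error.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified sketch of the edit-log error handling discussed in this
// thread; class and method names are illustrative, not Hadoop's.
public class EditLogSketch {
    private final List<File> dirs = new ArrayList<File>();

    public EditLogSketch(List<File> storageDirs) {
        dirs.addAll(storageDirs);
    }

    // Append a record to every storage directory; a directory that fails
    // is removed and writing continues on the remaining ones.
    public void logEdit(byte[] record) {
        for (Iterator<File> it = dirs.iterator(); it.hasNext();) {
            File dir = it.next();
            try (FileOutputStream out =
                     new FileOutputStream(new File(dir, "edits"), true)) {
                out.write(record);
            } catch (IOException e) {
                it.remove(); // ignore this directory from now on
            }
        }
        // Fatal only when no usable storage directory remains.
        if (dirs.isEmpty()) {
            throw new IllegalStateException(
                "Fatal Error : All storage directories are inaccessible.");
        }
    }
}

Given that, removing just one of two directories should not by itself
produce the fatal error, which is why the result in the transcript below
looks strange.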

- Sudhir


On Aug 24, 2010, at 5:08 AM, "common-user-digest-help@hadoop.apache.org"
<common-user-digest-help@hadoop.apache.org> wrote:

> From: Harsh J <qwertymaniac@gmail.com>
> Date: Tue, 24 Aug 2010 11:41:48 +0530
> To: <common-user@hadoop.apache.org>
> Subject: Re: common-user Digest 23 Aug 2010 21:21:26 -0000 Issue 1518
> 
> Hello Sudhir,
> 
> You're right about this, but I don't seem to be getting the warning for the
> edit log IOException at all in the first place. Here are my steps to get to
> what I described earlier (note that I am just using two directories on the
> same disk, not two different devices, NFS, etc.). It's my personal computer,
> so I don't mind doing this again for now (as the other directory remains
> untouched).
> 
> hadoop 11:13:00 ~/.hadoop $ jps
> 
> 4954 SecondaryNameNode
> 
> 5911 Jps
> 
> 5158 TaskTracker
> 
> 4592 NameNode
> 
> 5650 JobTracker
> 
> 4768 DataNode
> 
> hadoop 11:13:02 ~/.hadoop $ hadoop dfs -ls
> 
> Found 2 items
> 
> -rw-r--r--   1 hadoop supergroup     411536 2010-08-18 15:50
> /user/hadoop/data
> drwxr-xr-x   - hadoop supergroup          0 2010-08-18 16:02
> /user/hadoop/dataout
> hadoop 11:13:07 ~/.hadoop $ tail -n 10 conf/hdfs-site.xml
> 
>  <property>
> 
>    <name>dfs.name.dir</name>
> 
>    <value>/home/hadoop/.dfs/name,/home/hadoop/.dfs/testdir</value>
> 
>    <final>true</final>
> 
>  </property>
> 
>  <property>
> 
>    <name>dfs.datanode.max.xcievers</name>
> 
>    <value>2047</value>
> 
>  </property>
> 
> </configuration>
> 
> hadoop 11:13:25 ~/.hadoop $ ls ~/.dfs/
> 
> data  name  testdir
> 
> hadoop 11:13:36 ~/.hadoop $ rm -r ~/.dfs/testdir
> 
> hadoop 11:13:49 ~/.hadoop $ jps
> 
> 6135 Jps
> 
> 4954 SecondaryNameNode
> 
> 5158 TaskTracker
> 
> 4592 NameNode
> 
> 5650 JobTracker
> 
> 4768 DataNode
> 
> hadoop 11:13:56 ~/.hadoop $ hadoop dfs -put /etc/profile profile1
> 
> hadoop 11:14:10 ~/.hadoop $ hadoop dfs -put /etc/profile profile2
> 
> hadoop 11:14:12 ~/.hadoop $ hadoop dfs -put /etc/profile profile3
> 
> hadoop 11:14:15 ~/.hadoop $ hadoop dfs -put /etc/profile profile4
> 
> 
> hadoop 11:17:21 ~/.hadoop $ jps
> 4954 SecondaryNameNode
> 
> 5158 TaskTracker
> 
> 4592 NameNode
> 
> 5650 JobTracker
> 
> 4768 DataNode
> 
> 6954 Jps
> 
> hadoop 11:17:23 ~/.hadoop $ tail -f hadoop-0.20.2/logs/hadoop-hadoop-namenode-hadoop.log
> 2010-08-24 11:14:17,632 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock: /user/hadoop/profile4. blk_28644972299224370_1019
> 
> 2010-08-24 11:14:17,709 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated: 192.168.1.8:50010 is added to
> blk_28644972299224370_1019 size 497
> 2010-08-24 11:14:17,713 INFO org.apache.hadoop.hdfs.StateChange: DIR*
> NameSystem.completeFile: file /user/hadoop/profile4 is closed by
> DFSClient_-2054565417
> 2010-08-24 11:17:31,187 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from
> 192.168.1.8
> 
> 2010-08-24 11:17:31,187 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of transactions:
> 19 Total time for transactions(ms): 4Number of transactions batched in
> Syncs: 0 Number of syncs: 14 SyncTimes(ms): 183 174
> 
> 2010-08-24 11:17:31,281 FATAL
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All
> storage directories are inaccessible.
> 
> 2010-08-24 11:17:31,283 INFO
> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
> /************************************************************
> 
> SHUTDOWN_MSG: Shutting down NameNode at hadoop.cf.net/127.0.0.1
> 
> ************************************************************/
> 
> ^C
> hadoop 11:17:51 ~/.hadoop $ ls /home/hadoop/.dfs/
> 
> data  name
> hadoop 11:21:14 ~/.hadoop $ jps
> 8259 Jps
> 
> 4954 SecondaryNameNode
> 
> 5158 TaskTracker
> 
> 5650 JobTracker
> 
> 4768 DataNode
> hadoop 11:36:03 ~/.hadoop $ mkdir ~/.dfs/testdir
> hadoop 11:36:04 ~/.hadoop $ stop-all.sh
> stopping jobtracker
> 
> localhost: stopping tasktracker
> 
> no namenode to stop
> 
> localhost: stopping datanode
> 
> localhost: stopping secondarynamenode
> hadoop 11:37:01 ~/.hadoop $ start-all.sh
> starting namenode, logging to
> /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-namenode-hadoop.out
> 
> localhost: starting datanode, logging to
> /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-datanode-hadoop.out
> 
> localhost: starting secondarynamenode, logging to
> /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-secondarynamenode-hadoop.out
> 
> starting jobtracker, logging to
> /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-jobtracker-hadoop.out
> 
> localhost: starting tasktracker, logging to
> /home/hadoop/.hadoop/hadoop-0.20.2/bin/../logs/hadoop-hadoop-tasktracker-hadoop.out
> hadoop 11:39:30 ~/.hadoop $ hadoop dfs -ls
> Found 6 items
> 
> -rw-r--r--   1 hadoop supergroup     411536 2010-08-18 15:50
> /user/hadoop/data
> drwxr-xr-x   - hadoop supergroup          0 2010-08-18 16:02
> /user/hadoop/dataout
> -rw-r--r--   1 hadoop supergroup        497 2010-08-24 11:14
> /user/hadoop/profile1
> -rw-r--r--   1 hadoop supergroup        497 2010-08-24 11:14
> /user/hadoop/profile2
> -rw-r--r--   1 hadoop supergroup        497 2010-08-24 11:14
> /user/hadoop/profile3
> -rw-r--r--   1 hadoop supergroup        497 2010-08-24 11:14
> /user/hadoop/profile4
> 
> 
> 
> On Tue, Aug 24, 2010 at 10:49 AM, Sudhir Vallamkondu <Sudhir.Vallamkondu@icrossing.com> wrote:
>> > Looking at the codebase, it seems to suggest that it ignores an edit log
>> > storage directory if it encounters an error:
>> >
>> > http://www.google.com/codesearch/p?hl=en#GLh8vwsjDqs/trunk/src/hdfs/org/apache/hadoop/hdfs/server/namenode/FSEditLog.java&q=namenode%20editlog&sa=N&cd=20&ct=rc
>> >
>> > Check these lines:
>> > code at line 334
>> > comments: 387-390
>> > comments: 411-414
>> > comments: 433-436
>> >
>> > The processIOError method is called throughout the code whenever it
>> > encounters an IOException.
>> >
>> > A fatal error is only thrown if none of the storage directories is
>> > accessible (lines 394 and 420).
>> >
>> > - Sudhir
>> >
>> >
>> >
>> > On Aug 23, 2010, at 2:21 PM, "common-user-digest-help@hadoop.apache.org"
>> > <common-user-digest-help@hadoop.apache.org> wrote:
>> >
>>> >> From: Michael Segel <michael_segel@hotmail.com>
>>> >> Date: Mon, 23 Aug 2010 14:05:05 -0500
>>> >> To: <common-user@hadoop.apache.org>
>>> >> Subject: RE: what will happen if a backup name node folder becomes
>>> >> unaccessible?
>>> >>
>>> >>
>>> >> Ok...
>>> >>
>>> >> Now you have me confused.
>>> >> Everything we've seen says that writing to both a local disk and to an
>>> >> NFS mounted disk would be the best way to prevent a problem.
>>> >>
>>> >> Now you and Harsh J say that this could actually be problematic.
>>> >>
>>> >> Which is it?
>>> >> Is this now a defect that should be addressed, or should we just not
>>> >> use an NFS mounted drive?
>>> >>
>>> >> Thx
>>> >>
>>> >> -Mike
>>> >>
>>> >>
>>>> >>> Date: Mon, 23 Aug 2010 11:42:59 -0700
>>>> >>> From: licht_jiang@yahoo.com
>>>> >>> Subject: Re: what will happen if a backup name node folder becomes
>>>> >>> unaccessible?
>>>> >>> To: common-user@hadoop.apache.org
>>>> >>>
>>>> >>> This makes a good argument. Actually, after seeing the previous reply,
>>>> >>> I'm kind of convinced that I should go back to "syncing" the metadata
>>>> >>> to a backup location instead of using this feature, which, as David
>>>> >>> mentioned, introduces a second single point of failure to Hadoop and
>>>> >>> degrades its availability. BTW, we are using the Cloudera package
>>>> >>> hadoop-0.20.2+228. Can someone confirm whether a name node will shut
>>>> >>> down given that a backup folder listed in "dfs.name.dir" becomes
>>>> >>> unavailable in this version?
>>>> >>>
>>>> >>> Thanks,
>>>> >>>
>>>> >>> Michael
>>>> >>>
>>>> >>> --- On Sun, 8/22/10, David B. Ritch <david.ritch@gmail.com> wrote:
>>>> >>>
>>>> >>> From: David B. Ritch <david.ritch@gmail.com>
>>>> >>> Subject: Re: what will happen if a backup name node folder becomes
>>>> >>> unaccessible?
>>>> >>> To: common-user@hadoop.apache.org
>>>> >>> Date: Sunday, August 22, 2010, 11:34 PM
>>>> >>>
>>>> >>> Which version of Hadoop was this? The folks at Cloudera have assured
>>>> >>> me that the namenode in CDH2 will continue as long as one of the
>>>> >>> directories is still writable.
>>>> >>>
>>>> >>> It *does* seem a bit of a waste if an availability feature - the
>>>> >>> ability to write to multiple directories - actually reduces
>>>> >>> availability by providing an additional single point of failure.
>>>> >>>
>>>> >>> Thanks!
>>>> >>>
>>>> >>> dbr
>>>> >>>
>>>> >>> On 8/20/2010 5:27 PM, Harsh J wrote:
>>>>> >>>> Whee, lets try it out:
>>>>> >>>>
>>>>> >>>> Start with both paths available. ... Starts fine.
>>>>> >>>> Store some files. ... Works.
>>>>> >>>> rm -r the second path. ... Ouch.
>>>>> >>>> Store some more files. ... Still Works. [Cuz the SNN hasn't sent
>>>>> >>>> us stuff back yet]
>>>>> >>>> Wait for checkpoint to hit.
>>>>> >>>> And ...
>>>>> >>>> Boom!
>>>>> >>>>
>>>>> >>>> 2010-08-21 02:42:00,385 INFO
>>>>> >>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log
>>>>> >>>> from 127.0.0.1
>>>>> >>>> 2010-08-21 02:42:00,385 INFO
>>>>> >>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
>>>>> >>>> transactions: 37 Total time for transactions(ms): 6Number of
>>>>> >>>> transactions batched in Syncs: 0 Number of syncs: 26 SyncTimes(ms):
>>>>> >>>> 307 277
>>>>> >>>> 2010-08-21 02:42:00,439 FATAL
>>>>> >>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All
>>>>> >>>> storage directories are inaccessible.
>>>>> >>>> 2010-08-21 02:42:00,440 INFO
>>>>> >>>> org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
>>>>> >>>> /************************************************************
>>>>> >>>> SHUTDOWN_MSG: Shutting down NameNode at localhost/127.0.0.1
>>>>> >>>> ************************************************************/
>>>>> >>>>
>>>>> >>>> So yes, as Edward says - never let this happen!
>>>>> >>>>
>>>>> >>>> On Sat, Aug 21, 2010 at 2:26 AM, jiang licht <licht_jiang@yahoo.com> wrote:
>>>>>> >>>>> Using an NFS folder to back up the dfs meta information as follows,
>>>>>> >>>>>
>>>>>> >>>>> <property>
>>>>>> >>>>>         <name>dfs.name.dir</name>
>>>>>> >>>>>         <value>/hadoop/dfs/name,/hadoop-backup/dfs/name</value>
>>>>>> >>>>> </property>
>>>>>> >>>>>
>>>>>> >>>>> where /hadoop-backup is on a backup machine and mounted on the
>>>>>> >>>>> master node.
>>>>>> >>>>>
>>>>>> >>>>> I have a question: if somehow the backup folder becomes
>>>>>> >>>>> unavailable, will it freeze the master node? That is, will write
>>>>>> >>>>> operations simply hang on this condition on the master node? Or
>>>>>> >>>>> will the master node log the problem and continue to work?
>>>>>> >>>>>
>>>>>> >>>>> Thanks,
>>>>>> >>>>>
>>>>>> >>>>> Michael
>>>>>> >>>>>
>>>>>> >>>>>
>>>>>> >>>>>
>>>>> >>>>
>>>>> >>>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
>>>> >>>
> 
> The above steps were performed using Apache Hadoop 0.20.2, not Cloudera's
> version of it, if that helps.
> 
> -- 
> Harsh J
> www.harshj.com

