hadoop-hdfs-user mailing list archives

From mail list <louis.hust...@gmail.com>
Subject Re: Question about the QJM HA namenode
Date Fri, 05 Dec 2014 04:17:27 GMT
Hi, Ray

How can I tell from the log that the standby namenode has become active, finished its recovery work, and is ready to serve?
Is there some obvious marker in the namenode log?
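[Editor's note: a couple of ways to check, as a sketch. The direct route is `hdfs haadmin -getServiceState <nnid>`; the log message shown below is an assumption whose exact wording varies by Hadoop version.]

```shell
# Direct query (nn1 is a dfs.ha.namenodes ID from hdfs-site.xml;
# substitute your own). It prints "active" or "standby":
#   hdfs haadmin -getServiceState nn1
#
# In the namenode log itself, look for the active-transition message.
# Simulated excerpt (assumption: wording varies by version):
cat > /tmp/nn.log <<'EOF'
2014-12-03 15:55:51,300 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2014-12-03 15:55:51,313 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 6797
EOF
grep -c 'Starting services required for active state' /tmp/nn.log   # prints 1
```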


On Dec 5, 2014, at 9:55, mail list <louis.hust.ml@gmail.com> wrote:

> Thanks Ray, I will try this options.
> 
> On Dec 5, 2014, at 5:50, Ray Chiang <rchiang@cloudera.com> wrote:
> 
>> It looks like that's tied to the ipc.client.connect.* properties.  You can adjust
>> retries & timeout values to something shorter and see if that works for you.
>> 
>> Offhand, I'm not certain if that will affect other services besides HDFS.
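[Editor's note: those properties live in core-site.xml. A sketch with shorter values follows; the defaults in the comments are taken from core-default.xml and should be verified against your Hadoop version before you rely on them.]

```xml
<!-- core-site.xml: shorten IPC connect retries (sketch; tune for your cluster) -->
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>3</value>    <!-- default 10 -->
</property>
<property>
  <name>ipc.client.connect.retry.interval</name>
  <value>1000</value> <!-- ms between retries; default 1000 -->
</property>
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>5</value>    <!-- default 45 -->
</property>
```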
>> 
>> -Ray
>> 
>> 
>> On Wed, Dec 3, 2014 at 2:51 AM, mail list <louis.hust.ml@gmail.com> wrote:
>> hadoop-2.3.0-cdh5.1.0
>> 
>> hi, I moved the QJM from l-hbase1.dba.dev.cn0 to another machine, and the downtime
>> dropped to about 5 minutes; the log on l-hbase2.dba.dev.cn0 looks like this:
>> 
>> {log}
>> 2014-12-03 15:55:51,306 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer:
Loaded 197 edits starting from txid 6599
>> 2014-12-03 15:55:51,306 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager:
Marking all datandoes as stale
>> 2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
Reprocessing replication and invalidation queues
>> 2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
initializing replication queues
>> 2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
Will take over writing edit logs at txnid 6797
>> 2014-12-03 15:55:51,313 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting
log segment at 6797
>> 2014-12-03 15:55:51,373 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number
of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs:
0 Number of syncs: 0 SyncTimes(ms): 0 9
>> 2014-12-03 15:55:51,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Starting CacheReplicationMonitor with interval 30000 milliseconds
>> 2014-12-03 15:55:51,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Rescanning because of pending operations
>> 2014-12-03 15:55:51,678 INFO org.apache.hadoop.fs.TrashPolicyDefault: Namenode trash
configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
>> 2014-12-03 15:55:51,679 INFO org.apache.hadoop.fs.TrashPolicyDefault: The configured
checkpoint interval is 0 minutes. Using an interval of 1440 minutes that is used for deletion
instead
>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
Total number of blocks            = 179
>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
Number of invalid blocks          = 0
>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
Number of under-replicated blocks = 0
>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
Number of  over-replicated blocks = 0
>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager:
Number of blocks being written    = 4
>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.StateChange: STATE* Replication
Queue initialization scan for invalid, over- and under-replicated blocks completed in 386
msec
>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Scanned 0 directive(s) and 0 block(s) in 308 millisecond(s).
>> 2014-12-03 15:56:21,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Rescanning after 30000 milliseconds
>> 2014-12-03 15:56:21,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>> 2014-12-03 15:56:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Rescanning after 30001 milliseconds
>> 2014-12-03 15:56:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>> 2014-12-03 15:57:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Rescanning after 30000 milliseconds
>> 2014-12-03 15:57:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
>> 2014-12-03 15:57:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Rescanning after 30000 milliseconds
>> 2014-12-03 15:57:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>> 2014-12-03 15:58:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Rescanning after 30000 milliseconds
>> 2014-12-03 15:58:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
>> 2014-12-03 15:58:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Rescanning after 30000 milliseconds
>> 2014-12-03 15:58:51,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>> 2014-12-03 15:59:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Rescanning after 30001 milliseconds
>> 2014-12-03 15:59:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>> 2014-12-03 15:59:51,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Rescanning after 30000 milliseconds
>> 2014-12-03 15:59:51,388 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>> 2014-12-03 16:00:14,295 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock:
caught retry for allocation of a new block in /hbase/testnn/WALs/l-hbase3.dba.dev.cn0.qunar.com,60020,1417585992012/l-hbase3.dba.dev.cn0.qunar.com%2C60020%2C1417585992012.1417593301483.
Returning previously allocated block blk_1073743458_2634{blockUCState=UNDER_CONSTRUCTION,
primaryNodeIndex=-1, replicas=[]}
>> {log}
>> 
>> 
>> It seems that from 15:55:51 to 16:00:14 every entry is from org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor.
>> What is Hadoop doing during that gap, and how can I reduce it? 5 minutes is too long!
>> 
>> 
>> 
>> On Dec 3, 2014, at 16:31, Harsh J <harsh@cloudera.com> wrote:
>> 
>> > What is your Hadoop version?
>> >
>> > On Wed, Dec 3, 2014 at 12:55 PM, mail list <louis.hust.ml@gmail.com> wrote:
>> >> hi all,
>> >>
>> >> Attach log again!
>> >>
>> >> The failover happened at about time: 2014-12-03 12:01:
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Dec 3, 2014, at 14:55, mail list <louis.hust.ml@gmail.com> wrote:
>> >>
>> >>> Sorry forget the log, the failover time at about 2014-12-03 12:01:
>> >>>
>> >>> <hadoop-hadoop-namenode-l-hbase2.dba.dev.cn0.log.tar.gz>
>> >>> On Dec 3, 2014, at 14:48, mail list <louis.hust.ml@gmail.com>
wrote:
>> >>>
>> >>>> Hi all,
>> >>>>
>> >>>> I deploy the hadoop with 3 machines:
>> >>>>
>> >>>> l-hbase1.dba.dev.cn0 (namenode active and QJM)
>> >>>> l-hbase2.dba.dev.cn0 (namenode standby and datanode and QJM)
>> >>>> l-hbase3.dba.dev.cn0 (datanode and QJM)
>> >>>>
>> >>>> Above the hadoop, i deploy a hbase:
>> >>>> l-hbase1.dba.dev.cn0 (HMaster active)
>> >>>> l-hbase2.dba.dev.cn0 (HMaster standby)
>> >>>> l-hbase3.dba.dev.cn0 (RegionServer)
>> >>>>
>> >>>>
>> >>>> I wrote a program that puts one row into HBase every second in a loop.
>> >>>> Then I used iptables to simulate l-hbase1.dba.dev.cn0 going offline. After that,
>> >>>> the program hung and could not write to HBase. After about 15 minutes it could write again.
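[Editor's note: the thread doesn't show the exact iptables commands, so here is a sketch of the kind of rules that simulate such an outage. 10.86.36.217 is the address taken from the retry log later in the thread. The block only prints the rules, since applying them needs root and really will partition the host; DROP makes connections time out rather than be refused, which is what exercises the long retry path.]

```shell
# Print (not apply) DROP rules that would black-hole traffic
# to and from the "failed" namenode host.
target=10.86.36.217   # l-hbase1's address, from the retry log
for rule in \
  "iptables -A INPUT  -s $target -j DROP" \
  "iptables -A OUTPUT -d $target -j DROP"; do
  echo "$rule"
done
```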
>> >>>>
>> >>>> Fifteen minutes for the HA failover is far too long for me,
>> >>>> and I have no idea what causes it.
>> >>>>
>> >>>> Then I checked the l-hbase2.dba.dev.cn0 namenode logs and found many
>> >>>> retries like the one below:
>> >>>> {code}
>> >>>> 2014-12-03 12:13:35,165 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8485. Already tried 1 time(s); retry
policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
>> >>>> {code}
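[Editor's note: a rough sanity check of what that retry policy costs, as back-of-the-envelope arithmetic rather than anything stated in the thread: each unreachable IPC endpoint sleeps maxRetries × sleepTime between attempts, on top of the per-attempt connect timeout.]

```shell
# Sleep time alone under
# RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
max_retries=10
sleep_ms=1000
echo "$(( max_retries * sleep_ms / 1000 )) seconds of sleep per endpoint"
```

With several endpoints retried in sequence, these sleeps alone can add up to a noticeable fraction of the observed failover delay.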
>> >>>>
>> >>>> I have the QJM on l-hbase1.dba.dev.cn0, does it matter?
>> >>>>
>> >>>> I am a newbie; any ideas would be appreciated!
>> >>>
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Harsh J
>> 
>> 
> 

