hadoop-mapreduce-user mailing list archives

From mail list <louis.hust...@gmail.com>
Subject Re: Question about the QJM HA namenode
Date Fri, 05 Dec 2014 06:31:49 GMT
ZKFC makes the standby namenode active quickly (within about one minute).
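
(For reference, a minimal sketch of the settings that usually enable ZKFC-driven automatic failover. The ZooKeeper hosts and the 5000 ms timeout below are placeholder example values, not taken from this cluster; ha.zookeeper.session-timeout.ms bounds how quickly a ZKFC can notice that the active namenode's lock has gone away.)

{code}
<!-- hdfs-site.xml: let the ZKFCs perform automatic failover -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>

<!-- core-site.xml: ZooKeeper ensemble used by the ZKFCs (hosts are placeholders) -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>

<!-- core-site.xml: ZK session timeout; smaller values mean faster failure detection -->
<property>
  <name>ha.zookeeper.session-timeout.ms</name>
  <value>5000</value>
</property>
{code}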

On Dec 5, 2014, at 13:42, Ray Chiang <rchiang@cloudera.com> wrote:

> For ZooKeeper fencing, you should see messages from ZKFailoverController in the log. For any other fencing, I'm not certain. I'm not an HDFS expert, but I saw something in your log that was easy to look up in the code.
> 
> -Ray
> 
> 
> On Thu, Dec 4, 2014 at 8:17 PM, mail list <louis.hust.ml@gmail.com> wrote:
> Hi Ray,
> 
> How can I tell from the log that the standby namenode has become active, finished its recovery work, and can serve requests?
> Is there some obvious marker in the namenode log?
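
(Aside from the log, one quick way to see which namenode is active is the HA admin CLI; "nn1" and "nn2" below are placeholder namenode IDs from dfs.ha.namenodes.<nameservice>, not the IDs of this cluster.)

{code}
# Ask each namenode for its current HA state (prints "active" or "standby")
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
{code}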
> 
> 
> On Dec 5, 2014, at 9:55, mail list <louis.hust.ml@gmail.com> wrote:
> 
>> Thanks Ray, I will try these options.
>> 
>> On Dec 5, 2014, at 5:50, Ray Chiang <rchiang@cloudera.com> wrote:
>> 
>>> It looks like that's tied to the ipc.client.connect.* properties. You can adjust retries & timeout values to something shorter and see if that works for you.
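
(A sketch of the core-site.xml knobs that statement refers to; the values shown are just examples of shortening the connect/retry window, and the exact names and defaults should be checked against the core-default.xml shipped with your CDH version.)

{code}
<!-- core-site.xml: how many times the IPC client retries a failed connect
     and how long it sleeps between attempts (example values, shorter than the defaults) -->
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>3</value>
</property>
<property>
  <name>ipc.client.connect.retry.interval</name>
  <value>1000</value>
</property>
<!-- retries used when a connect attempt times out, and the per-attempt timeout in ms -->
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>5</value>
</property>
<property>
  <name>ipc.client.connect.timeout</name>
  <value>10000</value>
</property>
{code}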
>>> 
>>> Offhand, I'm not certain if that will affect other services besides HDFS.
>>> 
>>> -Ray
>>> 
>>> 
>>> On Wed, Dec 3, 2014 at 2:51 AM, mail list <louis.hust.ml@gmail.com> wrote:
>>> hadoop-2.3.0-cdh5.1.0
>>> 
>>> Hi, I moved the QJM from l-hbase1.dba.dev.cn0 to another machine, and the downtime was reduced to 5 minutes. The log on l-hbase2.dba.dev.cn0 looks like this:
>>> 
>>> {log}
>>> 2014-12-03 15:55:51,306 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Loaded 197 edits starting from txid 6599
>>> 2014-12-03 15:55:51,306 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all datandoes as stale
>>> 2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication and invalidation queues
>>> 2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: initializing replication queues
>>> 2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing edit logs at txnid 6797
>>> 2014-12-03 15:55:51,313 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 6797
>>> 2014-12-03 15:55:51,373 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 9
>>> 2014-12-03 15:55:51,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Starting CacheReplicationMonitor with interval 30000 milliseconds
>>> 2014-12-03 15:55:51,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning because of pending operations
>>> 2014-12-03 15:55:51,678 INFO org.apache.hadoop.fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
>>> 2014-12-03 15:55:51,679 INFO org.apache.hadoop.fs.TrashPolicyDefault: The configured checkpoint interval is 0 minutes. Using an interval of 1440 minutes that is used for deletion instead
>>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Total number of blocks            = 179
>>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of invalid blocks          = 0
>>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of under-replicated blocks = 0
>>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of  over-replicated blocks = 0
>>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of blocks being written    = 4
>>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.StateChange: STATE* Replication Queue initialization scan for invalid, over- and under-replicated blocks completed in 386 msec
>>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 308 millisecond(s).
>>> 2014-12-03 15:56:21,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
>>> 2014-12-03 15:56:21,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>>> 2014-12-03 15:56:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30001 milliseconds
>>> 2014-12-03 15:56:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>>> 2014-12-03 15:57:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
>>> 2014-12-03 15:57:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
>>> 2014-12-03 15:57:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
>>> 2014-12-03 15:57:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>>> 2014-12-03 15:58:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
>>> 2014-12-03 15:58:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
>>> 2014-12-03 15:58:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
>>> 2014-12-03 15:58:51,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>>> 2014-12-03 15:59:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30001 milliseconds
>>> 2014-12-03 15:59:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>>> 2014-12-03 15:59:51,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
>>> 2014-12-03 15:59:51,388 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>>> 2014-12-03 16:00:14,295 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: caught retry for allocation of a new block in /hbase/testnn/WALs/l-hbase3.dba.dev.cn0.qunar.com,60020,1417585992012/l-hbase3.dba.dev.cn0.qunar.com%2C60020%2C1417585992012.1417593301483. Returning previously allocated block blk_1073743458_2634{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[]}
>>> {log}
>>> 
>>> 
>>> It seems that from 15:55:51 to 16:00:14 all the log shows is org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor.
>>> What is Hadoop doing during that time, and how can I reduce it? Five minutes is too long!
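
(The CacheReplicationMonitor lines are only the periodic 30-second cache-directive rescans, not recovery work, so they do not by themselves explain the gap. From the application side, how long an HDFS client such as HBase keeps backing off while it hunts for the active namenode is governed by the dfs.client.failover.* settings; a sketch with example values is below, and whether these settings dominated this particular 5-minute gap cannot be told from the excerpt alone.)

{code}
<!-- hdfs-site.xml (client side): retry and backoff used while searching for the active
     namenode. Example values; check hdfs-default.xml for your version before changing. -->
<property>
  <name>dfs.client.failover.max.attempts</name>
  <value>15</value>
</property>
<property>
  <name>dfs.client.failover.sleep.base.millis</name>
  <value>500</value>
</property>
<property>
  <name>dfs.client.failover.sleep.max.millis</name>
  <value>15000</value>
</property>
{code}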
>>> 
>>> 
>>> 
>>> On Dec 3, 2014, at 16:31, Harsh J <harsh@cloudera.com> wrote:
>>> 
>>> > What is your Hadoop version?
>>> >
>>> > On Wed, Dec 3, 2014 at 12:55 PM, mail list <louis.hust.ml@gmail.com> wrote:
>>> >> Hi all,
>>> >>
>>> >> Attaching the log again!
>>> >>
>>> >> The failover happened at about 2014-12-03 12:01:
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Dec 3, 2014, at 14:55, mail list <louis.hust.ml@gmail.com> wrote:
>>> >>
>>> >>> Sorry, I forgot the log; the failover happened at about 2014-12-03 12:01:
>>> >>>
>>> >>> <hadoop-hadoop-namenode-l-hbase2.dba.dev.cn0.log.tar.gz>
>>> >>> On Dec 3, 2014, at 14:48, mail list <louis.hust.ml@gmail.com> wrote:
>>> >>>
>>> >>>> Hi all,
>>> >>>>
>>> >>>> I deployed Hadoop on 3 machines:
>>> >>>>
>>> >>>> l-hbase1.dba.dev.cn0 (namenode active and QJM)
>>> >>>> l-hbase2.dba.dev.cn0 (namenode standby and datanode and QJM)
>>> >>>> l-hbase3.dba.dev.cn0 (datanode and QJM)
>>> >>>>
>>> >>>> On top of Hadoop, I deployed HBase:
>>> >>>> l-hbase1.dba.dev.cn0 (HMaster active)
>>> >>>> l-hbase2.dba.dev.cn0 (HMaster standby)
>>> >>>> l-hbase3.dba.dev.cn0 (RegionServer)
>>> >>>>
>>> >>>>
>>> >>>> I wrote a program which puts one row into HBase every second, in a loop.
>>> >>>> Then I used iptables to simulate l-hbase1.dba.dev.cn0 going offline, and after that the program hung and could not
>>> >>>> write to HBase. After about 15 minutes, the program could write again.
>>> >>>>
>>> >>>> Fifteen minutes for the HA failover is too long for me,
>>> >>>> and I have no idea what the reason is.
>>> >>>>
>>> >>>> Then I checked the l-hbase2.dba.dev.cn0 namenode logs and found many retries like the one below:
>>> >>>> {code}
>>> >>>> 2014-12-03 12:13:35,165 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
>>> >>>> {code}
>>> >>>>
>>> >>>> I have the QJM on l-hbase1.dba.dev.cn0; does that matter?
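
(For context: with three JournalNodes, QJM only needs a majority, 2 of 3, to keep accepting edits, so losing the JournalNode on l-hbase1 by itself should not block the edit log. A typical shared-edits setting for this layout would look roughly like the sketch below; "mycluster" is a placeholder nameservice ID, not necessarily yours.)

{code}
<!-- hdfs-site.xml: shared edits directory spanning the three JournalNodes -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://l-hbase1.dba.dev.cn0:8485;l-hbase2.dba.dev.cn0:8485;l-hbase3.dba.dev.cn0:8485/mycluster</value>
</property>
{code}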
>>> >>>>
>>> >>>> I am a newbie; any ideas will be appreciated!!
>>> >>>
>>> >>
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Harsh J
>>> 
>>> 
>> 
> 
> 

