hadoop-hdfs-user mailing list archives

From Ray Chiang <rchi...@cloudera.com>
Subject Re: Question about the QJM HA namenode
Date Thu, 04 Dec 2014 21:50:25 GMT
It looks like those retries are tied to the ipc.client.connect.* properties.
You can lower the retry counts and timeout values and see if that works
for you.

Offhand, I'm not certain if that will affect other services besides HDFS.
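
To get a feel for how these settings add up, here is a minimal Python sketch of the arithmetic. It is not Hadoop code: the function name and the model are illustrative assumptions. The model assumes every connection attempt blocks for the full connect timeout (as can happen when a firewall silently drops packets) and then sleeps for the retry interval. The example values 45 attempts and 20000 ms are what I recall as the defaults for ipc.client.connect.max.retries.on.timeouts and ipc.client.connect.timeout; please verify them against core-default.xml for your release.

```python
def worst_case_connect_seconds(attempts, connect_timeout_ms, retry_interval_ms=0):
    """Rough upper bound, in seconds, on time spent trying one unreachable server.

    Simplified model (an assumption, not Hadoop's actual code path): every
    attempt blocks for the full connect timeout, then sleeps for the retry
    interval before the next attempt.
    """
    return attempts * (connect_timeout_ms + retry_interval_ms) / 1000.0


# Values below are illustrative; check core-default.xml for your release.
print(worst_case_connect_seconds(45, 20000))      # 45 attempts x 20 s timeout
print(worst_case_connect_seconds(3, 5000, 1000))  # a shorter, hypothetical setting
```

Under this model the first call gives 900 seconds, i.e. about 15 minutes, which is in the same range as the outage reported below, though the thread does not confirm that this is the cause. The second call gives 18 seconds for the hypothetical shorter setting.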

-Ray


On Wed, Dec 3, 2014 at 2:51 AM, mail list <louis.hust.ml@gmail.com> wrote:

> hadoop-2.3.0-cdh5.1.0
>
> Hi, I moved the QJM from l-hbase1.dba.dev.cn0 to another machine, and the
> downtime dropped to 5 minutes. The log on l-hbase2.dba.dev.cn0 looks like
> this:
>
> {log}
> 2014-12-03 15:55:51,306 INFO
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Loaded 197 edits
> starting from txid 6599
> 2014-12-03 15:55:51,306 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all
> datandoes as stale
> 2014-12-03 15:55:51,307 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing
> replication and invalidation queues
> 2014-12-03 15:55:51,307 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: initializing
> replication queues
> 2014-12-03 15:55:51,307 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing
> edit logs at txnid 6797
> 2014-12-03 15:55:51,313 INFO
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at
> 6797
> 2014-12-03 15:55:51,373 INFO
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1
> Total time for transactions(ms): 0 Number of transactions batched in Syncs:
> 0 Number of syncs: 0 SyncTimes(ms): 0 9
> 2014-12-03 15:55:51,385 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Starting CacheReplicationMonitor with interval 30000 milliseconds
> 2014-12-03 15:55:51,385 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Rescanning because of pending operations
> 2014-12-03 15:55:51,678 INFO org.apache.hadoop.fs.TrashPolicyDefault:
> Namenode trash configuration: Deletion interval = 1440 minutes, Emptier
> interval = 0 minutes.
> 2014-12-03 15:55:51,679 INFO org.apache.hadoop.fs.TrashPolicyDefault: The
> configured checkpoint interval is 0 minutes. Using an interval of 1440
> minutes that is used for deletion instead
> 2014-12-03 15:55:51,693 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Total number of
> blocks            = 179
> 2014-12-03 15:55:51,693 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of
> invalid blocks          = 0
> 2014-12-03 15:55:51,693 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of
> under-replicated blocks = 0
> 2014-12-03 15:55:51,693 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of
> over-replicated blocks = 0
> 2014-12-03 15:55:51,693 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of
> blocks being written    = 4
> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.StateChange: STATE*
> Replication Queue initialization scan for invalid, over- and
> under-replicated blocks completed in 386 msec
> 2014-12-03 15:55:51,693 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Scanned 0 directive(s) and 0 block(s) in 308 millisecond(s).
> 2014-12-03 15:56:21,385 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Rescanning after 30000 milliseconds
> 2014-12-03 15:56:21,386 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
> 2014-12-03 15:56:51,386 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Rescanning after 30001 milliseconds
> 2014-12-03 15:56:51,386 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
> 2014-12-03 15:57:21,387 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Rescanning after 30000 milliseconds
> 2014-12-03 15:57:21,387 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
> 2014-12-03 15:57:51,386 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Rescanning after 30000 milliseconds
> 2014-12-03 15:57:51,386 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
> 2014-12-03 15:58:21,387 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Rescanning after 30000 milliseconds
> 2014-12-03 15:58:21,387 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
> 2014-12-03 15:58:51,386 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Rescanning after 30000 milliseconds
> 2014-12-03 15:58:51,387 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
> 2014-12-03 15:59:21,387 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Rescanning after 30001 milliseconds
> 2014-12-03 15:59:21,387 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
> 2014-12-03 15:59:51,387 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Rescanning after 30000 milliseconds
> 2014-12-03 15:59:51,388 INFO
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
> Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
> 2014-12-03 16:00:14,295 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> allocateBlock: caught retry for allocation of a new block in
> /hbase/testnn/WALs/l-hbase3.dba.dev.cn0.qunar.com,60020,1417585992012/
> l-hbase3.dba.dev.cn0.qunar.com%2C60020%2C1417585992012.1417593301483.
> Returning previously allocated block
> blk_1073743458_2634{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1,
> replicas=[]}
> {log}
>
>
> From 15:55:51 to 16:00:14 the log shows nothing but
> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor
> entries. What is Hadoop doing during that time? How can I shorten it?
> 5 minutes is still too long!
>
>
>
> On Dec 3, 2014, at 16:31, Harsh J <harsh@cloudera.com> wrote:
>
> > What is your Hadoop version?
> >
> > On Wed, Dec 3, 2014 at 12:55 PM, mail list <louis.hust.ml@gmail.com> wrote:
> >> hi all,
> >>
> >> Attaching the log again!
> >>
> >> The failover happened at about 2014-12-03 12:01.
> >>
> >>
> >>
> >>
> >>
> >> On Dec 3, 2014, at 14:55, mail list <louis.hust.ml@gmail.com> wrote:
> >>
> >>> Sorry, I forgot the log. The failover happened at about 2014-12-03 12:01:
> >>>
> >>> <hadoop-hadoop-namenode-l-hbase2.dba.dev.cn0.log.tar.gz>
> >>> On Dec 3, 2014, at 14:48, mail list <louis.hust.ml@gmail.com> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I deployed Hadoop on 3 machines:
> >>>>
> >>>> l-hbase1.dba.dev.cn0 (namenode active and QJM)
> >>>> l-hbase2.dba.dev.cn0 (namenode standby and datanode and QJM)
> >>>> l-hbase3.dba.dev.cn0 (datanode and QJM)
> >>>>
> >>>> On top of Hadoop, I deployed HBase:
> >>>> l-hbase1.dba.dev.cn0 (HMaster active)
> >>>> l-hbase2.dba.dev.cn0 (HMaster standby)
> >>>> l-hbase3.dba.dev.cn0 (RegionServer)
> >>>>
> >>>>
> >>>> I wrote a program that puts one row per second into HBase in a loop.
> >>>> Then I used iptables to simulate l-hbase1.dba.dev.cn0 going offline. After
> >>>> that, the program hung and could not write to HBase. After about 15 minutes,
> >>>> the program could write again.
> >>>>
> >>>> 15 minutes is too long for an HA failover,
> >>>> and I have no idea what causes it.
> >>>>
> >>>> Then I checked the namenode logs on l-hbase2.dba.dev.cn0 and found many
> >>>> retries like the one below:
> >>>> {code}
> >>>> 2014-12-03 12:13:35,165 INFO org.apache.hadoop.ipc.Client: Retrying
> >>>> connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8485. Already tried
> >>>> 1 time(s); retry policy is
> >>>> RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000
> >>>> MILLISECONDS)
> >>>> {code}
> >>>>
> >>>> The QJM also runs on l-hbase1.dba.dev.cn0. Does that matter?
> >>>>
> >>>> I am a newbie, so any ideas would be appreciated!
> >>>
> >>
> >>
> >
> >
> >
> > --
> > Harsh J
>
>
