From: mail list
To: user@hadoop.apache.org
Subject: Re: Question about the QJM HA namenode
Date: Wed, 3 Dec 2014 18:51:10 +0800

My Hadoop version is hadoop-2.3.0-cdh5.1.0.

Hi,

I moved the QJM from l-hbase1.dba.dev.cn0 to another machine, and the downtime dropped to 5 minutes. The log on l-hbase2.dba.dev.cn0 looks like this:

{log}
2014-12-03 15:55:51,306 INFO org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Loaded 197 edits starting from txid 6599
2014-12-03 15:55:51,306 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all datandoes as stale
2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing replication and invalidation queues
2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: initializing replication queues
2014-12-03 15:55:51,307 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing edit logs at txnid 6797
2014-12-03 15:55:51,313 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 6797
2014-12-03 15:55:51,373 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0 9
2014-12-03 15:55:51,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Starting CacheReplicationMonitor with interval 30000 milliseconds
2014-12-03 15:55:51,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning because of pending operations
2014-12-03 15:55:51,678 INFO org.apache.hadoop.fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 1440 minutes, Emptier interval = 0 minutes.
2014-12-03 15:55:51,679 INFO org.apache.hadoop.fs.TrashPolicyDefault: The configured checkpoint interval is 0 minutes. Using an interval of 1440 minutes that is used for deletion instead
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Total number of blocks = 179
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of invalid blocks = 0
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of under-replicated blocks = 0
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of over-replicated blocks = 0
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of blocks being written = 4
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.StateChange: STATE* Replication Queue initialization scan for invalid, over- and under-replicated blocks completed in 386 msec
2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 308 millisecond(s).
2014-12-03 15:56:21,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2014-12-03 15:56:21,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2014-12-03 15:56:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30001 milliseconds
2014-12-03 15:56:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2014-12-03 15:57:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2014-12-03 15:57:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2014-12-03 15:57:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2014-12-03 15:57:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2014-12-03 15:58:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2014-12-03 15:58:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2014-12-03 15:58:51,386 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2014-12-03 15:58:51,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2014-12-03 15:59:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30001 milliseconds
2014-12-03 15:59:21,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2014-12-03 15:59:51,387 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds
2014-12-03 15:59:51,388 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
2014-12-03 16:00:14,295 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: caught retry for allocation of a new block in /hbase/testnn/WALs/l-hbase3.dba.dev.cn0.qunar.com,60020,1417585992012/l-hbase3.dba.dev.cn0.qunar.com%2C60020%2C1417585992012.1417593301483. Returning previously allocated block blk_1073743458_2634{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[]}
{log}

It seems that from 15:55:51 to 16:00:14 the log shows nothing but
org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor entries.
What is Hadoop doing during that time? How can I reduce it? 5 minutes is too long!

On Dec 3, 2014, at 16:31, Harsh J wrote:

> What is your Hadoop version?
>
> On Wed, Dec 3, 2014 at 12:55 PM, mail list wrote:
>> hi all,
>>
>> Attach log again!
>>
>> The failover happened at about time: 2014-12-03 12:01:
>>
>> On Dec 3, 2014, at 14:55, mail list wrote:
>>
>>> Sorry forget the log, the failover time at about 2014-12-03 12:01:
>>>
>>> On Dec 3, 2014, at 14:48, mail list wrote:
>>>
>>>> Hi all,
>>>>
>>>> I deploy the hadoop with 3 machines:
>>>>
>>>> l-hbase1.dba.dev.cn0 (namenode active and QJM)
>>>> l-hbase2.dba.dev.cn0 (namenode standby and datanode and QJM)
>>>> l-hbase3.dba.dev.cn0 (datanode and QJM)
>>>>
>>>> Above the hadoop, i deploy a hbase:
>>>> l-hbase1.dba.dev.cn0 (HMaster active)
>>>> l-hbase2.dba.dev.cn0 (HMaster standby)
>>>> l-hbase3.dba.dev.cn0 (RegionServer)
>>>>
>>>> I write a program which put data into hbase one row every seconds in a loop.
>>>> Then I use iptables to simulate l-hbase1.dba.dev.cn0 offline, and after that , the program hang and can not
>>>> write to hbase. After about 15 mins, the program can write again.
>>>>
>>>> The time 15mins for the HA failover is too long for me!
>>>> And I’ve no idea about the reason.
>>>>
>>>> Then I check the l-hbase2.dba.dev.cn0 namenode logs, and find many retry like below:
>>>> {code}
>>>> 2014-12-03 12:13:35,165 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
>>>> {code}
>>>>
>>>> I have the QJM on l-hbase1.dba.dev.cn0, does it matter?
>>>>
>>>> I am a newbie, Any idea will be appreciated!!
>>>
>>
>
> --
> Harsh J
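
P.S. To put a number on the quiet period in the log at the top of this mail, here is a minimal Python sketch. It assumes every namenode log entry starts with a "yyyy-MM-dd HH:mm:ss,SSS" timestamp, as in the excerpt; the helper name largest_gap and the shortened sample lines are only for illustration and are not part of Hadoop. It skips the CacheReplicationMonitor entries and reports the longest pause between the entries that remain.

```python
from datetime import datetime

# Timestamp layout assumed from the log excerpt: "2014-12-03 15:55:51,693"
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # number of characters taken up by the timestamp


def largest_gap(lines, ignore="CacheReplicationMonitor"):
    """Return (seconds, start, end) for the longest pause between
    consecutive log entries, skipping entries that mention `ignore`."""
    stamps = []
    for line in lines:
        if ignore in line:
            continue  # periodic housekeeping entry, not of interest here
        try:
            stamps.append(datetime.strptime(line[:TS_LENGTH], TS_FORMAT))
        except ValueError:
            continue  # line without a leading timestamp (e.g. a wrapped line)
    best = (0.0, None, None)
    for prev, cur in zip(stamps, stamps[1:]):
        gap = (cur - prev).total_seconds()
        if gap > best[0]:
            best = (gap, prev, cur)
    return best


# Entries shortened from the excerpt above; "..." marks text cut for brevity.
sample = [
    "2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.StateChange: STATE* Replication Queue initialization scan ...",
    "2014-12-03 15:56:21,385 INFO org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor: Rescanning after 30000 milliseconds",
    "2014-12-03 16:00:14,295 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: caught retry ...",
]

seconds, start, end = largest_gap(sample)
print(f"longest pause: {seconds:.3f} s, from {start} to {end}")
```

On the shortened sample this reports the pause between the 15:55:51 and 16:00:14 entries, a little under four and a half minutes. To run it against a real log file, pass the open file object instead of the sample list.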