hadoop-mapreduce-user mailing list archives

From: Ray Chiang <rchi...@cloudera.com>
Subject: Re: Question about the QJM HA namenode
Date: Fri, 05 Dec 2014 05:42:28 GMT
For ZooKeeper fencing, you should see messages from ZKFailoverController in
the log.  For any other fencing method, I'm not certain.  I'm not an HDFS
expert, but I saw something in your log that was easy to look up in the code.

-Ray
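A quick way to check the role without digging through logs is `hdfs haadmin -getServiceState <namenode-id>`, which reports whether a given namenode is currently active or standby. Also note that a ZKFailoverController only runs when automatic failover is enabled; a minimal configuration sketch of what that typically looks like (the ZooKeeper hostnames and fencing method below are illustrative placeholders, not taken from this cluster):

{code}
<!-- hdfs-site.xml (sketch): enable ZKFC-driven automatic failover -->
<property>
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- fencing is required for automatic failover; sshfence is one common choice -->
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>

<!-- core-site.xml (sketch): ZooKeeper quorum used by the ZKFCs; placeholder hosts -->
<property>
  <name>ha.zookeeper.quorum</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>
{code}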


On Thu, Dec 4, 2014 at 8:17 PM, mail list <louis.hust.ml@gmail.com> wrote:

> Hi ,Ray
>
> How can I tell from the log that the standby namenode has become active,
> finished its recovery work, and is ready to serve?
> Is there some obvious marker in the namenode log?
>
>
> On Dec 5, 2014, at 9:55, mail list <louis.hust.ml@gmail.com> wrote:
>
> Thanks Ray, I will try these options.
>
> On Dec 5, 2014, at 5:50, Ray Chiang <rchiang@cloudera.com> wrote:
>
> It looks like that's tied to the ipc.client.connect.* properties.  You can
> adjust retries & timeout values to something shorter and see if that works
> for you.
>
> Offhand, I'm not certain if that will affect other services besides HDFS.
>
> -Ray
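For reference, the retry policy that appears later in the thread, RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS), appears to match the defaults of these settings. A minimal core-site.xml sketch with shorter values; the numbers are purely illustrative, not recommendations:

{code}
<!-- core-site.xml (sketch): shorten IPC connect retries/timeouts; example values only -->
<property>
  <name>ipc.client.connect.max.retries</name>
  <value>3</value>
</property>
<property>
  <name>ipc.client.connect.max.retries.on.timeouts</name>
  <value>3</value>
</property>
<property>
  <name>ipc.client.connect.timeout</name>
  <value>5000</value>
</property>
{code}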
>
>
> On Wed, Dec 3, 2014 at 2:51 AM, mail list <louis.hust.ml@gmail.com> wrote:
>
>> hadoop-2.3.0-cdh5.1.0
>>
>> Hi, I moved the QJM from l-hbase1.dba.dev.cn0 to another machine, and the
>> downtime dropped to about 5 minutes. The log on l-hbase2.dba.dev.cn0 looks like this:
>>
>> {log}
>> 2014-12-03 15:55:51,306 INFO
>> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Loaded 197 edits
>> starting from txid 6599
>> 2014-12-03 15:55:51,306 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Marking all
>> datandoes as stale
>> 2014-12-03 15:55:51,307 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Reprocessing
>> replication and invalidation queues
>> 2014-12-03 15:55:51,307 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: initializing
>> replication queues
>> 2014-12-03 15:55:51,307 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Will take over writing
>> edit logs at txnid 6797
>> 2014-12-03 15:55:51,313 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at
>> 6797
>> 2014-12-03 15:55:51,373 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 1
>> Total time for transactions(ms): 0 Number of transactions batched in Syncs:
>> 0 Number of syncs: 0 SyncTimes(ms): 0 9
>> 2014-12-03 15:55:51,385 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Starting CacheReplicationMonitor with interval 30000 milliseconds
>> 2014-12-03 15:55:51,385 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Rescanning because of pending operations
>> 2014-12-03 15:55:51,678 INFO org.apache.hadoop.fs.TrashPolicyDefault:
>> Namenode trash configuration: Deletion interval = 1440 minutes, Emptier
>> interval = 0 minutes.
>> 2014-12-03 15:55:51,679 INFO org.apache.hadoop.fs.TrashPolicyDefault: The
>> configured checkpoint interval is 0 minutes. Using an interval of 1440
>> minutes that is used for deletion instead
>> 2014-12-03 15:55:51,693 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Total number of
>> blocks            = 179
>> 2014-12-03 15:55:51,693 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of
>> invalid blocks          = 0
>> 2014-12-03 15:55:51,693 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of
>> under-replicated blocks = 0
>> 2014-12-03 15:55:51,693 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of
>> over-replicated blocks = 0
>> 2014-12-03 15:55:51,693 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Number of
>> blocks being written    = 4
>> 2014-12-03 15:55:51,693 INFO org.apache.hadoop.hdfs.StateChange: STATE*
>> Replication Queue initialization scan for invalid, over- and
>> under-replicated blocks completed in 386 msec
>> 2014-12-03 15:55:51,693 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Scanned 0 directive(s) and 0 block(s) in 308 millisecond(s).
>> 2014-12-03 15:56:21,385 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Rescanning after 30000 milliseconds
>> 2014-12-03 15:56:21,386 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>> 2014-12-03 15:56:51,386 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Rescanning after 30001 milliseconds
>> 2014-12-03 15:56:51,386 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>> 2014-12-03 15:57:21,387 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Rescanning after 30000 milliseconds
>> 2014-12-03 15:57:21,387 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
>> 2014-12-03 15:57:51,386 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Rescanning after 30000 milliseconds
>> 2014-12-03 15:57:51,386 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>> 2014-12-03 15:58:21,387 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Rescanning after 30000 milliseconds
>> 2014-12-03 15:58:21,387 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
>> 2014-12-03 15:58:51,386 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Rescanning after 30000 milliseconds
>> 2014-12-03 15:58:51,387 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>> 2014-12-03 15:59:21,387 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Rescanning after 30001 milliseconds
>> 2014-12-03 15:59:21,387 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>> 2014-12-03 15:59:51,387 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Rescanning after 30000 milliseconds
>> 2014-12-03 15:59:51,388 INFO
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor:
>> Scanned 0 directive(s) and 0 block(s) in 0 millisecond(s).
>> 2014-12-03 16:00:14,295 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
>> allocateBlock: caught retry for allocation of a new block in
>> /hbase/testnn/WALs/l-hbase3.dba.dev.cn0.qunar.com,60020,1417585992012/
>> l-hbase3.dba.dev.cn0.qunar.com%2C60020%2C1417585992012.1417593301483.
>> Returning previously allocated block
>> blk_1073743458_2634{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1,
>> replicas=[]}
>> {log}
>>
>>
>> It seems that from 15:55:51 to 16:00:14 all the log entries come from
>> org.apache.hadoop.hdfs.server.blockmanagement.CacheReplicationMonitor.
>> What is Hadoop doing during that window, and how can I reduce it? 5 minutes is too long!
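One thing worth checking: in the log above, the standby finished taking over edit-log writing at 15:55:51, while the first client write only lands at 16:00:14, so part of the gap may be the DFS client's own failover retry/backoff rather than the namenode itself. If so, the dfs.client.failover.* settings are the relevant knobs; a minimal hdfs-site.xml sketch, values purely illustrative:

{code}
<!-- hdfs-site.xml (sketch): client-side failover retry/backoff; example values only -->
<property>
  <name>dfs.client.failover.max.attempts</name>
  <value>10</value>
</property>
<property>
  <name>dfs.client.failover.sleep.base.millis</name>
  <value>500</value>
</property>
<property>
  <name>dfs.client.failover.sleep.max.millis</name>
  <value>5000</value>
</property>
{code}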
>>
>>
>>
>> On Dec 3, 2014, at 16:31, Harsh J <harsh@cloudera.com> wrote:
>>
>> > What is your Hadoop version?
>> >
>> > On Wed, Dec 3, 2014 at 12:55 PM, mail list <louis.hust.ml@gmail.com> wrote:
>> >> hi all,
>> >>
>> >> Attach log again!
>> >>
>> >> The failover happened at about 2014-12-03 12:01:
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Dec 3, 2014, at 14:55, mail list <louis.hust.ml@gmail.com> wrote:
>> >>
>> >>> Sorry, I forgot the log; the failover happened at about 2014-12-03 12:01:
>> >>>
>> >>> <hadoop-hadoop-namenode-l-hbase2.dba.dev.cn0.log.tar.gz>
>> >>> On Dec 3, 2014, at 14:48, mail list <louis.hust.ml@gmail.com> wrote:
>> >>>
>> >>>> Hi all,
>> >>>>
>> >>>> I deployed Hadoop on 3 machines:
>> >>>>
>> >>>> l-hbase1.dba.dev.cn0 (namenode active and QJM)
>> >>>> l-hbase2.dba.dev.cn0 (namenode standby and datanode and QJM)
>> >>>> l-hbase3.dba.dev.cn0 (datanode and QJM)
>> >>>>
>> >>>> On top of Hadoop, I deployed HBase:
>> >>>> l-hbase1.dba.dev.cn0 (HMaster active)
>> >>>> l-hbase2.dba.dev.cn0 (HMaster standby)
>> >>>> l-hbase3.dba.dev.cn0 (RegionServer)
>> >>>>
>> >>>>
>> >>>> I wrote a program that puts one row into HBase every second in a loop.
>> >>>> Then I used iptables to simulate l-hbase1.dba.dev.cn0 going offline, and after
>> >>>> that the program hung and could not write to HBase. After about 15 minutes, the
>> >>>> program could write again.
>> >>>>
>> >>>> A 15-minute HA failover is too long for me,
>> >>>> and I have no idea about the reason.
>> >>>>
>> >>>> Then I checked the l-hbase2.dba.dev.cn0 namenode logs and found many
>> >>>> retries like the one below:
>> >>>> {code}
>> >>>> 2014-12-03 12:13:35,165 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: l-hbase1.dba.dev.cn0/10.86.36.217:8485. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
>> >>>> {code}
>> >>>>
>> >>>> I have a QJM (JournalNode) on l-hbase1.dba.dev.cn0; does that matter?
>> >>>>
>> >>>> I am a newbie; any ideas will be appreciated!
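On the JournalNode question: with QJM, the namenodes write to every JournalNode listed in dfs.namenode.shared.edits.dir and only need a majority of them to acknowledge, which is why the standby keeps retrying l-hbase1.dba.dev.cn0:8485 after it goes down without being blocked by it. A sketch of what that setting typically looks like for this three-node layout (the nameservice id "mycluster" is a placeholder):

{code}
<!-- hdfs-site.xml (sketch): QJM shared edits URI; "mycluster" is a placeholder nameservice id -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://l-hbase1.dba.dev.cn0:8485;l-hbase2.dba.dev.cn0:8485;l-hbase3.dba.dev.cn0:8485/mycluster</value>
</property>
{code}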
>> >>>
>> >>
>> >>
>> >
>> >
>> >
>> > --
>> > Harsh J
>>
>>
>
>
>
