hbase-user mailing list archives

From tsuna <tsuna...@gmail.com>
Subject Re: All RegionServers stuck on BadVersion from ZK after cluster restart
Date Wed, 27 Jan 2016 16:44:47 GMT
Some NN logs attached.

On Wed, Jan 27, 2016 at 8:42 AM, tsuna <tsunanet@gmail.com> wrote:
> On Wed, Jan 27, 2016 at 3:31 AM, Stack <stack@duboce.net> wrote:
>> Can you try what Samir suggests? You can replay the moved-aside logs using
>> WALPlayer after the cluster comes up.
>
> OK, I’ll shut down the HM and all the RSes, clear the data in ZK,
> restart everything, and see how things go.
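>
> Concretely, something like this (a rough sketch only; it assumes the
> zkCli bundled with HBase, the default /hbase parent znode, and uses
> /hbase-wals-aside as a made-up destination for the moved-aside WALs):
>
>   # With the HM and all RSes down, wipe HBase's state in ZK.
>   $ hbase zkcli rmr /hbase
>
>   # Move the -splitting WAL directories out of the way.
>   $ hdfs dfs -mkdir /hbase-wals-aside
>   $ hdfs dfs -mv '/hbase/WALs/*-splitting' /hbase-wals-aside/
>
>   # Once the cluster is back up, replay the edits as Samir suggested.
>   $ hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /hbase-wals-aside <tables>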
>
>> I see that at first we are making progress through the WALs ("... total
>> tasks = 7..." ... it was 9 in your master log) but maybe they are getting
>> stuck on particular WALs.  Has the stuck Master had the same task list
>> over the last 24 hours?
>
> It’s progressing veeerrryyyy slooowly.  I now have 34 regions online
> (that’s 3 more than 10h ago! :P)
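>
> (For the record, I'm tracking progress with the standard shell status
> command and by eyeballing the Master UI; nothing fancy:)
>
>   # Per-server region counts and regions-in-transition.
>   $ echo "status 'detailed'" | hbase shell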
>
>> In the RS log, I see you are stuck trying to recover 1453728374800
>>
>> We are trying to recover the lease on the file from the NN. Below looks
>> bad: over a minute in and still no lease recovery.
>>
>> 2016-01-27 05:53:10,338 INFO  [RS_LOG_REPLAY_OPS-r12s4:9104-0] util.FSHDFSUtils: recoverLease=false, attempt=2 on file=hdfs://0.namenode.sjc.aristanetworks.com:8820/hbase/WALs/r12s16.sjc.aristanetworks.com,9104,1452811286456-splitting/r12s16.sjc.aristanetworks.com%2C9104%2C1452811286456.default.1453728374800 after 68218ms
>>
>> NN showing us stuck on lease recovery?
>>
>> NN went down during power outage?
>
> Yes, everything went down.
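>
> One thing I can try, assuming our Hadoop is new enough to have the
> hdfs debug subcommand (2.7+), is forcing lease recovery by hand on the
> stuck file from the RS log above:
>
>   $ hdfs debug recoverLease -path /hbase/WALs/r12s16.sjc.aristanetworks.com,9104,1452811286456-splitting/r12s16.sjc.aristanetworks.com%2C9104%2C1452811286456.default.1453728374800 -retries 5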
>
> The NN regularly shows this:
>
> 16/01/27 16:33:39 INFO namenode.FSNamesystem: recoverLease: [Lease.  Holder: DFSClient_NONMAPREDUCE_174538359_1, pendingcreates: 2], src=/hbase/WALs/r12s1.sjc.aristanetworks.com,9104,1452811288618-splitting/r12s1.sjc.aristanetworks.com%2C9104%2C1452811288618.default.1453728791276 from client DFSClient_NONMAPREDUCE_174538359_1
> 16/01/27 16:33:39 INFO namenode.FSNamesystem: Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_174538359_1, pendingcreates: 2], src=/hbase/WALs/r12s1.sjc.aristanetworks.com,9104,1452811288618-splitting/r12s1.sjc.aristanetworks.com%2C9104%2C1452811288618.default.1453728791276
> 16/01/27 16:33:39 WARN BlockStateChange: BLOCK* BlockInfoUnderConstruction.initLeaseRecovery: No blocks found, lease removed.
> 16/01/27 16:33:39 WARN hdfs.StateChange: DIR* NameSystem.internalReleaseLease: File /hbase/WALs/r12s1.sjc.aristanetworks.com,9104,1452811288618-splitting/r12s1.sjc.aristanetworks.com%2C9104%2C1452811288618.default.1453728791276 has not been closed. Lease recovery is in progress. RecoveryId = 1122295 for block blk_1074847505_1108430{UCState=UNDER_RECOVERY, truncateBlock=null, primaryNodeIndex=-1, replicas=[]}
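>
> The replicas=[] above makes me think the NN can't find any DataNode
> still holding that last block. I'm double-checking with fsck on the
> splitting dirs (standard flags, nothing exotic):
>
>   $ hdfs fsck /hbase/WALs -openforwrite -files -blocks -locations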
>
> I should perhaps mention at this point that, as a result of the power
> outage, we lost a bunch of hard disk drives, which caused us to
> permanently lose approximately 0.3% of the blocks in HDFS.  I know
> that at least one of the WALs was affected by the data loss
> (/hbase/WALs/r12s12.sjc.aristanetworks.com,9104,1452811287315-splitting/r12s12.sjc.aristanetworks.com%2C9104%2C1452811287315.default.1453724924825).
> It might also be worth mentioning that 94% of the blocks we lost
> belong to one unlucky region.
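>
> (For anyone who wants to reproduce the accounting, the usual fsck flags
> show both the overall damage tally and which files the missing blocks
> belong to:)
>
>   # Overall summary, including missing/corrupt block counts.
>   $ hdfs fsck /
>
>   # Per-file list of corrupt/missing blocks.
>   $ hdfs fsck / -list-corruptfileblocks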
>
>> The BadVersion is probably because the task has been taken over by
>> another worker since lease recovery is taking too long.
>>
>> I see 1453728374800 is still in the list you paste from Master later in
>> this thread.
>>
>> Try moving it aside. You might have to restart the Master for it to
>> notice the move-aside... but try without restarting first.
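>
> For concreteness, the move-aside would be something like this (the
> destination directory is just an example):
>
>   $ hdfs dfs -mkdir -p /hbase-wals-aside
>   $ hdfs dfs -mv /hbase/WALs/r12s16.sjc.aristanetworks.com,9104,1452811286456-splitting/r12s16.sjc.aristanetworks.com%2C9104%2C1452811286456.default.1453728374800 /hbase-wals-aside/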
>
> --
> Benoit "tsuna" Sigoure



-- 
Benoit "tsuna" Sigoure
