hbase-user mailing list archives

From jackie macmillian <jackie.macmill...@gmail.com>
Subject Ghost Regions Problem
Date Tue, 04 Aug 2020 11:46:54 GMT
Hi all,

We have a cluster with HBase 2.2.0 installed on Hadoop 2.9.2.
A few weeks ago we had issues with our active/standby NameNode election, caused by network problems and the two ZKFC services competing to elect the active NameNode. As a result, both of our NameNodes became active for a short time and all RegionServer services restarted themselves. We managed to solve that issue by tuning some timeout parameters, but the story began afterwards.
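For context, this is roughly how we verify the HA state from the command line; `nn1`/`nn2` are hypothetical service IDs from our `dfs.ha.namenodes.<nameservice>` setting, and the property names below are just the kind of timeouts we tuned, not an exact record:

```shell
# Ask the ZKFC-managed HA framework which NameNode is active
# (nn1/nn2 are placeholder service IDs for this sketch)
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Timeout parameters of this kind live in core-site.xml, e.g.:
#   ha.zookeeper.session-timeout.ms   (ZKFC ZooKeeper session timeout)
#   ha.health-monitor.rpc-timeout.ms  (ZKFC health-check RPC timeout)
```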
After the RegionServers completed their restarts, we saw that all our HBase tables had become unstable. Take, for example, a table with 200 regions: 196 of its regions came online, but 4 regions were stuck in an intermediate state like CLOSING/OPENING. In the end the tables themselves got stuck in DISABLING/ENABLING states. On top of that, HBase was holding lots of procedure locks and the MasterProcWALs directory kept growing.
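For anyone reproducing this, the stuck procedures and their locks can be listed from the HBase 2.x shell; a minimal sketch:

```shell
# List outstanding master procedures and their states
echo "list_procedures" | hbase shell -n

# List the procedure locks currently held
echo "list_locks" | hbase shell -n
```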
To overcome that, I used HBCK2 to release the stuck regions, and once I managed to enable a table, I created an empty copy of it from its descriptor and bulk loaded all the HFiles of the corrupt table into the new one. At this point you might ask why I did not keep using the re-enabled table. I couldn't, because although I was able to bypass the locked procedures, there were far too many of them to resolve one by one. If you use HBCK2 to bypass those locks but otherwise leave them as they are, it is only a cosmetic move; the regions never really come online. So I figured it would be much faster to create a brand new table and load all the data into it. The bulk load was successful and the new table became online and scannable. The next step was to disable the old table and drop it. But since the HMaster was busy with lots of locks and procedures, I was not able to disable it; some regions remained in DISABLING state again. So I decided to set the table's state to DISABLED with HBCK2, and then I succeeded in dropping it.
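The sequence above looked roughly like this; the hbck2 jar path, the table name `mytable`, the PIDs, and the HFile directory are placeholders for this sketch, not the actual values from our cluster:

```shell
# Bypass stuck procedures (PIDs taken from list_procedures output);
# -o overrides held locks, -r also bypasses child procedures
hbase hbck -j hbase-hbck2.jar bypass -o -r <pid1> <pid2>

# Force the old table's state so it can eventually be dropped
hbase hbck -j hbase-hbck2.jar setTableState mytable DISABLED

# Bulk load the corrupt table's HFiles into the fresh copy
# (the HFile path below is a placeholder)
hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles \
    /path/to/mytable_hfiles mytable_new
```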
After I had brought all my tables online and dropped all the old tables successfully, the MasterProcWALs directory was the last stop on the way to a clean HBase, I thought :) I moved the MasterProcWALs directory aside and restarted the active master. The new master took control and voila! The master procedures and locks were cleared, and all my tables were online as needed. I scanned the hbase:meta table and saw no regions other than the ones that were online.
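The MasterProcWALs move was roughly the following; `/hbase` as the HBase root directory is an assumption for this sketch:

```shell
# Stop the active master, move the procedure WALs aside, restart
hbase-daemon.sh stop master
hdfs dfs -mv /hbase/MasterProcWALs /hbase/MasterProcWALs.bak
hbase-daemon.sh start master
```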
Until now, that is. Remember those regions that were stuck and had to be forced closed so the tables could be disabled and dropped? Now, whenever a RegionServer crashes and restarts for some reason, the master tries to assign those regions to RegionServers, but the RegionServers decline the assignment because there is no table descriptor for those regions. Take a look at HBASE-22780 <https://issues.apache.org/jira/browse/HBASE-22780>; exactly the same problem is reported there.
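This is how I check whether a ghost region is still present in hbase:meta; `mytable` is a hypothetical table name for the sketch:

```shell
# Rows in hbase:meta are keyed as <table>,<startkey>,<timestamp>.<encoded>.,
# so a prefix scan on "<table>," shows every region recorded for that table
echo "scan 'hbase:meta', {ROWPREFIXFILTER => 'mytable,'}" | hbase shell -n
```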
I tried creating a one-region table with the same name as the old table. It succeeded, and the ghost region followed that table. I then disabled and dropped it again successfully, and again confirmed that hbase:meta no longer holds that region. But after a RegionServer crash it reappears from nowhere. So I figured out that when a RegionServer goes down, the HMaster does not read the hbase:meta table to assign that server's regions to other servers. I have read that the master processes keep an in-memory representation of the hbase:meta table in order to perform assignments as fast as possible. I can clean hbase:meta of those ghost regions as described, but then I have to force the masters to load this clean copy of hbase:meta into their in-memory representation. How can I achieve that? Assume I have cleaned the meta table; now what? A rolling restart of the HMasters? Do the standby masters share the same in-memory meta table with the active one? If that is the case, I think a rolling restart would not solve the problem. Or should I shut all the masters down and then start them again, to force them to rebuild their in-memory state from the meta table?
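In case it helps frame the question, the "shut all masters down" option I am considering would look roughly like this; the hostnames are placeholders:

```shell
# Stop every master (active and standbys) so that none keeps a stale
# in-memory assignment state, then start them all again
for h in master1 master2 master3; do
  ssh "$h" "hbase-daemon.sh stop master"
done
for h in master1 master2 master3; do
  ssh "$h" "hbase-daemon.sh start master"
done
```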
Any help would be appreciated.
Thank you for your patience :)

