hbase-user mailing list archives

From jeff saremi <jeffsar...@hotmail.com>
Subject Re: What is Dead Region Servers and how to clear them up?
Date Sat, 27 May 2017 17:58:30 GMT
Thanks @Yu Li

You are absolutely correct. Dead RSs will happen regardless. My issue with this is more "psychological".
If I have done everything needed to ensure that the RSs are running fine, regions are assigned,
and hbck reports are consistent, then how is this list of dead region servers helping me,
other than causing anxiety?
We run our cluster on Yarn, and upon restarting jobs in Yarn we get a lot of inconsistent,
unavailable regions (and this is only one scenario). Then we run hbck with the -repair option
(and I was wrong here too: hbck does take care of some issues) and restart the master(s).
After that there seem to be no more issues, other than dead region servers still being reported.
We should not have this list anymore after having taken all precautions to reset the system properly.

I was trying to write something similar to what hbck would do to take care of this specific
issue. I wouldn't mind contributing to hbck itself either. However, I needed to understand
where this list comes from and why. These are things that I could possibly automate (after
all the other steps I mentioned):
- check the ZK list of RSs. If any of the dead RSs are found there, remove the node

- check the WALs folder under the HBase root in hdfs. If there are any folders with a dead RS's
name in them, delete them. (Here we need to take precautions, as @Enis mentioned; possibly only
delete if the folder's timestamp has not changed in a while.)

- what else? These steps are not enough
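
A minimal sketch of the WAL folder check above, in Python. It does not talk to HDFS or ZK: the caller supplies the directory listing. The names WalDir, cleanup_candidates and min_idle_s are hypothetical, and the sketch assumes that a folder is named after its server plus a "-splitting" suffix. It only reports candidates, applying the two precautions from this thread (only files with ".meta." in their names, and a folder that has not changed in a while); it deletes nothing.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

SPLITTING_SUFFIX = "-splitting"


@dataclass(frozen=True)
class WalDir:
    """One directory under <hbase_hdfs>/WALs, as listed by the caller (hypothetical type)."""
    name: str                # directory name, e.g. "<server>-splitting"
    files: Tuple[str, ...]   # file names inside the directory
    mtime_epoch_s: float     # last modification time, seconds since the epoch


def cleanup_candidates(wal_dirs: Iterable[WalDir],
                       dead_servers: Iterable[str],
                       now_epoch_s: float,
                       min_idle_s: float = 24 * 3600) -> List[str]:
    """Return names of "-splitting" directories that look safe to review for removal."""
    dead = set(dead_servers)
    candidates = []
    for d in wal_dirs:
        # Only directories left behind by a split are of interest.
        if not d.name.endswith(SPLITTING_SUFFIX):
            continue
        # Assumption: the part before the suffix is the server name.
        server = d.name[:-len(SPLITTING_SUFFIX)]
        if server not in dead:
            continue
        # Precaution 1: every file name must contain ".meta."
        # (an empty directory passes this check vacuously).
        if not all(".meta." in f for f in d.files):
            continue
        # Precaution 2: the directory must have been idle for a while.
        if now_epoch_s - d.mtime_epoch_s < min_idle_s:
            continue
        candidates.append(d.name)
    return candidates
```

The idle threshold is a guess on my part; whether any given value is safe depends on the cluster, and the absence of regions in transition should still be confirmed before anything is moved.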

For instance, we currently have 17 servers being reported as dead. Only 3-4 of them show up
in hdfs with "-splitting" in their WALS folder. Where do the rest come from?
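
To narrow this down, one could first separate the dead servers that have a "-splitting" folder from those that have none. A small Python sketch follows; the function name split_by_wal_presence is hypothetical, both inputs are plain lists of strings supplied by the caller, and the sketch assumes the folder name is the server name plus "-splitting".

```python
from typing import Iterable, List, Tuple


def split_by_wal_presence(dead_servers: Iterable[str],
                          wal_dir_names: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition dead servers into (has a -splitting folder, has none)."""
    suffix = "-splitting"
    # Server names recovered from the folder names that end in "-splitting".
    with_dir = {n[:-len(suffix)] for n in wal_dir_names if n.endswith(suffix)}
    unique_dead = set(dead_servers)
    found = sorted(s for s in unique_dead if s in with_dir)
    missing = sorted(s for s in unique_dead if s not in with_dir)
    return found, missing
```

The second list is the part that the WALs folder does not explain.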
thanks

Jeff

________________________________
From: Yu Li <carp84@gmail.com>
Sent: Friday, May 26, 2017 10:18:09 PM
To: Hbase-User
Cc: dev@hbase.apache.org
Subject: Re: What is Dead Region Servers and how to clear them up?

bq. And having a list of "dead" servers is not a healthy thing to have.
I don't think the existence of "dead" servers means the service is
unhealthy, especially in a distributed system. Besides hbase, HDFS also
shows Live and Dead nodes in namenode UI, and people won't regard HDFS as
unhealthy if there're dead nodes.

In HBase, if some RS aborts due to an unexpected issue like a long GC pause,
normally we will restart it, and once it has restarted and reported to the
master, it will be removed from the dead server list. So when we observe a
dead server in the Master UI, the first thing to do is check the root cause
and restart it if that won't cause further issues.

However, sometimes we may find the server aborted due to some hardware
failure and we must take the server offline for repair. Or we need to move
some nodes to join other clusters, so we stop the RS process on purpose. I
guess this is the case you're dealing with @jeff? If so, I think it's a
reasonable requirement that we supply a command in hbase to clear the dead
nodes when the operator is sure they are no longer serving.

Best Regards,
Yu

On 27 May 2017 at 04:49, Enis Söztutar <enis.soz@gmail.com> wrote:

> In general if there are no regions in transition, the WAL recovery has
> already finished. You can watch the master's log4j log for those entries,
> but the lack of regions in transition is the easiest way to identify.
>
> Enis
>
> On Fri, May 26, 2017 at 12:14 PM, jeff saremi <jeffsaremi@hotmail.com>
> wrote:
>
> > thanks Enis
> >
> > I apologize for earlier
> >
> > This looks very close to our issue
> > When you say: "there is no "WAL" recovery is happening", how could i make
> > sure of that? Thanks
> >
> > Jeff
> >
> >
> > ________________________________
> > From: Enis Söztutar <enis.soz@gmail.com>
> > Sent: Friday, May 26, 2017 11:47:11 AM
> > To: dev@hbase.apache.org
> > Cc: hbase-user
> > Subject: Re: What is Dead Region Servers and how to clear them up?
> >
> > Jeff, please be respectful to the people who are trying to help you. This is
> > not acceptable behavior and will result in consequences next time.
> >
> > On the specific issue that you are seeing, it is highly likely that you are
> > seeing this: https://issues.apache.org/jira/browse/HBASE-14223. Having
> > those servers in the dead servers list will not hurt operations, or
> > runtimes or anything else. Possibly for those servers, there is no new
> > instance of the regionserver running on the same host and ports.
> >
> > If you want to manually clean out these, you can follow these steps:
> >  - Manually move these directories from the file system:
> > <hbase_hdfs>/WALs/dead-server-splitting
> >  - ONLY do this if you are sure that there is no "WAL" recovery is
> > happening, and there is only WAL files with names containing ".meta."
> >  - Restart HBase master.
> >
> > Upon restart, you can see that these do not show up anymore. For more
> > technical details, please refer to the jira link.
> >
> > Enis
> >
> > On Fri, May 26, 2017 at 11:03 AM, jeff saremi <jeffsaremi@hotmail.com>
> > wrote:
> >
> > > Thank you for the GFY answer
> > >
> > > And i guess to figure out how to fix these I can always go through the
> > > HBase source code.
> > >
> > >
> > > ________________________________
> > > From: Dima Spivak <dimaspivak@apache.org>
> > > Sent: Friday, May 26, 2017 9:58:00 AM
> > > To: hbase-user
> > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > >
> > > Sending this back to the user mailing list.
> > >
> > > RegionServers can die for many reasons. Looking at your RegionServer log
> > > files should give hints as to why it's happening.
> > >
> > >
> > > -Dima
> > >
> > > On Fri, May 26, 2017 at 9:48 AM, jeff saremi <jeffsaremi@hotmail.com>
> > > wrote:
> > >
> > > > I had posted this to the user mailing list and I have not got any direct
> > > > answer to my question.
> > > >
> > > > Where do dead RS's come from and how can they be cleaned up? Someone in
> > > > the midst of developers should know this.
> > > >
> > > > thanks
> > > >
> > > > Jeff
> > > >
> > > > ________________________________
> > > > From: jeff saremi <jeffsaremi@hotmail.com>
> > > > Sent: Thursday, May 25, 2017 10:23:17 AM
> > > > To: user@hbase.apache.org
> > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > >
> > > > I'm still looking to get hints on how to remove the dead regions. thanks
> > > >
> > > > ________________________________
> > > > From: jeff saremi <jeffsaremi@hotmail.com>
> > > > Sent: Wednesday, May 24, 2017 12:27:06 PM
> > > > To: user@hbase.apache.org
> > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > >
> > > > i'm trying to eliminate the dead region servers.
> > > >
> > > > ________________________________
> > > > From: Ted Yu <yuzhihong@gmail.com>
> > > > Sent: Wednesday, May 24, 2017 12:17:40 PM
> > > > To: user@hbase.apache.org
> > > > Subject: Re: What is Dead Region Servers and how to clear them up?
> > > >
> > > > bq. running hbck (many times
> > > >
> > > > Can you describe the specific inconsistencies you were trying to resolve?
> > > > Depending on the inconsistencies, advice can be given on the best known
> > > > hbck command arguments to use.
> > > >
> > > > Feel free to pastebin master log if needed.
> > > >
> > > > On Wed, May 24, 2017 at 12:10 PM, jeff saremi <jeffsaremi@hotmail.com>
> > > > wrote:
> > > >
> > > > > these are the things I have done so far:
> > > > >
> > > > >
> > > > > - restarting master (few times)
> > > > >
> > > > > - running hbck (many times; this tool does not seem to be doing anything
> > > > > at all)
> > > > >
> > > > > - checking the list of region servers in ZK (none of the dead ones are
> > > > > listed here)
> > > > >
> > > > > - checking the WALs under <hbase_hdfs>/WALs. Out of 11 dead ones only 3
> > > > > are listed here with "-splitting" at the end of their names and they
> > > > > contain one single file like: 1493846660401..meta.1493922323600.meta
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > ________________________________
> > > > > From: jeff saremi <jeffsaremi@hotmail.com>
> > > > > Sent: Wednesday, May 24, 2017 9:04:11 AM
> > > > > To: user@hbase.apache.org
> > > > > Subject: What is Dead Region Servers and how to clear them up?
> > > > >
> > > > > Apparently having dead region servers is so common that a section of the
> > > > > master console is dedicated to that?
> > > > > How can we clean this up (preferably in an automated fashion)? Why isn't
> > > > > this being done by HBase automatically?
> > > > >
> > > > >
> > > > > thanks
> > > > >
> > > >
> > >
> >
>
