hbase-user mailing list archives

From Sandeep L <sandeepvre...@outlook.com>
Subject RE: Handling regionserver crashes in production cluster
Date Mon, 02 Sep 2013 11:23:36 GMT
We are facing the same problem. Is it fixed in HBase 0.94.8 or 0.97.6?
If it is fixed, we will migrate. Can someone confirm this?
Thanks,
Sandeep

> From: nkeywal@gmail.com
> Date: Thu, 13 Jun 2013 09:00:46 +0200
> Subject: Re: Handling regionserver crashes in production cluster
> To: user@hbase.apache.org
> 
> Hum... So even a simple get shows the issue?
> It would be a (surprising) critical bug. Could you please try the 95.1 or
> the 94.8? Or write an unit test?
> 
> Thanks,
> 
> Nicolas
> 
> 
> On Thu, Jun 13, 2013 at 5:43 AM, kiran <kiran.sarvabhotla@gmail.com> wrote:
> 
> > Its a simple kill...
> > Scan is used using startrow and stoprow
> > Scan scan = new Scan(Bytes.toBytes("adidas"), Bytes.toBytes("adidas1"));
> >
> >
> > Our cluster size is 15. The load average when I see in master is 78%...It
> > is not that overloaded. but writes are happening in the cluster...
> >
> > Thanks
> > Kiran
> >
> >
> >
> > On Wed, Jun 12, 2013 at 10:49 PM, Nicolas Liochon <nkeywal@gmail.com>
> > wrote:
> >
> > > Yeah, it should not block the other regions.
> > >
> > > For the region server, was it a kill -9 or in simple kill (the former
> > > triggers a recovery, the later will close the region before stopping the
> > > process)?
> > >
> > > How do you select the scan scope? With stop/start rows?
> > > Can you share the client code you're using?
> > > What's the cluster size? Was it already very loaded before you killed the
> > > region server?
> > >
> > > Nicolas
> > >
> > >
> > >
> > > On Wed, Jun 12, 2013 at 6:11 PM, kiran <kiran.sarvabhotla@gmail.com>
> > > wrote:
> > >
> > > > Yes we killed the region server but datanode is still running on the
> > > > node...
> > > >
> > > > Sample Test scenario: Assume, I have table with pre-splits a upto z
> > > > (about 26 regions). I brought down region server purposefully with
> > > > regions having prefixes c and d. Then I used client API to scan data
> > > > from regions with prefixes other than c and d. The response was very
> > > > slow and sometimes not coming at all.
> > > >
> > > > My doubt was if only regions with prefix c and d are getting relocated
> > > > or in transition. Why is it affecting the regions with other
> > > > prefixes.... But once the region transition is over, the response is
> > > > very fast as expected.
> > > >
> > > >
> > > >
> > > > On Wed, Jun 12, 2013 at 8:50 PM, rajesh babu chintaguntla <
> > > > chrajeshbabu32@gmail.com> wrote:
> > > >
> > > > > You can configure below to more value to close more regions at a
> > > > > time.
> > > > >
> > > > >  <property>
> > > > >     <name>hbase.regionserver.executor.closeregion.threads</name>
> > > > >     <value>3</value>
> > > > >   </property>
> > > > >
> > > > >
> > > > > On Wed, Jun 12, 2013 at 7:38 PM, Nicolas Liochon <nkeywal@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > What was your test exactly? You killed -9 a region server but
> > > > > > kept the datanode alive?
> > > > > > Could you detail the queries you were doing?
> > > > > >
> > > > > >
> > > > > > On Wed, Jun 12, 2013 at 2:10 PM, kiran <kiran.sarvabhotla@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > It is not possible for us to migrate to new version immediately.
> > > > > > >
> > > > > > > @Anoop we purposefully brought down one regionserver, then we
> > > > > > > observed the website is taking too much time to respond. We
> > > > > > > observed the pattern for about 5 min till the regions are
> > > > > > > relocated.
> > > > > > > Also we issued queries in our website taking care that the
> > > > > > > queries didn't come under the regions in the regionserver we
> > > > > > > brought down.
> > > > > > >
> > > > > > > Is there any configuration workaround to mitigate it??
> > > > > > >
> > > > > > > Thanks
> > > > > > > Kiran
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Jun 6, 2013 at 8:27 PM, Jean-Marc Spaggiari
> > > > > > > <jean-marc@spaggiari.org> wrote:
> > > > > > >
> > > > > > > > Hi Kiran,
> > > > > > > >
> > > > > > > > Also, any chance for you to migrate to 0.94.8? There have been
> > > > > > > > hundreds of fixes since 0.94.1...
> > > > > > > >
> > > > > > > > JM
> > > > > > > >
> > > > > > > > 2013/6/6 Anoop John <anoop.hbase@gmail.com>:
> > > > > > > > > How many total RS in the cluster?  You mean u can not do any
> > > > > > > > > operation on other regions in the live clusters?  It should
> > > > > > > > > not happen..  Is it so happening that the client ops are
> > > > > > > > > targetted at the regions which were in the dead RS( and in
> > > > > > > > > transition now)?   Can u have a closer look and see?
> > > > > > > > > If not pls check the RS threads were they are getting
> > > > > > > > > blocked.
> > > > > > > > >
> > > > > > > > > -Anoop-
> > > > > > > > >
> > > > > > > > > On Wed, Jun 5, 2013 at 10:50 PM, kiran <kiran.sarvabhotla@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Dear All,
> > > > > > > > >>
> > > > > > > > >> We have production cluster that runs on hbase 0.94.1. The
> > > > > > > > >> issue we are facing is whenever one regionserver goes down,
> > > > > > > > >> the cluster becomes unresponsive until all the regions are
> > > > > > > > >> allocated to another regionserver(s). The transition is
> > > > > > > > >> taking about 3-5 mins and during this time we are unable to
> > > > > > > > >> do any client operation on the cluster.
> > > > > > > > >>
> > > > > > > > >> Is there any way we can make the transition to run in
> > > > > > > > >> background ?
> > > > > > > > >>
> > > > > > > > >> Also, it is acceptable for us if the client operations such
> > > > > > > > >> as scan or get does not work on the rowkeys of regions in
> > > > > > > > >> transition. But, they are not working on the entire cluster
> > > > > > > > >> until all the regions are moved out of transition. We can't
> > > > > > > > >> afford 3-5 minutes of downtime.
> > > > > > > > >>
> > > > > > > > >> --
> > > > > > > > >> Thank you
> > > > > > > > >> Kiran Sarvabhotla
> > > > > > > > >>
> > > > > > > > >> -----Even a correct decision is wrong when it is taken late
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Thank you
> > > > > > > Kiran Sarvabhotla
> > > > > > >
> > > > > > > -----Even a correct decision is wrong when it is taken late
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thank you
> > > > Kiran Sarvabhotla
> > > >
> > > > -----Even a correct decision is wrong when it is taken late
> > > >
> > >
> >
> >
> >
> > --
> > Thank you
> > Kiran Sarvabhotla
> >
> > -----Even a correct decision is wrong when it is taken late
> >
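
For reference, the test scenario quoted above (a table pre-split on a to z, regions c and d offline, and a scan from "adidas" to "adidas1") can be checked on paper with a minimal sketch like the one below. It is not HBase code: RegionLookupSketch and its methods are hypothetical names, the model assumes one region per start key "a" through "z", and it compares keys as plain ASCII strings rather than as bytes. It only shows which regions a half-open row range should touch, which is why the scan in the thread is not expected to depend on regions c or d.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not part of the HBase API: models a table pre-split
// into regions whose start keys are "a", "b", ..., "z" (an assumption taken
// from the test scenario described in the thread).
class RegionLookupSketch {

    // Start keys of the 26 pre-split regions.
    static TreeSet<String> startKeys() {
        TreeSet<String> keys = new TreeSet<>();
        for (char c = 'a'; c <= 'z'; c++) {
            keys.add(String.valueOf(c));
        }
        return keys;
    }

    // Returns the start keys of all regions overlapping the half-open row
    // range [startRow, stopRow), the same range shape a start/stop row scan uses.
    static List<String> regionsForScan(TreeSet<String> keys, String startRow, String stopRow) {
        // The region holding startRow has the greatest start key <= startRow.
        String first = keys.floor(startRow);
        if (first == null) {
            first = keys.first();
        }
        List<String> touched = new ArrayList<>();
        for (String key : keys.tailSet(first, true)) {
            // A region that starts at or after the exclusive stop row is never read.
            if (key.compareTo(stopRow) >= 0) {
                break;
            }
            touched.add(key);
        }
        return touched;
    }

    public static void main(String[] args) {
        List<String> touched = regionsForScan(startKeys(), "adidas", "adidas1");
        System.out.println("Regions touched: " + touched);
        System.out.println("Touches c or d: " + (touched.contains("c") || touched.contains("d")));
    }
}
```

Under these assumptions the scan stays inside the region starting at "a", so any slowdown seen while c and d are in transition would have to come from somewhere other than the row range itself.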
 		 	   		  