hbase-user mailing list archives

From PRANEESH KUMAR <praneesh.san...@gmail.com>
Subject Re: HConnection thread waiting on blocking queue indefinitely
Date Tue, 30 Jun 2015 13:10:37 GMT
Hi,

We are still facing this issue in production.

Any help is appreciated.

Thanks,

Praneesh

On Mon, Jun 22, 2015 at 11:19 AM, mukund murrali <mukundmurrali9@gmail.com>
wrote:

> Hi All
>
> I have moved this to JIRA:
>
> https://issues.apache.org/jira/browse/HBASE-13942
>
> Please post all your opinions there.
>
> Thanks
>
> On Mon, Jun 22, 2015 at 10:53 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > I was out of the country this past week where access to gmail was
> > difficult.
> >
> > Looking at client stack trace, it seems the hang corresponded to the
> > following:
> >         at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:145)
> >         at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200)
> >
> > Will continue digging through the stack traces / logs.
> >
> > Cheers
> >
> > On Wed, Jun 17, 2015 at 10:35 PM, mukund murrali <mukundmurrali9@gmail.com> wrote:
> >
> > > > Even with 1.1.0 the issue persists. The client-side blocking wait still
> > > > happens during the first region split. We also tried a distributed setup
> > > > with 1.0.0, as you suggested, and got the same results.
> > > >
> > > > Client jstack - http://pastebin.com/Ptw0JhdG
> > > >
> > > > RS Hosting Table Log - http://pastebin.com/ZSD4YUE5
> > > >
> > > > One point to note: the RS hosting hbase:meta showed no log entries for the
> > > > split, but the master had information about it. Why is that? Has hbase:meta
> > > > moved to the master?
> > > >
> > > > Master Log: http://pastebin.com/f2suyNr1
> > > >
> > > > Another interesting finding: in the thread stack of the RS hosting the
> > > > table, from the time the client hangs, there is an hconnection thread in
> > > > waiting state. Subsequent thread dumps also show an hconnection thread in
> > > > waiting state. Is there a deadlock? Please see whether this is of any use
> > > > for the analysis.
> > > >
> > > > Thread Stack of RS hosting table - http://pastebin.com/rGbJyrPB
> > > >
> > > > AM.ZK.Worker threads are also waiting in the Master. The pastebin of the
> > > > HMaster during the client hang and region split is
> > > >
> > > > http://pastebin.com/3pgVYpYW
> > >
> > > Thanks
> > >
> > > On Thu, Jun 11, 2015 at 10:48 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > >
> > > > > Looking at the revision history for ClientSmallReversedScanner.java, which
> > > > > appeared in the stack trace, there have been several bug fixes on top of
> > > > > the hbase release you're using.
> > > > >
> > > > > Can you try hbase 1.1.0 to see if the problem can be reproduced (in a
> > > > > cluster deployment)?
> > > >
> > > > Thanks
> > > >
> > > > > On Tue, Jun 9, 2015 at 11:42 PM, mukund murrali <mukundmurrali9@gmail.com> wrote:
> > > >
> > > > > > Please see this pastebin for the full trace of the RS:
> > > > > http://pastebin.com/VS17vVd8
> > > > >
> > > > > Thanks
> > > > >
> > > > > > On Wed, Jun 10, 2015 at 11:35 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > >
> > > > > > > Can you pastebin the complete stack trace for the region server?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > >
> > > > > >
> > > > > > > > On Jun 9, 2015, at 10:52 PM, mukund murrali <mukundmurrali9@gmail.com> wrote:
> > > > > > > >
> > > > > > > > We are using HBase-1.0.0. Just before the client stalled, a few
> > > > > > > > handler threads in the RS were blocked on an MVCC check (thread
> > > > > > > > stack below). Not sure if that could cause a problem. I don't see
> > > > > > > > anything unusual in the RS threads. Also, the same client can
> > > > > > > > connect to the regionserver after a restart. What confuses us is
> > > > > > > > what was causing the problem at that instant.
> > > > > > >
> > > > > > >
> > > > > > > > java.lang.Thread.State: BLOCKED (on object monitor)
> > > > > > > >        at java.lang.Object.wait(Native Method)
> > > > > > > >        at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.waitForPreviousTransactionsComplete(MultiVersionConsistencyControl.java:224)
> > > > > > > >        - locked <0x00000007ac0e0e88> (a java.util.LinkedList)
> > > > > > > >        at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.completeMemstoreInsertWithSeqNum(MultiVersionConsistencyControl.java:127)
> > > > > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2822)
> > > > > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2476)
> > > > > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2430)
> > > > > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2434)
> > > > > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:640)
> > > > > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:604)
> > > > > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1832)
> > > > > > > >        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31313)
> > > > > > > >        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
> > > > > > > >        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > > > > > >        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > > > > > >        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > > > > > >        at java.lang.Thread.run(Thread.java:745)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >> On Tue, Jun 9, 2015 at 6:48 PM, Anoop John <anoop.hbase@gmail.com> wrote:
> > > > > > > >>
> > > > > > > >> Can you see what the threads at the RS are doing at this time? Handlers
> > > > > > > >> mainly. Which version of hbase?
> > > > > > >>
> > > > > > > >>> On Tuesday, June 9, 2015, mukund murrali <mukundmurrali9@gmail.com> wrote:
> > > > > > > >>> Hi
> > > > > > > >>>
> > > > > > > >>> I wrote a sample program with default client configurations and
> > > > > > > >>> created a single connection. I spawn more client threads than
> > > > > > > >>> hbase.hconnection.threads.max from my client application, and each
> > > > > > > >>> thread inserts data into the hbase cluster. Once a region split
> > > > > > > >>> happens, all the hconnection threads (core pool and max pool size
> > > > > > > >>> were kept at 256) stall at BoundedCompletionService.take()
> > > > > > > >>> indefinitely. Even after the split completed, they never resumed.
> > > > > > > >>>
> > > > > > > >>> So does it mean I have to create more instances of the connection
> > > > > > > >>> object for a cluster in such scenarios (which is really not needed)?
> > > > > > > >>> There was also no exception on the client side (I expected a
> > > > > > > >>> RejectedExecution). So can changing hbase.hconnection.threads.max and
> > > > > > > >>> hbase.hconnection.threads.core create such a problem?
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > > >>> On Sat, Jun 6, 2015 at 5:02 PM, ramkrishna vasudevan <ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > >>>> Not very sure what the problem could be when the meta update
> > > > > > > >>>> happened. I would think that when the region split happened, there
> > > > > > > >>>> was some issue with the meta update (as you said in the later mail).
> > > > > > > >>>> The split regions would not have been updated properly in META. So
> > > > > > > >>>> any client updates/reads going to this region would have stalled,
> > > > > > > >>>> and hence your client application also stalled.
> > > > > > > >>>>
> > > > > > > >>>> As I said, the logs would be important here to know what happened.
> > > > > > > >>>> This could be one such case, and it could be identified from the
> > > > > > > >>>> logs.
> > > > > > >>>>
> > > > > > >>>> Regards
> > > > > > >>>> Ram
> > > > > > >>>>
> > > > > > > >>>> On Sat, Jun 6, 2015 at 1:25 PM, mukund murrali <mukundmurrali9@gmail.com> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> Sorry for the confusion in calling it a meta split. It was a meta
> > > > > > > >>>>> update during a user region split. This probably caused the stall.
> > > > > > > >>>>> We have reverted the client configs for now, and so far we have not
> > > > > > > >>>>> faced the issue again. We expected those changes to cause some kind
> > > > > > > >>>>> of exception or timeout; clients stalling indefinitely is what
> > > > > > > >>>>> worries us.
> > > > > > >>>>>
> > > > > > > >>>>> On Friday 5 June 2015, Vladimir Rodionov <vladrodionov@gmail.com> wrote:
> > > > > > > >>>>>
> > > > > > > >>>>>> I would suggest reverting client config changes back to defaults.
> > > > > > > >>>>>> At least we will know if the issue is somehow related to client
> > > > > > > >>>>>> config changes.
> > > > > > > >>>>>> On Jun 5, 2015 6:15 AM, "ramkrishna vasudevan" <ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > > > > >>>>>>
> > > > > > > >>>>>>> Hbase:meta getting split? It may be some user region, can you
> > > > > > > >>>>>>> check that? If your meta was splitting then there is something
> > > > > > > >>>>>>> wrong. Can you attach the log snippets?
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Sent from phone. Excuse typos.
> > > > > > > >>>>>>> On Jun 5, 2015 6:00 PM, "mukund murrali" <mukundmurrali9@gmail.com> wrote:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>> Hi
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> In our case, at the instant the client thread stalled, there was
> > > > > > > >>>>>>>> an hbase:meta region split happening. So what went wrong? If
> > > > > > > >>>>>>>> there is a split, why should the hconnection thread stall? Did
> > > > > > > >>>>>>>> our client configuration changes cause this? Once again, these
> > > > > > > >>>>>>>> are the client-related changes we made:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> hbase.client.retries.number => 5
> > > > > > > >>>>>>>> zookeeper.recovery.retry => 0
> > > > > > > >>>>>>>> zookeeper.session.timeout => 1000
> > > > > > > >>>>>>>> zookeeper.recovery.retry.intervalmilli => 1
> > > > > > > >>>>>>>> hbase.rpc.timeout => 30000
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Is the zk timeout too low?
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>
> > > > > > > >>>>>>>> On Fri, Jun 5, 2015 at 11:37 AM, ramkrishna vasudevan <ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>> When you started your client server, was the META table assigned?
> > > > > > > >>>>>>>>> Maybe something happened around that time and the client app was
> > > > > > > >>>>>>>>> just waiting on the meta table to be assigned. It would have
> > > > > > > >>>>>>>>> retried. Can you check the logs?
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> So the good part here is that the standalone client was
> > > > > > > >>>>>>>>> successful, which means new clients were able to talk to the
> > > > > > > >>>>>>>>> server successfully. And hence the restart of your client solved
> > > > > > > >>>>>>>>> your problem. It may be difficult to troubleshoot the exact
> > > > > > > >>>>>>>>> issue with the limited info, but if your client app regularly
> > > > > > > >>>>>>>>> gets stalled, then it is better to troubleshoot your app and the
> > > > > > > >>>>>>>>> way it accesses the server.
> > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> On Fri, Jun 5, 2015 at 11:21 AM, PRANEESH KUMAR <praneesh.sankar@gmail.com> wrote:
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>>> The client connection was in a stalled state. But only one
> > > > > > > >>>>>>>>>> hconnection thread was found in our thread dump, and it was
> > > > > > > >>>>>>>>>> waiting indefinitely in the BoundedCompletionService.take call.
> > > > > > > >>>>>>>>>> Meanwhile we ran a standalone test program, which was
> > > > > > > >>>>>>>>>> successful.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Once we restarted the client server, the problem got resolved.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> The basic doubt is: when the hconnection thread stalled, why did
> > > > > > > >>>>>>>>>> the HBase client fail to create any more hconnections (max pool
> > > > > > > >>>>>>>>>> size was 10)? And if there was a problem with the table/meta
> > > > > > > >>>>>>>>>> regions, how come the test program succeeded?
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Regards,
> > > > > > > >>>>>>>>>> Praneesh
> > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> On Fri, Jun 5, 2015 at 10:21 AM, ramkrishna vasudevan <ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>>> Can you tell us more? Is your client not working at all and
> > > > > > > >>>>>>>>>>> stalled? Or are you seeing some results but finding it slower
> > > > > > > >>>>>>>>>>> than you expected?
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> What type of workload are you running? Are all the tables
> > > > > > > >>>>>>>>>>> healthy? Are you able to read or write to them individually
> > > > > > > >>>>>>>>>>> using the hbase shell?
> > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>> On Fri, Jun 5, 2015 at 10:18 AM, PRANEESH KUMAR <praneesh.sankar@gmail.com> wrote:
> > > > > > > >>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Hi Ram,
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> The cluster ran without any problem for about 2 to 3 days with
> > > > > > > >>>>>>>>>>>> low load; once we enabled it for high load, we immediately
> > > > > > > >>>>>>>>>>>> faced this issue.
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> Regards,
> > > > > > > >>>>>>>>>>>> Praneesh.
> > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>> On Thursday 4 June 2015, ramkrishna vasudevan <ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > > > > >>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Is your cluster in working condition? Can you see if META has
> > > > > > > >>>>>>>>>>>>> been assigned properly? If the META table is not initialized
> > > > > > > >>>>>>>>>>>>> and opened, then your client thread will hang.
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> Regards
> > > > > > > >>>>>>>>>>>>> Ram
> > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>> On Thu, Jun 4, 2015 at 9:05 PM, PRANEESH KUMAR <praneesh.sankar@gmail.com> wrote:
> > > > > > > >>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> Hi,
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> We are using Hbase-1.0.0. We are also facing the same issue:
> > > > > > > >>>>>>>>>>>>>> the client connection thread is waiting at
> > > > > > > >>>>>>>>>>>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200).
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> Any help is appreciated.
> > > > > > > >>>>>>>>>>>>>>
> > > > > > > >>>>>>>>>>>>>> Regards,
> > > > > > > >>>>>>>>>>>>>> Praneesh
>
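
The hang reported throughout this thread is a thread parked in BoundedCompletionService.take() with no deadline. The sketch below is not HBase code. It uses the JDK's ExecutorCompletionService as a stand-in to show the general difference between an unbounded take() and a bounded poll(timeout, unit). The class name BoundedWaitSketch, the method waitWithDeadline, and the simulated stuck task are assumptions made purely for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class BoundedWaitSketch {

    /**
     * Submits one task and waits for its result with a deadline.
     * When taskCompletes is false the task blocks until it is released,
     * which stands in for a remote call that never returns.
     * Returns true if a result arrived in time, false on timeout.
     */
    public static boolean waitWithDeadline(long timeoutMillis, boolean taskCompletes) {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        ExecutorCompletionService<String> cs = new ExecutorCompletionService<>(pool);
        CountDownLatch release = new CountDownLatch(1);
        try {
            cs.submit(() -> {
                if (!taskCompletes) {
                    release.await(); // hypothetical stuck call
                }
                return "located";
            });
            // cs.take() at this point would block forever for the stuck task.
            // poll() with a timeout returns null once the deadline passes.
            Future<String> done = cs.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            return done != null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            release.countDown();  // let the simulated task finish
            pool.shutdownNow();   // do not leak the worker thread
        }
    }

    public static void main(String[] args) {
        System.out.println("stuck task completed in time: " + waitWithDeadline(200, false));
        System.out.println("healthy task completed in time: " + waitWithDeadline(2000, true));
    }
}
```

With a bounded wait the caller gets control back and can raise a timeout or retry, which is the behaviour the posters say they expected instead of an indefinite stall.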

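Several replies above rely on repeated jstack dumps to spot hconnection threads sitting in a waiting state. As a rough in-process alternative, the sketch below lists live threads whose name starts with a given prefix and whose state is WAITING, using only Thread.getAllStackTraces(). The class WaitingThreadScan, the prefix "hconnection", and the demo thread name are assumptions for illustration; the actual pool thread names in an HBase client may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

public class WaitingThreadScan {

    /** Returns the names of live threads that start with prefix and are in WAITING state. */
    public static List<String> waitingThreads(String prefix) {
        List<String> names = new ArrayList<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith(prefix) && t.getState() == Thread.State.WAITING) {
                names.add(t.getName());
            }
        }
        return names;
    }

    /**
     * Demo helper: parks a thread named "hconnection-demo" on a latch and
     * reports whether the scan finds it. The thread name is hypothetical.
     */
    public static boolean demo() {
        CountDownLatch latch = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            try {
                latch.await(); // parks the thread, so its state becomes WAITING
            } catch (InterruptedException ignored) {
                // exit quietly when interrupted
            }
        }, "hconnection-demo");
        t.setDaemon(true);
        t.start();
        try {
            // Give the thread up to two seconds to reach the WAITING state.
            long deadline = System.currentTimeMillis() + 2000;
            while (t.getState() != Thread.State.WAITING
                    && System.currentTimeMillis() < deadline) {
                Thread.sleep(10);
            }
            return waitingThreads("hconnection").contains("hconnection-demo");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            latch.countDown(); // release the demo thread
        }
    }

    public static void main(String[] args) {
        System.out.println("found waiting demo thread: " + demo());
    }
}
```

A scan like this only shows that a thread is waiting, not why; a full thread dump is still needed to see the stack frames, as requested in the thread.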