hbase-user mailing list archives

From mukund murrali <mukundmurra...@gmail.com>
Subject Fwd: HConnection thread waiting on blocking queue indefinitely
Date Thu, 18 Jun 2015 05:35:10 GMT
The issue persists even with 1.1.0: the client-side blocking wait still happens
during the first region split. We also tried 1.0.0 in a distributed setup, as
you suggested, with the same results.

Client jstack - http://pastebin.com/Ptw0JhdG

RS Hosting Table Log - http://pastebin.com/ZSD4YUE5

One point to note: the RS hosting hbase:meta showed no log entries for the
split, but the master had information about it. Why is that? Has hbase:meta
moved to the master?

Master Log: http://pastebin.com/f2suyNr1

Another interesting finding: in the thread stack of the RS hosting the table,
from the time the client hangs, there is an hconnection thread in the waiting
state. Subsequent thread dumps also show an hconnection thread in the waiting
state. Is there a deadlock? Please see whether this is of any use for the
analysis.

Thread Stack of RS hosting table - http://pastebin.com/rGbJyrPB

AM.ZK.Worker threads are also waiting in the Master. The pastebin of the
HMaster during the client hang and region split is:

http://pastebin.com/3pgVYpYW
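
To illustrate the kind of client-side wait discussed in this thread, here is a
minimal sketch that uses only JDK classes. It is not HBase code: the JDK's
ExecutorCompletionService stands in for HBase's BoundedCompletionService, and
the names BoundedWaitSketch and resultArrivedWithin are made up for the
example. It shows that take() puts no upper bound on the wait, whereas
poll() with a timeout returns null when nothing completes in time.

```java
import java.util.concurrent.CompletionService;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of HBase.
class BoundedWaitSketch {

    /**
     * Submits one task that never finishes on its own (a stand-in for a call
     * that never gets a reply) and waits for it for at most timeoutMillis.
     * Returns true if a result arrived in time, false if the wait timed out.
     */
    static boolean resultArrivedWithin(long timeoutMillis) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CompletionService<String> completions = new ExecutorCompletionService<>(pool);
        CountDownLatch neverReleased = new CountDownLatch(1);
        try {
            completions.submit(() -> {
                neverReleased.await(); // blocks until the thread is interrupted
                return "done";
            });
            // completions.take() would block here with no upper bound.
            // poll(timeout, unit) gives up and returns null instead.
            Future<String> finished = completions.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            return finished != null;
        } finally {
            pool.shutdownNow(); // interrupts the stuck task so the JVM can exit
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("result arrived: " + resultArrivedWithin(200));
    }
}
```

With the stuck task above, resultArrivedWithin(200) returns false after about
200 ms instead of hanging. Whether the HBase client exposes a comparable
timeout on this code path is not something this sketch shows.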

Thanks

On Thu, Jun 11, 2015 at 10:48 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Looking at the revision history for ClientSmallReversedScanner.java which
> appeared in the stack trace, there have been several bug fixes on top of
> the hbase release you're using.
>
> Can you try hbase 1.1.0 to see if the problem can be reproduced (in cluster
> deployment) ?
>
> Thanks
>
> On Tue, Jun 9, 2015 at 11:42 PM, mukund murrali <mukundmurrali9@gmail.com>
> wrote:
>
> > Kindly look into this for full trace of RS.
> > http://pastebin.com/VS17vVd8
> >
> > Thanks
> >
> > On Wed, Jun 10, 2015 at 11:35 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> > > Can you pastebin the complete stack trace for the region server ?
> > >
> > > Thanks
> > >
> > >
> > >
> > > > > On Jun 9, 2015, at 10:52 PM, mukund murrali <mukundmurrali9@gmail.com> wrote:
> > > > >
> > > > > We are using HBase-1.0.0. Just before the client stalled, a few handler
> > > > > threads in the RS were blocked on an MVCC check (thread stack below). We
> > > > > are not sure whether that could cause a problem. I don't see anything
> > > > > unusual in the RS threads. Also, the same client can connect to the
> > > > > regionserver after a restart. What is causing the problem at that
> > > > > instant is what confuses us.
> > > >
> > > >
> > > > > java.lang.Thread.State: BLOCKED (on object monitor)
> > > > >        at java.lang.Object.wait(Native Method)
> > > > >        at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.waitForPreviousTransactionsComplete(MultiVersionConsistencyControl.java:224)
> > > > >        - locked <0x00000007ac0e0e88> (a java.util.LinkedList)
> > > > >        at org.apache.hadoop.hbase.regionserver.MultiVersionConsistencyControl.completeMemstoreInsertWithSeqNum(MultiVersionConsistencyControl.java:127)
> > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2822)
> > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2476)
> > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2430)
> > > > >        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2434)
> > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:640)
> > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:604)
> > > > >        at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:1832)
> > > > >        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:31313)
> > > > >        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2031)
> > > > >        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
> > > > >        at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
> > > > >        at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
> > > > >        at java.lang.Thread.run(Thread.java:745)
> > > >
> > > >
> > > >
> > > >
> > > > >> On Tue, Jun 9, 2015 at 6:48 PM, Anoop John <anoop.hbase@gmail.com> wrote:
> > > > >>
> > > > >> Can you see what the threads at the RS are doing at this time? Handlers,
> > > > >> mainly. Which version of hbase?
> > > >>
> > > > >>> On Tuesday, June 9, 2015, mukund murrali <mukundmurrali9@gmail.com> wrote:
> > > > >>> Hi
> > > > >>>
> > > > >>> I wrote a sample program with default client configurations and created
> > > > >>> a single connection. I spawn more client threads than
> > > > >>> hbase.hconnection.threads.max from my client application, and each
> > > > >>> thread inserts data into the hbase cluster. Once a region split happens,
> > > > >>> all the hconnection threads (core pool and max pool size were kept at
> > > > >>> 256) stalled at BoundedCompletionService.take() indefinitely. Even after
> > > > >>> the split completed, they never resumed.
> > > > >>>
> > > > >>> So does it mean I have to create more instances of the connection object
> > > > >>> for a cluster in such scenarios (which is really not needed)? There was
> > > > >>> also no exception on the client side (I expected a RejectedExecution).
> > > > >>> So can changing hbase.hconnection.threads.max and
> > > > >>> hbase.hconnection.threads.core create such a problem?
> > > >>>
> > > >>>
> > > >>>
> > > >>> On Sat, Jun 6, 2015 at 5:02 PM, ramkrishna vasudevan <
> > > >>> ramkrishna.s.vasudevan@gmail.com> wrote:
> > > >>>
> > > > >>>> I am not very sure what the problem could be when the meta update
> > > > >>>> happened. I would think that when the region split happened, there was
> > > > >>>> some issue with the meta update (as you said in the later mail). The
> > > > >>>> split regions would not have been updated properly in META, so any
> > > > >>>> client updates/reads going to this region would have stalled, and hence
> > > > >>>> your client application stalled as well.
> > > > >>>>
> > > > >>>> As I said, the logs would be important here to know what happened. This
> > > > >>>> could be one such case, and it could be identified with the logs.
> > > >>>>
> > > >>>> Regards
> > > >>>> Ram
> > > >>>>
> > > > >>>> On Sat, Jun 6, 2015 at 1:25 PM, mukund murrali <mukundmurrali9@gmail.com>
> > > > >>>> wrote:
> > > > >>>>
> > > > >>>>> Sorry for misleading you by describing it as a meta split. It was a
> > > > >>>>> meta update during a user region split. This probably caused the
> > > > >>>>> stall. We have reverted the client configs for now, and so far we have
> > > > >>>>> not faced the issue again. We expected those changes to cause some kind
> > > > >>>>> of exception or timeout, but clients stalling indefinitely is what
> > > > >>>>> worries us.
> > > >>>>>
> > > > >>>>> On Friday 5 June 2015, Vladimir Rodionov <vladrodionov@gmail.com> wrote:
> > > > >>>>>
> > > > >>>>>> I would suggest reverting client config changes back to defaults. At
> > > > >>>>>> least we will know if the issue is somehow related to client config
> > > > >>>>>> changes.
> > > > >>>>>> On Jun 5, 2015 6:15 AM, "ramkrishna vasudevan" <
> > > > >>>>>> ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > >>>>>>
> > > > >>>>>>> Hbase:meta getting split? It may be some user region; can you check
> > > > >>>>>>> that? If your meta was splitting then there is something wrong.
> > > > >>>>>>> Can you attach the log snippets?
> > > >>>>>>>
> > > >>>>>>> Sent from phone. Excuse typos.
> > > > >>>>>>> On Jun 5, 2015 6:00 PM, "mukund murrali" <mukundmurrali9@gmail.com> wrote:
> > > > >>>>>>>
> > > > >>>>>>>> Hi
> > > > >>>>>>>>
> > > > >>>>>>>> In our case, at the instant when the client thread stalled, there was
> > > > >>>>>>>> an hbase:meta region split happening. So what went wrong? If there is
> > > > >>>>>>>> a split, why should the hconnection thread stall? Did our changes to
> > > > >>>>>>>> the client configuration cause this? I am once again specifying the
> > > > >>>>>>>> client-related changes we made:
> > > > >>>>>>>>
> > > > >>>>>>>> hbase.client.retries.number => 5
> > > > >>>>>>>> zookeeper.recovery.retry => 0
> > > > >>>>>>>> zookeeper.session.timeout => 1000
> > > > >>>>>>>> zookeeper.recovery.retry.intervalmilli => 1
> > > > >>>>>>>> hbase.rpc.timeout => 30000
> > > > >>>>>>>>
> > > > >>>>>>>> Is the zk timeout too low?
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > >>>>>>>>
> > > > >>>>>>>> On Fri, Jun 5, 2015 at 11:37 AM, ramkrishna vasudevan <
> > > > >>>>>>>> ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > >>>>>>>>
> > > > >>>>>>>>> When you started your client server, was the META table assigned?
> > > > >>>>>>>>> Maybe something happened around that time and the client app was just
> > > > >>>>>>>>> waiting on the meta table to be assigned. It would have retried. Can
> > > > >>>>>>>>> you check the logs?
> > > > >>>>>>>>>
> > > > >>>>>>>>> So the best part here is that the standalone client was successful,
> > > > >>>>>>>>> which means the new clients were able to talk successfully with the
> > > > >>>>>>>>> server. And hence the restart of your client has solved your problem.
> > > > >>>>>>>>> It may be difficult to troubleshoot the exact issue with the limited
> > > > >>>>>>>>> info, but see if your client app regularly gets stalled; if so, it is
> > > > >>>>>>>>> better to troubleshoot your app and the way it accesses the server.
> > > >>>>>>>>>
> > > > >>>>>>>>> On Fri, Jun 5, 2015 at 11:21 AM, PRANEESH KUMAR <
> > > > >>>>>>>>> praneesh.sankar@gmail.com> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>>> The client connection was in a stalled state. But there was only one
> > > > >>>>>>>>>> hconnection thread found in our thread dump, which was waiting
> > > > >>>>>>>>>> indefinitely in the BoundedCompletionService.take call. Meanwhile we
> > > > >>>>>>>>>> ran a standalone test program, which was successful.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> Once we restarted the client server, the problem got resolved.
> > > > >>>>>>>>>>
> > > > >>>>>>>>>> The basic doubt is: when the hconnection thread stalled, why did the
> > > > >>>>>>>>>> HBase client fail to create any more hconnections (max pool size was
> > > > >>>>>>>>>> 10)? And if there was a problem with the table/meta regions, how come
> > > > >>>>>>>>>> the test program succeeded?
> > > >>>>>>>>>>
> > > >>>>>>>>>> Regards,
> > > >>>>>>>>>> Praneesh
> > > >>>>>>>>>>
> > > > >>>>>>>>>> On Fri, Jun 5, 2015 at 10:21 AM, ramkrishna vasudevan <
> > > > >>>>>>>>>> ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > >>>>>>>>>>
> > > > >>>>>>>>>>> Can you tell us more? Is your client not working at all, and is it
> > > > >>>>>>>>>>> stalled? Or are you seeing some results but finding it slower than
> > > > >>>>>>>>>>> you expected?
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> What type of workload are you running? Are all the tables healthy?
> > > > >>>>>>>>>>> Are you able to read or write to them individually using the hbase
> > > > >>>>>>>>>>> shell?
> > > >>>>>>>>>>>
> > > > >>>>>>>>>>> On Fri, Jun 5, 2015 at 10:18 AM, PRANEESH KUMAR <
> > > > >>>>>>>>>>> praneesh.sankar@gmail.com> wrote:
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>> Hi Ram,
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> The cluster ran without any problem for about 2 to 3 days with low
> > > > >>>>>>>>>>>> load; once we enabled it for high load we immediately faced this
> > > > >>>>>>>>>>>> issue.
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>>
> > > >>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>> Praneesh.
> > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On Thursday 4 June 2015, ramkrishna vasudevan <
> > > > >>>>>>>>>>>> ramkrishna.s.vasudevan@gmail.com> wrote:
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>> Is your cluster in working condition? Can you see if the META has
> > > > >>>>>>>>>>>>> been assigned properly? If the META table is not initialized and
> > > > >>>>>>>>>>>>> opened, then your client thread will hang.
> > > >>>>>>>>>>>>>
> > > >>>>>>>>>>>>> Regards
> > > >>>>>>>>>>>>> Ram
> > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>> On Thu, Jun 4, 2015 at 9:05 PM, PRANEESH KUMAR <
> > > > >>>>>>>>>>>>> praneesh.sankar@gmail.com> wrote:
> > > > >>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> Hi,
> > > > >>>>>>>>>>>>>>
> > > > >>>>>>>>>>>>>> We are using Hbase-1.0.0. We are also facing the same issue: the
> > > > >>>>>>>>>>>>>> client connection thread is waiting at
> > > > >>>>>>>>>>>>>> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1200).
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Any help is appreciated.
> > > >>>>>>>>>>>>>>
> > > >>>>>>>>>>>>>> Regards,
> > > >>>>>>>>>>>>>> Praneesh
> > > >>
> > >
> >
>
