hbase-user mailing list archives

From Qiang Tian <tian...@gmail.com>
Subject Re: IPC Queue Size
Date Mon, 11 Aug 2014 09:03:16 GMT
update - it has been fixed in 0.98.4, HBASE-11374


On Sat, Aug 9, 2014 at 5:53 PM, Qiang Tian <tianq01@gmail.com> wrote:

> please see HBASE-11714
>
>
> On Sat, Aug 9, 2014 at 4:27 PM, Qiang Tian <tianq01@gmail.com> wrote:
>
>> took a quick look. it looks like a bug, will dig more and update later.
>>
>>
>>
>> On Sat, Aug 9, 2014 at 1:45 AM, Walter King <walter@adroll.com> wrote:
>>
>>> awesome, thanks!
>>> the resets and such were actually what I've been trying to investigate.
>>> We recently switched from 0.94 to 0.98, and are finding that things
>>> periodically have issues - lots of retry exceptions.  I don't have much
>>> information yet; I was still looking into it when I saw this.
>>>
>>> By timestamp, this seems to be the corresponding client exception:
>>>
>>> 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
>>> table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
>>> last exception: java.net.SocketTimeoutException: Call to
>>> ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
>>> because java.net.SocketTimeoutException: 2000 millis timeout while waiting
>>> for channel to be ready for read. ch :
>>> java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
>>> remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
>>> ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
>>> started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500 ops.
>>>
>>> 2014-08-08 17:22:53 o.a.h.h.c.AsyncProcess [INFO] #105158, waiting for some
>>> tasks to finish. Expected max=0, tasksSent=11, tasksDone=10,
>>> currentTasksDone=10, retries=10 hasError=false,
>>> tableName=rt_global_monthly_campaign_deliveries
>>>
>>>
>>> On Fri, Aug 8, 2014 at 2:15 AM, Esteban Gutierrez <esteban@cloudera.com>
>>> wrote:
>>>
>>> > The RequestContext seems to be ok since it hasn't been set. So just
>>> > decrementing the call size should be enough.
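>>> > In code terms the idea would be roughly the following (sketch only -
>>> > the method/field names here are approximate, not the exact 0.98
>>> > source):
>>> >
>>> >   // CallRunner.run(), at the very top, before doing any work
>>> >   if (!call.connection.channel.isOpen()) {
>>> >     // client is already gone: skip the call, but give back the bytes
>>> >     // it reserved when it was queued, otherwise callQueueSize only
>>> >     // ever grows and eventually trips CallQueueTooBigException
>>> >     rpcServer.addCallSize(-call.getSize());  // name approximate
>>> >     return;
>>> >   }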
>>> >
>>> > cheers,
>>> > esteban.
>>> >
>>> >
>>> > --
>>> > Cloudera, Inc.
>>> >
>>> >
>>> >
>>> > On Fri, Aug 8, 2014 at 2:10 AM, Esteban Gutierrez <esteban@cloudera.com> wrote:
>>> >
>>> > >
>>> > > Looks like after HBASE-10506 we didn't do what CallRunner used to do
>>> > > before: clear the RequestContext and decrement callQueueSize.
>>> > >
>>> > > Created HBASE-11705 to address this issue.
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > Cloudera, Inc.
>>> > >
>>> > >
>>> > >
>>> > > On Fri, Aug 8, 2014 at 1:56 AM, Qiang Tian <tianq01@gmail.com>
>>> wrote:
>>> > >
>>> > >> good point.  that is a big suspect.
>>> > >>
>>> > >> I checked your log. The ClosedChannelException should be triggered by
>>> > >> call.sendResponseIfReady() (it is the only request in the queue, so the
>>> > >> handler sends the response directly), but at that point the
>>> > >> callqueueSize has already been decremented.
>>> > >>
>>> > >> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
>>> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>>> > >> RpcServer.respondercallId: 118504 service: ClientService methodName: Multi
>>> > >> size: 141.9 K connection: 10.248.134.67:55347: output error
>>> > >> 2014-08-05 00:50:06,727 WARN  [RpcServer.handler=57,port=60020]
>>> > >> ipc.RpcServer (CallRunner.java:run(135)) - RpcServer.handler=57,port=60020:
>>> > >> caught a ClosedChannelException, this means that the server was processing
>>> > >> a request but the client went away. The error message was: null
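>>> > >> for reference, the direct-send path looks roughly like this
>>> > >> (simplified/paraphrased, not the exact RpcServer code):
>>> > >>
>>> > >>   // Responder.doRespond(call), reached from call.sendResponseIfReady()
>>> > >>   synchronized (call.connection.responseQueue) {
>>> > >>     call.connection.responseQueue.addLast(call);
>>> > >>     if (call.connection.responseQueue.size() == 1) {
>>> > >>       // queue was empty, so the handler thread writes the response
>>> > >>       // itself; a client that has gone away shows up right here as
>>> > >>       // the "output error" / ClosedChannelException in the log above
>>> > >>       processResponse(call.connection.responseQueue, true /* inHandler, signature approximate */);
>>> > >>     }
>>> > >>   }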
>>> > >>
>>> > >> it looks like you have found the fix, would you file a jira?
>>> > >> thanks.
>>> > >>
>>> > >>
>>> > >> On Fri, Aug 8, 2014 at 2:41 PM, Walter King <walter@adroll.com>
>>> wrote:
>>> > >>
>>> > >> > I've only looked at the code a little, and likely missed something,
>>> > >> > but does this if block decrement the call queue if the client already
>>> > >> > closed the connection?
>>> > >> >
>>> > >> > https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74
>>> > >> >
>>> > >> >
>>> > >> >
>>> > >> > On Thu, Aug 7, 2014 at 11:32 PM, Walter King <walter@adroll.com>
>>> > wrote:
>>> > >> >
>>> > >> > > Yes, sorry, CallQueueTooBigException. but that value never returns
>>> > >> > > to zero, even when the number of requests goes to zero.  The call
>>> > >> > > queue too big happens if any regionserver is up for a long enough
>>> > >> > > period of time, so I have to periodically restart them.  Also at
>>> > >> > > that 15:30 time I wasn't seeing that exception, but it seems like
>>> > >> > > that is one time in which a call didn't properly decrement the
>>> > >> > > callqueuesize, because it was at zero before and has never hit zero
>>> > >> > > again - today the minimum is even higher.
>>> > >> > >
>>> > >> > >
>>> > >> > > On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian <tianq01@gmail.com>
>>> > wrote:
>>> > >> > >
>>> > >> > >> bq. "Eventually we ran into ipc queue size full messages being
>>> > >> > >> returned to clients trying large batch puts, as it approaches a
>>> > >> > >> gigabyte."
>>> > >> > >>
>>> > >> > >> Do you mean CallQueueTooBigException? It looks like it is not the
>>> > >> > >> queue length but the total data size of the calls the client sends,
>>> > >> > >> configured by "hbase.ipc.server.max.callqueue.size".
>>> > >> > >>
>>> > >> > >> I guess when your client got the exception, it closed the
>>> > >> > >> connection, causing other RPCs on the shared connection to fail.
>>> > >> > >>
>>> > >> > >>
>>> > >> > >> 2014-08-06 22:27:57,253 WARN  [RpcServer.reader=9,port=60020]
>>> > >> > >> ipc.RpcServer (RpcServer.java:doRead(794)) -
>>> > >> > >> RpcServer.listener,port=60020: count of bytes read: 0
>>> > >> > >> java.io.IOException: Connection reset by peer
>>> > >> > >> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>> > >> > >> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>> > >> > >> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>> > >> > >> at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>> > >> > >> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>> > >> > >> at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
>>> > >> > >> at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
>>> > >> > >> at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
>>> > >> > >> at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
>>> > >> > >> at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
>>> > >> > >> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> > >> > >> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> > >> > >> at java.lang.Thread.run(Thread.java:744)
>>> > >> > >> 2014-08-06 22:27:57,257 WARN  [RpcServer.handler=18,port=60020]
>>> > >> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>>> > >> > >> RpcServer.respondercallId: 84968 service: ClientService methodName: Multi
>>> > >> > >> size: 17.7 K connection: 10.248.130.152:49780: output error
>>> > >> > >> 2014-08-06 22:27:57,258 WARN  [RpcServer.handler=18,port=60020]
>>> > >> > >> ipc.RpcServer (CallRunner.java:run(135)) - RpcServer.handler=18,port=60020:
>>> > >> > >> caught a ClosedChannelException, this means that the server was processing
>>> > >> > >> a request but the client went away. The error message was: null
>>> > >> > >> 2014-08-06 22:27:57,260 WARN  [RpcServer.handler=61,port=60020]
>>> > >> > >> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>>> > >> > >> RpcServer.respondercallId: 83907 service: ClientService methodName: Multi
>>> > >> > >> size: 17.1 K connection: 10.248.1.56:53615: output error
>>> > >> > >> 2014-08-06 22:27:57,263 WARN  [RpcServer.handler=61,port=60020]
>>> > >> > >> ipc.RpcServer (CallRunner.java:run(135)) - RpcServer.handler=61,port=60020:
>>> > >> > >> caught a ClosedChannelException, this means that the server was processing
>>> > >> > >> a request but the client went away. The error message was: null
>>> > >> > >>
>>> > >> > >>
>>> > >> > >>
>>> > >> > >> On Fri, Aug 8, 2014 at 2:57 AM, Walter King <walter@adroll.com> wrote:
>>> > >> > >>
>>> > >> > >> >
>>> > >> > >> > https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
>>> > >> > >> >
>>> > >> > >> > http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
>>> > >> > >> >
>>> > >> > >> > These are logs from that particular server, and the debug dump
>>> > >> > >> > from now (no restart in between).  The times in the graph are
>>> > >> > >> > pacific, so it should be around 2014-08-06 22:25:00.  I do see
>>> > >> > >> > some exceptions around there.
>>> > >> > >> >
>>> > >> > >>
>>> > >> > >
>>> > >> > >
>>> > >> >
>>> > >>
>>> > >
>>> > >
>>> >
>>>
>>
>>
>
