Subject: Re: IPC Queue Size
From: Qiang Tian <tianq01@gmail.com>
To: user@hbase.apache.org
Date: Sat, 9 Aug 2014 17:53:04 +0800

please see HBASE-11714

On Sat, Aug 9, 2014 at 4:27 PM, Qiang Tian wrote:

> took a quick look. it looks like a bug; I will dig more and update later.
>
> On Sat, Aug 9, 2014 at 1:45 AM, Walter King wrote:
>
>> awesome, thanks!
>> the resets and such were actually what I've been trying to investigate.
>> We recently switched from 0.94 to 0.98 and are finding that things
>> periodically have issues - lots of retry exceptions. I don't have much
>> information yet; I was still looking into it when I saw this.
>>
>> By timestamp, this seems to be the corresponding client exception:
>>
>> 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
>> table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
>> last exception: java.net.SocketTimeoutException: Call to
>> ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
>> because java.net.SocketTimeoutException: 2000 millis timeout while waiting
>> for channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
>> remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
>> ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
>> started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
>> ops.
>>
>> 2014-08-08 17:22:53 o.a.h.h.c.AsyncProcess [INFO] #105158, waiting for
>> some tasks to finish. Expected max=0, tasksSent=11, tasksDone=10,
>> currentTasksDone=10, retries=10 hasError=false,
>> tableName=rt_global_monthly_campaign_deliveries
>>
>> On Fri, Aug 8, 2014 at 2:15 AM, Esteban Gutierrez wrote:
>>
>>> The RequestContext seems to be ok since it hasn't been set, so just
>>> decrementing the call size should be enough.
>>>
>>> cheers,
>>> esteban.
>>>
>>> --
>>> Cloudera, Inc.
>>>
>>> On Fri, Aug 8, 2014 at 2:10 AM, Esteban Gutierrez wrote:
>>>
>>>> Looks like after HBASE-10506 we didn't do what CallRunner used to do
>>>> before: clear the RequestContext and decrement callQueueSize.
>>>>
>>>> Created HBASE-11705 to address this issue.
>>>>
>>>> --
>>>> Cloudera, Inc.
>>>>
>>>> On Fri, Aug 8, 2014 at 1:56 AM, Qiang Tian wrote:
>>>>
>>>>> good point, that is a big suspect.
>>>>>
>>>>> I checked your log. The ClosedChannelException should be triggered by
>>>>> call.sendResponseIfReady() (it was the only request in the queue, so
>>>>> the handler sent the response directly), but at that point the
>>>>> callQueueSize had already been decremented.
>>>>>
>>>>> 2014-08-05 00:50:06,727 WARN [RpcServer.handler=57,port=60020]
>>>>> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>>>>> RpcServer.respondercallId: 118504 service: ClientService methodName:
>>>>> Multi size: 141.9 K connection: 10.248.134.67:55347: output error
>>>>> 2014-08-05 00:50:06,727 WARN [RpcServer.handler=57,port=60020]
>>>>> ipc.RpcServer (CallRunner.java:run(135)) -
>>>>> RpcServer.handler=57,port=60020: caught a ClosedChannelException, this
>>>>> means that the server was processing a request but the client went
>>>>> away. The error message was: null
>>>>>
>>>>> it looks like you have found the fix; would you file a jira?
>>>>> thanks.
>>>>>
>>>>> On Fri, Aug 8, 2014 at 2:41 PM, Walter King wrote:
>>>>>
>>>>>> I've only looked at the code a little and likely missed something,
>>>>>> but does this if block decrement the call queue if the client has
>>>>>> already closed the connection?
>>>>>>
>>>>>> https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74
>>>>>>
>>>>>> On Thu, Aug 7, 2014 at 11:32 PM, Walter King wrote:
>>>>>>
>>>>>>> Yes, sorry, CallQueueTooBigException. But that value never returns
>>>>>>> to zero, even when the number of requests goes to zero. The "call
>>>>>>> queue too big" errors happen once any regionserver has been up for
>>>>>>> a long enough period of time, so I have to restart them
>>>>>>> periodically. Also, at that 15:30 time I wasn't seeing that
>>>>>>> exception, but it seems like that was one point at which a call
>>>>>>> didn't properly decrement the callQueueSize, because it was at zero
>>>>>>> before and has never hit zero again - today the minimum is even
>>>>>>> higher.
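A minimal sketch of the accounting pattern at issue, with hypothetical names
rather than the actual CallRunner code: the callQueueSize decrement has to
sit on every exit path, including the early return taken when the client has
already gone away.

    import java.util.concurrent.atomic.AtomicLong;

    final class CallRunnerSketch {
      // aggregate bytes of queued call data, shared with the RPC scheduler
      private final AtomicLong callQueueSize = new AtomicLong();

      void run(Call call) {
        try {
          if (!call.connectionIsOpen()) {
            return; // early exit: without the finally below, call.size() leaks
          }
          call.execute();
        } finally {
          // runs on every path, early return and exceptions included
          callQueueSize.addAndGet(-call.size());
        }
      }

      interface Call {
        boolean connectionIsOpen();
        void execute();
        long size();
      }
    }

If the early return skips the decrement, each request from an
already-disconnected client permanently inflates the counter, which matches
the never-returns-to-zero behavior described above.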
>>>>>>> On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian wrote:
>>>>>>>
>>>>>>>> bq. "Eventually we ran into ipc queue size full messages being
>>>>>>>> returned to clients trying large batch puts, as it approaches a
>>>>>>>> gigabyte."
>>>>>>>>
>>>>>>>> Do you mean CallQueueTooBigException? That limit is not on the
>>>>>>>> number of queued calls but on the size of the data the clients
>>>>>>>> send, configured by "hbase.ipc.server.max.callqueue.size".
>>>>>>>>
>>>>>>>> I guess that when your client got the exception it closed the
>>>>>>>> connection, causing the other RPCs sharing that connection to fail.
>>>>>>>>
>>>>>>>> 2014-08-06 22:27:57,253 WARN [RpcServer.reader=9,port=60020]
>>>>>>>> ipc.RpcServer (RpcServer.java:doRead(794)) -
>>>>>>>> RpcServer.listener,port=60020: count of bytes read: 0
>>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>>>   at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>>>>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>>>>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>>>>>   at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>>>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>>>>>   at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
>>>>>>>>   at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
>>>>>>>>   at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
>>>>>>>>   at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
>>>>>>>>   at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
>>>>>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>> 2014-08-06 22:27:57,257 WARN [RpcServer.handler=18,port=60020]
>>>>>>>> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>>>>>>>> RpcServer.respondercallId: 84968 service: ClientService methodName:
>>>>>>>> Multi size: 17.7 K connection: 10.248.130.152:49780: output error
>>>>>>>> 2014-08-06 22:27:57,258 WARN [RpcServer.handler=18,port=60020]
>>>>>>>> ipc.RpcServer (CallRunner.java:run(135)) -
>>>>>>>> RpcServer.handler=18,port=60020: caught a ClosedChannelException,
>>>>>>>> this means that the server was processing a request but the client
>>>>>>>> went away. The error message was: null
>>>>>>>> 2014-08-06 22:27:57,260 WARN [RpcServer.handler=61,port=60020]
>>>>>>>> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>>>>>>>> RpcServer.respondercallId: 83907 service: ClientService methodName:
>>>>>>>> Multi size: 17.1 K connection: 10.248.1.56:53615: output error
>>>>>>>> 2014-08-06 22:27:57,263 WARN [RpcServer.handler=61,port=60020]
>>>>>>>> ipc.RpcServer (CallRunner.java:run(135)) -
>>>>>>>> RpcServer.handler=61,port=60020: caught a ClosedChannelException,
>>>>>>>> this means that the server was processing a request but the client
>>>>>>>> went away. The error message was: null
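For illustration, a rough sketch of the admission check being described,
with hypothetical names: the limit is on the aggregate byte size of queued
calls, not on their number, so a counter that leaks upward will eventually
reject every request no matter how idle the server is.

    import java.util.concurrent.atomic.AtomicLong;

    final class CallAdmissionSketch {
      // 1 GB, the 0.98-era default for hbase.ipc.server.max.callqueue.size
      static final long MAX_CALLQUEUE_BYTES = 1024L * 1024 * 1024;

      private final AtomicLong callQueueSize = new AtomicLong();

      void admit(long callBytes) {
        // the gate is aggregate bytes of queued call data, not call count
        if (callQueueSize.get() >= MAX_CALLQUEUE_BYTES) {
          // a real server answers with CallQueueTooBigException here
          throw new IllegalStateException("call queue is full");
        }
        // must be matched by a decrement when the call finishes
        callQueueSize.addAndGet(callBytes);
      }
    }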
>>>>>>>> On Fri, Aug 8, 2014 at 2:57 AM, Walter King wrote:
>>>>>>>>
>>>>>>>>> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
>>>>>>>>> http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
>>>>>>>>>
>>>>>>>>> These are logs from that particular server, and the debug dump
>>>>>>>>> from now (no restart in between). The times in the graph are
>>>>>>>>> Pacific, so it should be around 2014-08-06 22:25:00. I do see
>>>>>>>>> some exceptions around there.
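A side note on the client-side log at the top of the thread, as a sketch
only - these are the standard client settings behind the AsyncProcess lines,
and the values shown are illustrative, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ClientRetrySketch {
      public static Configuration tunedConf() {
        Configuration conf = HBaseConfiguration.create();
        // "attempt=10/35" in the AsyncProcess log: 35 is the retry budget
        conf.setInt("hbase.client.retries.number", 35);
        // per-RPC timeout, in milliseconds
        conf.setInt("hbase.rpc.timeout", 60000);
        // base pause between retries, in ms; scaled by a backoff table,
        // which is roughly where "retrying after 10043 ms" comes from
        conf.setLong("hbase.client.pause", 100);
        return conf;
      }
    }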