Subject: Re: IPC Queue Size
From: Qiang Tian <tianq01@gmail.com>
To: user@hbase.apache.org
Date: Sat, 9 Aug 2014 17:53:04 +0800

please see HBASE-11714

On Sat, Aug 9, 2014 at 4:27 PM, Qiang Tian wrote:

> took a quick look. it looks like a bug; I will dig more and update later.
>
> On Sat, Aug 9, 2014 at 1:45 AM, Walter King wrote:
>
>> awesome, thanks!
>> the resets and such were actually what I've been trying to investigate.
>> We recently switched from 0.94 to 0.98 and are finding that things
>> periodically have issues - lots of retry exceptions. I don't have much
>> information yet; I was still looking into it when I saw this.
>>
>> By timestamp, this seems to be the corresponding client exception:
>>
>> 2014-08-08 17:22:43 o.a.h.h.c.AsyncProcess [INFO] #105158,
>> table=rt_global_monthly_campaign_deliveries, attempt=10/35 failed 500 ops,
>> last exception: java.net.SocketTimeoutException: Call to
>> ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020 failed
>> because java.net.SocketTimeoutException: 2000 millis timeout while waiting
>> for channel to be ready for read. ch :
>> java.nio.channels.SocketChannel[connected local=/10.248.130.152:46014
>> remote=ip-10-201-128-23.us-west-1.compute.internal/10.201.128.23:60020] on
>> ip-10-201-128-23.us-west-1.compute.internal,60020,1405642103651, tracking
>> started Fri Aug 08 17:21:55 UTC 2014, retrying after 10043 ms, replay 500
>> ops.
>>
>> 2014-08-08 17:22:53 o.a.h.h.c.AsyncProcess [INFO] #105158, waiting for
>> some tasks to finish. Expected max=0, tasksSent=11, tasksDone=10,
>> currentTasksDone=10, retries=10 hasError=false,
>> tableName=rt_global_monthly_campaign_deliveries
>>
>> On Fri, Aug 8, 2014 at 2:15 AM, Esteban Gutierrez wrote:
>>
>>> The RequestContext seems to be ok since it hasn't been set, so just
>>> decrementing the call size should be enough.
>>>
>>> cheers,
>>> esteban.
>>>
>>> --
>>> Cloudera, Inc.
>>>
>>> On Fri, Aug 8, 2014 at 2:10 AM, Esteban Gutierrez wrote:
>>>
>>>> Looks like after HBASE-10506 we didn't do what CallRunner used to do
>>>> before: clear the RequestContext and decrement callQueueSize.
>>>>
>>>> Created HBASE-11705 to address this issue.
>>>>
>>>> --
>>>> Cloudera, Inc.
>>>>
>>>> On Fri, Aug 8, 2014 at 1:56 AM, Qiang Tian wrote:
>>>>
>>>>> good point, that is a big suspect.
>>>>>
>>>>> I checked your log. The ClosedChannelException should be triggered by
>>>>> call.sendResponseIfReady() (it was the only request in the queue, so
>>>>> the handler sent the response directly), but at that point the
>>>>> callQueueSize had already been decremented.
>>>>>
>>>>> 2014-08-05 00:50:06,727 WARN [RpcServer.handler=57,port=60020]
>>>>> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>>>>> RpcServer.respondercallId: 118504 service: ClientService methodName:
>>>>> Multi size: 141.9 K connection: 10.248.134.67:55347: output error
>>>>> 2014-08-05 00:50:06,727 WARN [RpcServer.handler=57,port=60020]
>>>>> ipc.RpcServer (CallRunner.java:run(135)) -
>>>>> RpcServer.handler=57,port=60020: caught a ClosedChannelException, this
>>>>> means that the server was processing a request but the client went
>>>>> away. The error message was: null
>>>>>
>>>>> it looks like you have found the fix; would you file a jira?
>>>>> thanks.
>>>>>
>>>>> On Fri, Aug 8, 2014 at 2:41 PM, Walter King wrote:
>>>>>
>>>>>> I've only looked at the code a little and likely missed something,
>>>>>> but does this if block decrement the call queue if the client has
>>>>>> already closed the connection?
>>>>>>
>>>>>> https://github.com/apache/hbase/blob/07a771866f18e8ec532c14f624fa908815bd88c7/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/CallRunner.java#L74
>>>>>>
>>>>>> On Thu, Aug 7, 2014 at 11:32 PM, Walter King wrote:
>>>>>>
>>>>>>> Yes, sorry, CallQueueTooBigException. But that value never returns
>>>>>>> to zero, even when the number of requests goes to zero. The "call
>>>>>>> queue too big" errors happen once any regionserver has been up for
>>>>>>> a long enough period of time, so I have to restart them
>>>>>>> periodically. Also, at that 15:30 time I wasn't seeing that
>>>>>>> exception, but it seems like that was one point at which a call
>>>>>>> didn't properly decrement the callQueueSize, because it was at zero
>>>>>>> before and has never hit zero again - today the minimum is even
>>>>>>> higher.
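A minimal sketch of the accounting pattern at issue, with hypothetical names
rather than the actual CallRunner code: the callQueueSize decrement has to
sit on every exit path, including the early return taken when the client has
already gone away.

    import java.util.concurrent.atomic.AtomicLong;

    final class CallRunnerSketch {
      // aggregate bytes of queued call data, shared with the RPC scheduler
      private final AtomicLong callQueueSize = new AtomicLong();

      void run(Call call) {
        try {
          if (!call.connectionIsOpen()) {
            return; // early exit: without the finally below, call.size() leaks
          }
          call.execute();
        } finally {
          // runs on every path, early return and exceptions included
          callQueueSize.addAndGet(-call.size());
        }
      }

      interface Call {
        boolean connectionIsOpen();
        void execute();
        long size();
      }
    }

If the early return skips the decrement, each request from an
already-disconnected client permanently inflates the counter, which matches
the never-returns-to-zero behavior described above.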
>>>>>>> On Thu, Aug 7, 2014 at 9:14 PM, Qiang Tian wrote:
>>>>>>>
>>>>>>>> bq. "Eventually we ran into ipc queue size full messages being
>>>>>>>> returned to clients trying large batch puts, as it approaches a
>>>>>>>> gigabyte."
>>>>>>>>
>>>>>>>> Do you mean CallQueueTooBigException? That limit is not on the
>>>>>>>> number of queued calls but on the size of the data the clients
>>>>>>>> send, configured by "hbase.ipc.server.max.callqueue.size".
>>>>>>>>
>>>>>>>> I guess that when your client got the exception it closed the
>>>>>>>> connection, causing the other RPCs sharing that connection to fail.
>>>>>>>>
>>>>>>>> 2014-08-06 22:27:57,253 WARN [RpcServer.reader=9,port=60020]
>>>>>>>> ipc.RpcServer (RpcServer.java:doRead(794)) -
>>>>>>>> RpcServer.listener,port=60020: count of bytes read: 0
>>>>>>>> java.io.IOException: Connection reset by peer
>>>>>>>>   at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>>>>>>>>   at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>>>>>>>>   at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>>>>>>>>   at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>>>>   at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>>>>>>>>   at org.apache.hadoop.hbase.ipc.RpcServer.channelRead(RpcServer.java:2229)
>>>>>>>>   at org.apache.hadoop.hbase.ipc.RpcServer$Connection.readAndProcess(RpcServer.java:1415)
>>>>>>>>   at org.apache.hadoop.hbase.ipc.RpcServer$Listener.doRead(RpcServer.java:790)
>>>>>>>>   at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.doRunLoop(RpcServer.java:581)
>>>>>>>>   at org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.run(RpcServer.java:556)
>>>>>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>>>>>   at java.lang.Thread.run(Thread.java:744)
>>>>>>>> 2014-08-06 22:27:57,257 WARN [RpcServer.handler=18,port=60020]
>>>>>>>> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>>>>>>>> RpcServer.respondercallId: 84968 service: ClientService methodName:
>>>>>>>> Multi size: 17.7 K connection: 10.248.130.152:49780: output error
>>>>>>>> 2014-08-06 22:27:57,258 WARN [RpcServer.handler=18,port=60020]
>>>>>>>> ipc.RpcServer (CallRunner.java:run(135)) -
>>>>>>>> RpcServer.handler=18,port=60020: caught a ClosedChannelException,
>>>>>>>> this means that the server was processing a request but the client
>>>>>>>> went away. The error message was: null
>>>>>>>> 2014-08-06 22:27:57,260 WARN [RpcServer.handler=61,port=60020]
>>>>>>>> ipc.RpcServer (RpcServer.java:processResponse(1041)) -
>>>>>>>> RpcServer.respondercallId: 83907 service: ClientService methodName:
>>>>>>>> Multi size: 17.1 K connection: 10.248.1.56:53615: output error
>>>>>>>> 2014-08-06 22:27:57,263 WARN [RpcServer.handler=61,port=60020]
>>>>>>>> ipc.RpcServer (CallRunner.java:run(135)) -
>>>>>>>> RpcServer.handler=61,port=60020: caught a ClosedChannelException,
>>>>>>>> this means that the server was processing a request but the client
>>>>>>>> went away. The error message was: null
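For illustration, a rough sketch of the admission check being described,
with hypothetical names: the limit is on the aggregate byte size of queued
calls, not on their number, so a counter that leaks upward will eventually
reject every request no matter how idle the server is.

    import java.util.concurrent.atomic.AtomicLong;

    final class CallAdmissionSketch {
      // 1 GB, the 0.98-era default for hbase.ipc.server.max.callqueue.size
      static final long MAX_CALLQUEUE_BYTES = 1024L * 1024 * 1024;

      private final AtomicLong callQueueSize = new AtomicLong();

      void admit(long callBytes) {
        // the gate is aggregate bytes of queued call data, not call count
        if (callQueueSize.get() >= MAX_CALLQUEUE_BYTES) {
          // a real server answers with CallQueueTooBigException here
          throw new IllegalStateException("call queue is full");
        }
        // must be matched by a decrement when the call finishes
        callQueueSize.addAndGet(callBytes);
      }
    }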
>>>>>>>> On Fri, Aug 8, 2014 at 2:57 AM, Walter King wrote:
>>>>>>>>
>>>>>>>>> https://gist.github.com/walterking/4c5c6f5e5e4a4946a656#file-gistfile1-txt
>>>>>>>>> http://adroll-test-sandbox.s3.amazonaws.com/regionserver.stdout.log.gz
>>>>>>>>>
>>>>>>>>> These are logs from that particular server, and the debug dump
>>>>>>>>> from now (no restart in between). The times in the graph are
>>>>>>>>> Pacific, so it should be around 2014-08-06 22:25:00. I do see
>>>>>>>>> some exceptions around there.
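A side note on the client-side log at the top of the thread, as a sketch
only - these are the standard client settings behind the AsyncProcess lines,
and the values shown are illustrative, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ClientRetrySketch {
      public static Configuration tunedConf() {
        Configuration conf = HBaseConfiguration.create();
        // "attempt=10/35" in the AsyncProcess log: 35 is the retry budget
        conf.setInt("hbase.client.retries.number", 35);
        // per-RPC timeout, in milliseconds
        conf.setInt("hbase.rpc.timeout", 60000);
        // base pause between retries, in ms; scaled by a backoff table,
        // which is roughly where "retrying after 10043 ms" comes from
        conf.setLong("hbase.client.pause", 100);
        return conf;
      }
    }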