Return-Path: X-Original-To: apmail-hbase-user-archive@www.apache.org Delivered-To: apmail-hbase-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id D53A011C9A for ; Sat, 23 Aug 2014 18:28:43 +0000 (UTC) Received: (qmail 43543 invoked by uid 500); 23 Aug 2014 18:28:36 -0000 Delivered-To: apmail-hbase-user-archive@hbase.apache.org Received: (qmail 43463 invoked by uid 500); 23 Aug 2014 18:28:36 -0000 Mailing-List: contact user-help@hbase.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@hbase.apache.org Delivered-To: mailing list user@hbase.apache.org Received: (qmail 43448 invoked by uid 99); 23 Aug 2014 18:28:36 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 23 Aug 2014 18:28:36 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,WEIRD_PORT X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of saint.ack@gmail.com designates 209.85.215.51 as permitted sender) Received: from [209.85.215.51] (HELO mail-la0-f51.google.com) (209.85.215.51) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 23 Aug 2014 18:28:09 +0000 Received: by mail-la0-f51.google.com with SMTP id pn19so11022399lab.24 for ; Sat, 23 Aug 2014 11:28:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:sender:in-reply-to:references:date:message-id:subject :from:to:content-type; bh=jvDMNe3Gu46bwipIw53k2Q8+jHVoqQXx7WYuFdga+HU=; b=xCr5j5pPKmTKNS52Yuc/yFArjKK3MuEiH+OLcfKHccCiHAT3JCW9vJq2jnR5TOhbBx 81VRhtJ9BlMH6aZgnqbF0rVWuzXbj8JjHvYLRZ6nN2vnkozVm3O0CP0Uzy/NzqzhQzW8 HWbd01h5YM0fLLZJhHqqxlBxe9+1+XS4lSV457SR4fYu8tIZyPtEeRbw6sgmxw9YXa7u at8ljR2SLO0dFlaga6vrrTjqOWTzFWhOZ4Wh/84RJ6FOc4OcU26GHJG8ZRAcaAN6wFXG qZHFgAQO4JwejIeXaYtAozAUZqcDeFobIFhdwC9jWsFT6OCQjKp0TT67YcNZPMGGEpzI yKnQ== MIME-Version: 1.0 X-Received: by 10.112.141.193 with SMTP id rq1mr10521427lbb.25.1408818488193; Sat, 23 Aug 2014 11:28:08 -0700 (PDT) Sender: saint.ack@gmail.com Received: by 10.152.148.170 with HTTP; Sat, 23 Aug 2014 11:28:08 -0700 (PDT) In-Reply-To: References: Date: Sat, 23 Aug 2014 11:28:08 -0700 X-Google-Sender-Auth: 3FvC1wG1UBw1yBIdCnp93G-2jp0 Message-ID: Subject: Re: single RegionServer stuck, causing cluster to hang From: Stack To: Hbase-User Content-Type: multipart/alternative; boundary=001a11c381628ab8110501501d04 X-Virus-Checked: Checked by ClamAV on apache.org --001a11c381628ab8110501501d04 Content-Type: text/plain; charset=UTF-8 Anything in your .out that could help explain our losing handlers if you can't find anything in the logs? You did the 'snipp' in the below, right Johannes? RS Configuration: =========================================================== [snipp] no fancy stuff, all default, except absolute necessary settings [snipp] As per Qian, if hbase.ipc.server.callqueue.handler.factor, that could help explain why we have handlers but they are not 'taking' from the call queue, they are stuck taking on those queues that do not have calls queued. St.Ack On Sat, Aug 23, 2014 at 2:56 AM, Qiang Tian wrote: > Did you set hbase.ipc.server.callqueue.handler.factor? > it looks there are 3 queues, handlers on queue 1 are all gone as Stack > mentioned. jstack and pastebin regions server log would help. > > > > > > On Sat, Aug 23, 2014 at 7:02 AM, Stack wrote: > > > On Fri, Aug 22, 2014 at 3:24 PM, Johannes Schaback < > > johannes.schaback@visual-meta.com> wrote: > > > > > ... > > > I grep'ed "defaultRpcServer.handler=" on the log from that particular > RS. > > > The > > > RS started at 15:35. After that, the handlers > > > > > > 6, 24, 0, 15, 28, 26, 7, 19, 21, 3, 5 and 23 > > > > > > make an appearance in error messages of HDFS related exceptions: > > > > > > 2014-08-22 13:54:57,470 WARN > > > [defaultRpcServer.handler=6,queue=0,port=60020] > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 13:54:57,472 WARN > > > [defaultRpcServer.handler=6,queue=0,port=60020] hdfs.DFSClient: > > Connection > > > failure: Failed to connect to /192.168.3.233:50010 for file > > > /hbase/data/default/image/dae1b8e3bfdb608571d09916bf0f > > > a156/cf/866a773857654b0d83275dc4e4558be6 for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1294949058_255264636:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:37835, remote=/ > > > 192.168.3.233:50 010, for file > > > > > > > > > /hbase/data/default/image/dae1b8e3bfdb608571d09916bf0fa156/cf/866a773857654b0d83275dc4e4558be6, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1294949058_255264636 > > > 2014-08-22 13:56:58,525 WARN > > > [defaultRpcServer.handler=24,queue=0,port=60020] > > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 13:56:58,799 WARN > > > [defaultRpcServer.handler=24,queue=0,port=60020] hdfs.DFSClient: > > > Connection failure: Failed to connect to /192.168.3.142:50010 for file > > > /hbase/data/default/image/cf493cbd9921ae6ca5e5281cc07 > > > 18ca2/cf/7f9618dcdeae40ddbff21165d08e0a83 for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1294072754_254292863:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:59268, remote=/ > > > 192.168.3.142:5 0010, for file > > > > > > > > > /hbase/data/default/image/cf493cbd9921ae6ca5e5281cc0718ca2/cf/7f9618dcdeae40ddbff21165d08e0a83, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1294072754_254292863 > > > 2014-08-22 14:08:01,632 WARN > > > [defaultRpcServer.handler=0,queue=0,port=60020] > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 14:08:01,633 WARN > > > [defaultRpcServer.handler=0,queue=0,port=60020] hdfs.DFSClient: > > Connection > > > failure: Failed to connect to /192.168.3.53:50010 for file > > > /hbase/data/default/image/af02bc7fb404f4c054dcd64b44b0e > > > 2a9/cf/a8881b6e2d5d41b3b56fd34fd4ca8ffd for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1294876123_255182439:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:60737, remote=/ > > > 192.168.3.53:5001 0, for file > > > > > > > > > /hbase/data/default/image/af02bc7fb404f4c054dcd64b44b0e2a9/cf/a8881b6e2d5d41b3b56fd34fd4ca8ffd, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1294876123_255182439 > > > 2014-08-22 14:11:34,192 WARN > > > [defaultRpcServer.handler=15,queue=0,port=60020] ipc.RpcServer: > > > (responseTooSlow): > > > > > > > > > {"processingtimems":15432,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiReque > > > st)","client":"192.168.3.54:52838 > > > > > > > > > ","starttimems":1408709478713,"queuetimems":0,"class":"HRegionServer","responsesize":99638,"method":"Multi"} > > > 2014-08-22 14:22:21,847 WARN > > > [defaultRpcServer.handler=28,queue=1,port=60020] > > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 14:22:21,848 WARN > > > [defaultRpcServer.handler=28,queue=1,port=60020] hdfs.DFSClient: > > > Connection failure: Failed to connect to /192.168.3.233:50010 for file > > > /hbase/data/default/image/1d33a251862055cd999078bbd10 > > > aa44c/cf/a7470351725f4cc192b6210bac9b7c44 for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1294949316_255264902:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:44787, remote=/ > > > 192.168.3.233:5 0010, for file > > > > > > > > > /hbase/data/default/image/1d33a251862055cd999078bbd10aa44c/cf/a7470351725f4cc192b6210bac9b7c44, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1294949316_255264902 > > > 2014-08-22 14:23:22,628 WARN > > > [defaultRpcServer.handler=26,queue=2,port=60020] > > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 14:23:22,628 WARN > > > [defaultRpcServer.handler=26,queue=2,port=60020] hdfs.DFSClient: > > > Connection failure: Failed to connect to /192.168.3.142:50010 for file > > > /hbase/data/default/image/4a9c830a4fe006f0a6af7418164 > > > dd86d/cf/47d6f2c3f7054d40aed66cae9787c464 for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1293459594_253612452:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:37660, remote=/ > > > 192.168.3.142:5 0010, for file > > > > > > > > > /hbase/data/default/image/4a9c830a4fe006f0a6af7418164dd86d/cf/47d6f2c3f7054d40aed66cae9787c464, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1293459594_253612452 > > > 2014-08-22 14:25:35,003 WARN > > > [defaultRpcServer.handler=7,queue=1,port=60020] > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 14:25:35,004 WARN > > > [defaultRpcServer.handler=7,queue=1,port=60020] hdfs.DFSClient: > > Connection > > > failure: Failed to connect to /192.168.3.51:50010 for file > > > /hbase/data/default/run_automaton_cache/da03f8123004be3 > > > 2659e1c8a51afbbf8/cf/1daf234e740f4b00889bb60e574dc79b for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1294614349_254896345:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:53627, remote=/192 > > > .168.3.51:50010, for file > > > > > > > > > /hbase/data/default/run_automaton_cache/da03f8123004be32659e1c8a51afbbf8/cf/1daf234e740f4b00889bb60e574dc79b, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1294614349_254896345 > > > 2014-08-22 14:25:46,831 WARN > > > [defaultRpcServer.handler=19,queue=1,port=60020] > > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 14:25:46,832 WARN > > > [defaultRpcServer.handler=19,queue=1,port=60020] hdfs.DFSClient: > > > Connection failure: Failed to connect to /192.168.3.30:50010 for file > > > /hbase/data/default/image/1c3bab43e260ddb46a06cf04e293 > > > e386/cf/740f8240ac9a4abc9e5fcf6ec7df18bc for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1294151564_254380545:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:55660, remote=/ > > > 192.168.3.30:500 10, for file > > > > > > > > > /hbase/data/default/image/1c3bab43e260ddb46a06cf04e293e386/cf/740f8240ac9a4abc9e5fcf6ec7df18bc, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1294151564_254380545 > > > 2014-08-22 14:28:22,395 WARN > > > [defaultRpcServer.handler=26,queue=2,port=60020] > > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 14:28:22,397 WARN > > > [defaultRpcServer.handler=26,queue=2,port=60020] hdfs.DFSClient: > > > Connection failure: Failed to connect to /192.168.3.143:50010 for file > > > /hbase/data/default/image/b4b48e20a606c431e393b674f92 > > > 79daf/cf/76ff3628306a48f187a285b2a21d9ac9 for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1294308080_254554506:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:51759, remote=/ > > > 192.168.3.143:5 0010, for file > > > > > > > > > /hbase/data/default/image/b4b48e20a606c431e393b674f9279daf/cf/76ff3628306a48f187a285b2a21d9ac9, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1294308080_254554506 > > > 2014-08-22 14:30:29,395 WARN > > > [defaultRpcServer.handler=21,queue=0,port=60020] > > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 14:30:29,396 WARN > > > [defaultRpcServer.handler=21,queue=0,port=60020] hdfs.DFSClient: > > > Connection failure: Failed to connect to /192.168.3.30:50010 for file > > > /hbase/data/default/image/77ceb80d73c065f1a4ba2ad3b7cf > > > 04a2/cf/151c244177ac42b5996af0c9052660cc for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1294365296_254618150:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:57238, remote=/ > > > 192.168.3.30:500 10, for file > > > > > > > > > /hbase/data/default/image/77ceb80d73c065f1a4ba2ad3b7cf04a2/cf/151c244177ac42b5996af0c9052660cc, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1294365296_254618150 > > > 2014-08-22 14:31:34,016 WARN > > > [defaultRpcServer.handler=6,queue=0,port=60020] > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 14:31:34,018 WARN > > > [defaultRpcServer.handler=6,queue=0,port=60020] hdfs.DFSClient: > > Connection > > > failure: Failed to connect to /192.168.3.212:50010 for file > > > /hbase/data/default/image/45578853a5c807919578043c7715 > > > 1efa/cf/40c00d49a7494929a319467d82b383da for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1294930505_255244337:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:33973, remote=/ > > > 192.168.3.212:50 010, for file > > > > > > > > > /hbase/data/default/image/45578853a5c807919578043c77151efa/cf/40c00d49a7494929a319467d82b383da, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1294930505_255244337 > > > 2014-08-22 14:33:55,994 WARN > > > [defaultRpcServer.handler=3,queue=0,port=60020] > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 14:33:55,995 WARN > > > [defaultRpcServer.handler=3,queue=0,port=60020] hdfs.DFSClient: > > Connection > > > failure: Failed to connect to /192.168.3.52:50010 for file > > > /hbase/data/default/image/1f07e0160fbba29a2c26104dd5966 > > > 39f/cf/7e36136497fc4a5a9889797a1dfb5d3b for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1294915607_255226575:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:41017, remote=/ > > > 192.168.3.52:5001 0, for file > > > > > > > > > /hbase/data/default/image/1f07e0160fbba29a2c26104dd596639f/cf/7e36136497fc4a5a9889797a1dfb5d3b, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1294915607_255226575 > > > 2014-08-22 14:44:14,301 WARN > > > [defaultRpcServer.handler=3,queue=0,port=60020] > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 14:44:14,302 WARN > > > [defaultRpcServer.handler=3,queue=0,port=60020] hdfs.DFSClient: > > Connection > > > failure: Failed to connect to /192.168.3.154:50010 for file > > > /hbase/data/default/image/bb082a5f098d1d2a95365545349c > > > 77a4/cf/1af06c6fd36045bab888b4ec18e45f0f for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1294911152_255221397:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:44711, remote=/ > > > 192.168.3.154:50 010, for file > > > > > > > > > /hbase/data/default/image/bb082a5f098d1d2a95365545349c77a4/cf/1af06c6fd36045bab888b4ec18e45f0f, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1294911152_255221397 > > > 77a4/cf/1af06c6fd36045bab888b4ec18e45f0f for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1294911152_255221397:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:44711, remote=/ > > > 192.168.3.154:50 010, for file > > > > > > > > > /hbase/data/default/image/bb082a5f098d1d2a95365545349c77a4/cf/1af06c6fd36045bab888b4ec18e45f0f, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1294911152_255221397 > > > 2014-08-22 14:49:14,780 WARN > > > [defaultRpcServer.handler=5,queue=2,port=60020] ipc.RpcServer: > > > (responseTooSlow): > > > > > > > > > {"processingtimems":15494,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiReques > > > t)","client":"192.168.3.11:53884 > > > > > > > > > ","starttimems":1408711739283,"queuetimems":0,"class":"HRegionServer","responsesize":483492,"method":"Multi"} > > > 2014-08-22 14:50:37,900 WARN > > > [defaultRpcServer.handler=7,queue=1,port=60020] > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 14:50:37,901 WARN > > > [defaultRpcServer.handler=7,queue=1,port=60020] hdfs.DFSClient: > > Connection > > > failure: Failed to connect to /192.168.3.64:50010 for file > > > /hbase/data/default/image/a50d67b81c44c864265f5030c7c39 > > > 959/cf/840ff51e591946c487413273d5341a24 for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1294309355_254555905:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:45043, remote=/ > > > 192.168.3.64:5001 0, for file > > > > > > > > > /hbase/data/default/image/a50d67b81c44c864265f5030c7c39959/cf/840ff51e591946c487413273d5341a24, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1294309355_254555905 > > > 2014-08-22 15:09:35,289 WARN > > > [defaultRpcServer.handler=23,queue=2,port=60020] > > hdfs.BlockReaderFactory: > > > I/O error constructing remote block reader. > > > 2014-08-22 15:09:35,289 WARN > > > [defaultRpcServer.handler=23,queue=2,port=60020] hdfs.DFSClient: > > > Connection failure: Failed to connect to /192.168.3.63:50010 for file > > > /hbase/data/default/image/1db3051c8d41943892c0230cb75b > > > c1f2/cf/75ae96ea8ca2477e8f83291d7e1fe7cb for block > > > > > > > > > BP-1157637685-192.168.3.192-1382642140917:blk_1293205473_253317737:java.io.IOException: > > > Got error for OP_READ_BLOCK, self=/192.168.3.179:56326, remote=/ > > > 192.168.3.63:500 10, for file > > > > > > > > > /hbase/data/default/image/1db3051c8d41943892c0230cb75bc1f2/cf/75ae96ea8ca2477e8f83291d7e1fe7cb, > > > for pool BP-1157637685-192.168.3.192-1382642140917 block > > > 1293205473_253317737 > > > 2014-08-22 15:18:38,891 WARN > > > [defaultRpcServer.handler=5,queue=2,port=60020] ipc.RpcServer: > > > (responseTooSlow): > > > > > > > > > {"processingtimems":26092,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiReques > > > t)","client":"192.168.3.91:40189 > > > > > > > > > ","starttimems":1408713492722,"queuetimems":0,"class":"HRegionServer","responsesize":483492,"method":"Multi"} > > > > > > What's interesting is that not all crashed handlers are later missing > > after > > > the freeze. > > > > > > Does that substantiate your conjecture? > > > > > > > > I saw those in your log. They are WARN from DFSClient which should be > ok. > > Any other exceptions going on in your logs that could explain our > handler > > loss? Or, can you see any thing particular about say the last mention of > > defaultRpcServer.handler=6 say? > > > > St.Ack > > > --001a11c381628ab8110501501d04--