hbase-user mailing list archives

From Ted Yu <yuzhih...@gmail.com>
Subject Re: single RegionServer stuck, causing cluster to hang
Date Sat, 23 Aug 2014 19:39:24 GMT
Can you show the complete stack trace for the StackOverflowError (using pastebin)?

Thanks

On Aug 23, 2014, at 12:11 PM, Johannes Schaback <johannes.schaback@visual-meta.com>
wrote:

> Hi,
> 
> we had to reduce load on the cluster last night, which reduced the
> frequency of the phenomenon. That is why I could not get a jstack dump yet:
> it has not occurred for a couple of hours. We will now bring the load
> back up, hoping to trigger it again.
> 
> Yes, I cut the properties out of the /debug dump because they are all
> standard. We have hbase.ipc.server.callqueue.handler.factor at the default:
> 
> <property>
> <name>hbase.ipc.server.callqueue.handler.factor</name>
> <value>0.1</value>
> <source>hbase-default.xml</source>
> </property>
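[Editorial note: as a rough sketch of what this factor does, the number of call queues is derived from the handler count; the formula below is modeled on HBase's SimpleRpcScheduler and the 30-handler count is an assumption for illustration, not taken from this cluster's config.]

```java
// Sketch (assumption): how hbase.ipc.server.callqueue.handler.factor
// maps the handler count to a call-queue count.
public class CallQueueMath {

    // numCallQueues ~= max(1, round(handlerCount * factor))
    static int numCallQueues(int handlerCount, float factor) {
        return Math.max(1, Math.round(handlerCount * factor));
    }

    public static void main(String[] args) {
        // With an assumed 30 handlers and the default factor of 0.1 you get
        // 3 queues, matching the queue=0..2 seen in the thread names below.
        System.out.println(numCallQueues(30, 0.1f));
    }
}
```

With a factor of 0 there is a single shared queue; a positive factor shards calls across several queues, each served only by the handlers bound to it.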
> 
> You find the complete config of the RS here: http://pastebin.com/iF1ibFb1
> 
> The hint about the .out files was a great one (I had never really looked at
> them before). The .out files are flooded with StackOverflowErrors:
> 
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
>        at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>        at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>        at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>        at org.apache.hadoop.hbase.CellUtil$1.advance(CellUtil.java:210)
>        (and so on...)
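[Editorial note: the repeating CellUtil$1.advance frames are the signature of unbounded recursion. The class below is a minimal illustration of that shape, not HBase code; it shows why the handler thread dies with StackOverflowError rather than an exception a catch(Exception) block would see.]

```java
// Minimal sketch (not HBase code): a method that advances by calling
// itself with no base case, the shape of the repeated frames above.
public class AdvanceLoop {

    static boolean advance() {
        return advance(); // unbounded recursion, no terminating condition
    }

    public static void main(String[] args) {
        try {
            advance();
        } catch (StackOverflowError e) {
            // StackOverflowError is an Error, not an Exception, so an RPC
            // handler catching only Exception dies and is never replaced.
            System.out.println("handler thread would die here");
        }
    }
}
```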
> 
> Filtering the .out file for "Exception" shows that several handlers crashed
> like that:
> 
> Exception in thread "defaultRpcServer.handler=5,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=18,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=23,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=24,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=2,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=11,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=25,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=20,queue=2,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=19,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=15,queue=0,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=1,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=7,queue=1,port=60020"
> java.lang.StackOverflowError
> Exception in thread "defaultRpcServer.handler=4,queue=1,port=60020"
> java.lang.StackOverflowError
> 
> Unfortunately, the exceptions are not timestamped, so I cannot correlate
> their occurrence with the exact time when the RS starts filling up the
> queue.
> 
> 
> On Sat, Aug 23, 2014 at 8:28 PM, Stack <stack@duboce.net> wrote:
> 
>> Anything in your .out that could help explain our losing handlers if you
>> can't find anything in the logs?
>> 
>> You did the 'snipp' in the below, right Johannes?
>> 
>> RS Configuration:
>> ===========================================================
>> <?xml version="1.0" encoding="UTF-8" standalone="no"?><configuration>
>> [snipp] no fancy stuff, all default, except absolute necessary settings
>> [snipp]
>> </configuration>
>> 
>> As per Qiang, if hbase.ipc.server.callqueue.handler.factor is set, that
>> could help explain why we have handlers but they are not 'taking' from the
>> call queue; they are stuck taking on those queues that do not have calls
>> queued.
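[Editorial note: a small sketch of the starvation Stack describes, under the assumption that each handler blocks on one fixed queue. If the handlers bound to a queue have died, calls routed to that queue sit forever while surviving handlers block on their own empty queues; poll(timeout) stands in for the indefinite take().]

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Sketch (assumption): per-queue handler binding, and what happens when
// one queue's handlers are all dead.
public class QueueStarvation {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue0 = new LinkedBlockingQueue<>(); // handlers alive
        BlockingQueue<String> queue1 = new LinkedBlockingQueue<>(); // handlers dead

        queue1.add("call-A"); // a call routed to the orphaned queue

        // A surviving handler bound to queue 0 only ever drains queue 0;
        // it never sees queue 1's backlog.
        String work = queue0.poll(100, TimeUnit.MILLISECONDS);
        System.out.println("handler got: " + work);       // null: own queue empty
        System.out.println("stuck calls: " + queue1.size()); // 1: never drained
    }
}
```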
>> 
>> St.Ack
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Sat, Aug 23, 2014 at 2:56 AM, Qiang Tian <tianq01@gmail.com> wrote:
>> 
>>> Did you set hbase.ipc.server.callqueue.handler.factor?
>>> It looks like there are 3 queues, and the handlers on queue 1 are all gone,
>>> as Stack mentioned. A jstack dump and a pastebin of the region server log
>>> would help.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Sat, Aug 23, 2014 at 7:02 AM, Stack <stack@duboce.net> wrote:
>>> 
>>>> On Fri, Aug 22, 2014 at 3:24 PM, Johannes Schaback <
>>>> johannes.schaback@visual-meta.com> wrote:
>>>> 
>>>>> ...
>>>>> I grep'ed "defaultRpcServer.handler=" on the log from that particular
>>> RS.
>>>>> The
>>>>> RS started at 15:35. After that, the handlers
>>>>> 
>>>>> 6, 24, 0, 15, 28, 26, 7, 19, 21, 3, 5 and 23
>>>>> 
>>>>> make an appearance in error messages of HDFS related exceptions:
>>>>> 
>>>>> 2014-08-22 13:54:57,470 WARN
>>>>> [defaultRpcServer.handler=6,queue=0,port=60020]
>>> hdfs.BlockReaderFactory:
>>>>> I/O error constructing remote block reader.
>>>>> 2014-08-22 13:54:57,472 WARN
>>>>> [defaultRpcServer.handler=6,queue=0,port=60020] hdfs.DFSClient:
>>>> Connection
>>>>> failure: Failed to connect to /192.168.3.233:50010 for file
>>>>> /hbase/data/default/image/dae1b8e3bfdb608571d09916bf0f
>>>>> a156/cf/866a773857654b0d83275dc4e4558be6 for block
>> BP-1157637685-192.168.3.192-1382642140917:blk_1294949058_255264636:java.io.IOException:
>>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:37835, remote=/
>>>>> 192.168.3.233:50010, for file
>> /hbase/data/default/image/dae1b8e3bfdb608571d09916bf0fa156/cf/866a773857654b0d83275dc4e4558be6,
>>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block
>>>>> 1294949058_255264636
>>>>> 2014-08-22 13:56:58,525 WARN
>>>>> [defaultRpcServer.handler=24,queue=0,port=60020]
>>>> hdfs.BlockReaderFactory:
>>>>> I/O error constructing remote block reader.
>>>>> 2014-08-22 13:56:58,799 WARN
>>>>> [defaultRpcServer.handler=24,queue=0,port=60020] hdfs.DFSClient:
>>>>> Connection failure: Failed to connect to /192.168.3.142:50010 for
>> file
>>>>> /hbase/data/default/image/cf493cbd9921ae6ca5e5281cc07
>>>>> 18ca2/cf/7f9618dcdeae40ddbff21165d08e0a83 for block
>> BP-1157637685-192.168.3.192-1382642140917:blk_1294072754_254292863:java.io.IOException:
>>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:59268, remote=/
>>>>> 192.168.3.142:50010, for file
>> /hbase/data/default/image/cf493cbd9921ae6ca5e5281cc0718ca2/cf/7f9618dcdeae40ddbff21165d08e0a83,
>>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block
>>>>> 1294072754_254292863
>>>>> 2014-08-22 14:08:01,632 WARN
>>>>> [defaultRpcServer.handler=0,queue=0,port=60020]
>>> hdfs.BlockReaderFactory:
>>>>> I/O error constructing remote block reader.
>>>>> 2014-08-22 14:08:01,633 WARN
>>>>> [defaultRpcServer.handler=0,queue=0,port=60020] hdfs.DFSClient:
>>>> Connection
>>>>> failure: Failed to connect to /192.168.3.53:50010 for file
>>>>> /hbase/data/default/image/af02bc7fb404f4c054dcd64b44b0e
>>>>> 2a9/cf/a8881b6e2d5d41b3b56fd34fd4ca8ffd for block
>> BP-1157637685-192.168.3.192-1382642140917:blk_1294876123_255182439:java.io.IOException:
>>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:60737, remote=/
>>>>> 192.168.3.53:50010, for file
>> /hbase/data/default/image/af02bc7fb404f4c054dcd64b44b0e2a9/cf/a8881b6e2d5d41b3b56fd34fd4ca8ffd,
>>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block
>>>>> 1294876123_255182439
>>>>> 2014-08-22 14:11:34,192 WARN
>>>>> [defaultRpcServer.handler=15,queue=0,port=60020] ipc.RpcServer:
>>>>> (responseTooSlow):
>> {"processingtimems":15432,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","client":"192.168.3.54:52838
>> ","starttimems":1408709478713,"queuetimems":0,"class":"HRegionServer","responsesize":99638,"method":"Multi"}
>>>>> 2014-08-22 14:22:21,847 WARN
>>>>> [defaultRpcServer.handler=28,queue=1,port=60020]
>>>> hdfs.BlockReaderFactory:
>>>>> I/O error constructing remote block reader.
>>>>> 2014-08-22 14:22:21,848 WARN
>>>>> [defaultRpcServer.handler=28,queue=1,port=60020] hdfs.DFSClient:
>>>>> Connection failure: Failed to connect to /192.168.3.233:50010 for
>> file
>>>>> /hbase/data/default/image/1d33a251862055cd999078bbd10
>>>>> aa44c/cf/a7470351725f4cc192b6210bac9b7c44 for block
>> BP-1157637685-192.168.3.192-1382642140917:blk_1294949316_255264902:java.io.IOException:
>>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:44787, remote=/
>>>>> 192.168.3.233:50010, for file
>> /hbase/data/default/image/1d33a251862055cd999078bbd10aa44c/cf/a7470351725f4cc192b6210bac9b7c44,
>>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block
>>>>> 1294949316_255264902
>>>>> 2014-08-22 14:23:22,628 WARN
>>>>> [defaultRpcServer.handler=26,queue=2,port=60020]
>>>> hdfs.BlockReaderFactory:
>>>>> I/O error constructing remote block reader.
>>>>> 2014-08-22 14:23:22,628 WARN
>>>>> [defaultRpcServer.handler=26,queue=2,port=60020] hdfs.DFSClient:
>>>>> Connection failure: Failed to connect to /192.168.3.142:50010 for
>> file
>>>>> /hbase/data/default/image/4a9c830a4fe006f0a6af7418164
>>>>> dd86d/cf/47d6f2c3f7054d40aed66cae9787c464 for block
>> BP-1157637685-192.168.3.192-1382642140917:blk_1293459594_253612452:java.io.IOException:
>>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:37660, remote=/
>>>>> 192.168.3.142:50010, for file
>> /hbase/data/default/image/4a9c830a4fe006f0a6af7418164dd86d/cf/47d6f2c3f7054d40aed66cae9787c464,
>>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block
>>>>> 1293459594_253612452
>>>>> 2014-08-22 14:25:35,003 WARN
>>>>> [defaultRpcServer.handler=7,queue=1,port=60020]
>>> hdfs.BlockReaderFactory:
>>>>> I/O error constructing remote block reader.
>>>>> 2014-08-22 14:25:35,004 WARN
>>>>> [defaultRpcServer.handler=7,queue=1,port=60020] hdfs.DFSClient:
>>>> Connection
>>>>> failure: Failed to connect to /192.168.3.51:50010 for file
>>>>> /hbase/data/default/run_automaton_cache/da03f8123004be3
>>>>> 2659e1c8a51afbbf8/cf/1daf234e740f4b00889bb60e574dc79b for block
>> BP-1157637685-192.168.3.192-1382642140917:blk_1294614349_254896345:java.io.IOException:
>>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:53627, remote=/192
>>>>> .168.3.51:50010, for file
>> /hbase/data/default/run_automaton_cache/da03f8123004be32659e1c8a51afbbf8/cf/1daf234e740f4b00889bb60e574dc79b,
>>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block
>>>>> 1294614349_254896345
>>>>> 2014-08-22 14:25:46,831 WARN
>>>>> [defaultRpcServer.handler=19,queue=1,port=60020]
>>>> hdfs.BlockReaderFactory:
>>>>> I/O error constructing remote block reader.
>>>>> 2014-08-22 14:25:46,832 WARN
>>>>> [defaultRpcServer.handler=19,queue=1,port=60020] hdfs.DFSClient:
>>>>> Connection failure: Failed to connect to /192.168.3.30:50010 for
>> file
>>>>> /hbase/data/default/image/1c3bab43e260ddb46a06cf04e293
>>>>> e386/cf/740f8240ac9a4abc9e5fcf6ec7df18bc for block
>> BP-1157637685-192.168.3.192-1382642140917:blk_1294151564_254380545:java.io.IOException:
>>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:55660, remote=/
>>>>> 192.168.3.30:50010, for file
>> /hbase/data/default/image/1c3bab43e260ddb46a06cf04e293e386/cf/740f8240ac9a4abc9e5fcf6ec7df18bc,
>>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block
>>>>> 1294151564_254380545
>>>>> 2014-08-22 14:28:22,395 WARN
>>>>> [defaultRpcServer.handler=26,queue=2,port=60020]
>>>> hdfs.BlockReaderFactory:
>>>>> I/O error constructing remote block reader.
>>>>> 2014-08-22 14:28:22,397 WARN
>>>>> [defaultRpcServer.handler=26,queue=2,port=60020] hdfs.DFSClient:
>>>>> Connection failure: Failed to connect to /192.168.3.143:50010 for
>> file
>>>>> /hbase/data/default/image/b4b48e20a606c431e393b674f92
>>>>> 79daf/cf/76ff3628306a48f187a285b2a21d9ac9 for block
>> BP-1157637685-192.168.3.192-1382642140917:blk_1294308080_254554506:java.io.IOException:
>>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:51759, remote=/
>>>>> 192.168.3.143:50010, for file
>> /hbase/data/default/image/b4b48e20a606c431e393b674f9279daf/cf/76ff3628306a48f187a285b2a21d9ac9,
>>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block
>>>>> 1294308080_254554506
>>>>> 2014-08-22 14:30:29,395 WARN
>>>>> [defaultRpcServer.handler=21,queue=0,port=60020]
>>>> hdfs.BlockReaderFactory:
>>>>> I/O error constructing remote block reader.
>>>>> 2014-08-22 14:30:29,396 WARN
>>>>> [defaultRpcServer.handler=21,queue=0,port=60020] hdfs.DFSClient:
>>>>> Connection failure: Failed to connect to /192.168.3.30:50010 for
>> file
>>>>> /hbase/data/default/image/77ceb80d73c065f1a4ba2ad3b7cf
>>>>> 04a2/cf/151c244177ac42b5996af0c9052660cc for block
>> BP-1157637685-192.168.3.192-1382642140917:blk_1294365296_254618150:java.io.IOException:
>>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:57238, remote=/
>>>>> 192.168.3.30:50010, for file
>> /hbase/data/default/image/77ceb80d73c065f1a4ba2ad3b7cf04a2/cf/151c244177ac42b5996af0c9052660cc,
>>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block
>>>>> 1294365296_254618150
>>>>> 2014-08-22 14:31:34,016 WARN
>>>>> [defaultRpcServer.handler=6,queue=0,port=60020]
>>> hdfs.BlockReaderFactory:
>>>>> I/O error constructing remote block reader.
>>>>> 2014-08-22 14:31:34,018 WARN
>>>>> [defaultRpcServer.handler=6,queue=0,port=60020] hdfs.DFSClient:
>>>> Connection
>>>>> failure: Failed to connect to /192.168.3.212:50010 for file
>>>>> /hbase/data/default/image/45578853a5c807919578043c7715
>>>>> 1efa/cf/40c00d49a7494929a319467d82b383da for block
>> BP-1157637685-192.168.3.192-1382642140917:blk_1294930505_255244337:java.io.IOException:
>>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:33973, remote=/
>>>>> 192.168.3.212:50010, for file
>> /hbase/data/default/image/45578853a5c807919578043c77151efa/cf/40c00d49a7494929a319467d82b383da,
>>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block
>>>>> 1294930505_255244337
>>>>> 2014-08-22 14:33:55,994 WARN
>>>>> [defaultRpcServer.handler=3,queue=0,port=60020]
>>> hdfs.BlockReaderFactory:
>>>>> I/O error constructing remote block reader.
>>>>> 2014-08-22 14:33:55,995 WARN
>>>>> [defaultRpcServer.handler=3,queue=0,port=60020] hdfs.DFSClient:
>>>> Connection
>>>>> failure: Failed to connect to /192.168.3.52:50010 for file
>>>>> /hbase/data/default/image/1f07e0160fbba29a2c26104dd5966
>>>>> 39f/cf/7e36136497fc4a5a9889797a1dfb5d3b for block
>> BP-1157637685-192.168.3.192-1382642140917:blk_1294915607_255226575:java.io.IOException:
>>>>> Got error for OP_READ_BLOCK, self=/192.168.3.179:41017, remote=/
>>>>> 192.168.3.52:50010, for file
>> /hbase/data/default/image/1f07e0160fbba29a2c26104dd596639f/cf/7e36136497fc4a5a9889797a1dfb5d3b,
>>>>> for pool BP-1157637685-192.168.3.192-1382642140917 block
>>>>> 1294915607_255226575
>>>>> 2014-08-22 14:44:14,301 WARN
>>>>> [defaultRpcServer.handler=3,queue=0,port=60020]
>>> hdfs.BlockReaderFactory:
>>>>> I/O error constructing remote block reader.
>>>>> 2014-08-22 14:44:14,302 WARN
>>>>> [defaultRpcServer.handler=3,queue=0,port=60020] hdfs.DFSClient:
>>>> Connection
>>>>> failure: Failed to connec
