hbase-user mailing list archives

From Johannes Schaback <johannes.schab...@visual-meta.com>
Subject Re: single RegionServer stuck, causing cluster to hang
Date Fri, 22 Aug 2014 22:24:53 GMT
I haven't managed to pull a jstack of a stuck node yet (I will do that first
thing in the morning). But...

I just killed and restarted a RS and called /dump right away to see whether
the defaultRpcServer.handler instances are present. And yes, they are, from
0 to 29, even in consecutive order. I killed and restarted that RS again to
be sure. I also checked several healthy RSs, and /dump shows the complete
set of handler threads on them as well. So I believe it is reasonable to
assume that /dump does not miss threads.
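(For reference, the same check can be scripted against jstack directly; the
following is only a sketch, with the RS pid as a placeholder argument and our
handler count of 30 hard-coded.)

#!/usr/bin/env python
# Sketch: list which defaultRpcServer.handler threads show up in a jstack
# dump and which of the expected 0..N-1 are missing. N is
# hbase.regionserver.handler.count (30 on our cluster); the RS pid is
# passed as the first command-line argument and jstack is assumed on PATH.
import re
import subprocess
import sys

EXPECTED_HANDLERS = 30  # hbase.regionserver.handler.count

def handlers_in_jstack(pid):
    out = subprocess.check_output(["jstack", str(pid)]).decode("utf-8", "replace")
    # Thread names look like "defaultRpcServer.handler=27,queue=0,port=60020"
    return set(int(n) for n in re.findall(r"defaultRpcServer\.handler=(\d+),", out))

if __name__ == "__main__":
    present = handlers_in_jstack(sys.argv[1])
    print("present: %s" % sorted(present))
    print("missing: %s" % sorted(set(range(EXPECTED_HANDLERS)) - present))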

Coming back to the handlers of the broken RS in question: with handlers 27,
24, 21, 20, 18, 12, 11 and 6 surviving the freeze, this set of handlers

0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 13, 14, 15, 16, 17, 19, 22, 23, 25, 26, 28
and 29

apparently got lost somehow and therefore did not appear in /dump after the
freeze.

I grep'ed "defaultRpcServer.handler=" on the log from that particular RS. The
RS started at 15:35. After that, the handlers

6, 24, 0, 15, 28, 26, 7, 19, 21, 3, 5 and 23

make an appearance in error messages of HDFS related exceptions:

2014-08-22 13:54:57,470 WARN
 [defaultRpcServer.handler=6,queue=0,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 13:54:57,472 WARN
 [defaultRpcServer.handler=6,queue=0,port=60020] hdfs.DFSClient: Connection
failure: Failed to connect to /192.168.3.233:50010 for file
/hbase/data/default/image/dae1b8e3bfdb608571d09916bf0fa156/cf/866a773857654b0d83275dc4e4558be6
for block BP-1157637685-192.168.3.192-1382642140917:blk_1294949058_255264636:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:37835,
remote=/192.168.3.233:50010, for file
/hbase/data/default/image/dae1b8e3bfdb608571d09916bf0fa156/cf/866a773857654b0d83275dc4e4558be6,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1294949058_255264636
2014-08-22 13:56:58,525 WARN
 [defaultRpcServer.handler=24,queue=0,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 13:56:58,799 WARN
 [defaultRpcServer.handler=24,queue=0,port=60020] hdfs.DFSClient:
Connection failure: Failed to connect to /192.168.3.142:50010 for file
/hbase/data/default/image/cf493cbd9921ae6ca5e5281cc0718ca2/cf/7f9618dcdeae40ddbff21165d08e0a83
for block BP-1157637685-192.168.3.192-1382642140917:blk_1294072754_254292863:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:59268,
remote=/192.168.3.142:50010, for file
/hbase/data/default/image/cf493cbd9921ae6ca5e5281cc0718ca2/cf/7f9618dcdeae40ddbff21165d08e0a83,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1294072754_254292863
2014-08-22 14:08:01,632 WARN
 [defaultRpcServer.handler=0,queue=0,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 14:08:01,633 WARN
 [defaultRpcServer.handler=0,queue=0,port=60020] hdfs.DFSClient: Connection
failure: Failed to connect to /192.168.3.53:50010 for file
/hbase/data/default/image/af02bc7fb404f4c054dcd64b44b0e2a9/cf/a8881b6e2d5d41b3b56fd34fd4ca8ffd
for block BP-1157637685-192.168.3.192-1382642140917:blk_1294876123_255182439:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:60737,
remote=/192.168.3.53:50010, for file
/hbase/data/default/image/af02bc7fb404f4c054dcd64b44b0e2a9/cf/a8881b6e2d5d41b3b56fd34fd4ca8ffd,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1294876123_255182439
2014-08-22 14:11:34,192 WARN
 [defaultRpcServer.handler=15,queue=0,port=60020] ipc.RpcServer:
(responseTooSlow):
{"processingtimems":15432,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiReque
     st)","client":"192.168.3.54:52838
","starttimems":1408709478713,"queuetimems":0,"class":"HRegionServer","responsesize":99638,"method":"Multi"}
2014-08-22 14:22:21,847 WARN
 [defaultRpcServer.handler=28,queue=1,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 14:22:21,848 WARN
 [defaultRpcServer.handler=28,queue=1,port=60020] hdfs.DFSClient:
Connection failure: Failed to connect to /192.168.3.233:50010 for file
/hbase/data/default/image/1d33a251862055cd999078bbd10aa44c/cf/a7470351725f4cc192b6210bac9b7c44
for block BP-1157637685-192.168.3.192-1382642140917:blk_1294949316_255264902:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:44787,
remote=/192.168.3.233:50010, for file
/hbase/data/default/image/1d33a251862055cd999078bbd10aa44c/cf/a7470351725f4cc192b6210bac9b7c44,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1294949316_255264902
2014-08-22 14:23:22,628 WARN
 [defaultRpcServer.handler=26,queue=2,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 14:23:22,628 WARN
 [defaultRpcServer.handler=26,queue=2,port=60020] hdfs.DFSClient:
Connection failure: Failed to connect to /192.168.3.142:50010 for file
/hbase/data/default/image/4a9c830a4fe006f0a6af7418164dd86d/cf/47d6f2c3f7054d40aed66cae9787c464
for block BP-1157637685-192.168.3.192-1382642140917:blk_1293459594_253612452:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:37660,
remote=/192.168.3.142:50010, for file
/hbase/data/default/image/4a9c830a4fe006f0a6af7418164dd86d/cf/47d6f2c3f7054d40aed66cae9787c464,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1293459594_253612452
2014-08-22 14:25:35,003 WARN
 [defaultRpcServer.handler=7,queue=1,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 14:25:35,004 WARN
 [defaultRpcServer.handler=7,queue=1,port=60020] hdfs.DFSClient: Connection
failure: Failed to connect to /192.168.3.51:50010 for file
/hbase/data/default/run_automaton_cache/da03f8123004be32659e1c8a51afbbf8/cf/1daf234e740f4b00889bb60e574dc79b
for block BP-1157637685-192.168.3.192-1382642140917:blk_1294614349_254896345:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:53627,
remote=/192.168.3.51:50010, for file
/hbase/data/default/run_automaton_cache/da03f8123004be32659e1c8a51afbbf8/cf/1daf234e740f4b00889bb60e574dc79b,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1294614349_254896345
2014-08-22 14:25:46,831 WARN
 [defaultRpcServer.handler=19,queue=1,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 14:25:46,832 WARN
 [defaultRpcServer.handler=19,queue=1,port=60020] hdfs.DFSClient:
Connection failure: Failed to connect to /192.168.3.30:50010 for file
/hbase/data/default/image/1c3bab43e260ddb46a06cf04e293e386/cf/740f8240ac9a4abc9e5fcf6ec7df18bc
for block BP-1157637685-192.168.3.192-1382642140917:blk_1294151564_254380545:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:55660,
remote=/192.168.3.30:50010, for file
/hbase/data/default/image/1c3bab43e260ddb46a06cf04e293e386/cf/740f8240ac9a4abc9e5fcf6ec7df18bc,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1294151564_254380545
2014-08-22 14:28:22,395 WARN
 [defaultRpcServer.handler=26,queue=2,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 14:28:22,397 WARN
 [defaultRpcServer.handler=26,queue=2,port=60020] hdfs.DFSClient:
Connection failure: Failed to connect to /192.168.3.143:50010 for file
/hbase/data/default/image/b4b48e20a606c431e393b674f9279daf/cf/76ff3628306a48f187a285b2a21d9ac9
for block BP-1157637685-192.168.3.192-1382642140917:blk_1294308080_254554506:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:51759,
remote=/192.168.3.143:50010, for file
/hbase/data/default/image/b4b48e20a606c431e393b674f9279daf/cf/76ff3628306a48f187a285b2a21d9ac9,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1294308080_254554506
2014-08-22 14:30:29,395 WARN
 [defaultRpcServer.handler=21,queue=0,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 14:30:29,396 WARN
 [defaultRpcServer.handler=21,queue=0,port=60020] hdfs.DFSClient:
Connection failure: Failed to connect to /192.168.3.30:50010 for file
/hbase/data/default/image/77ceb80d73c065f1a4ba2ad3b7cf04a2/cf/151c244177ac42b5996af0c9052660cc
for block BP-1157637685-192.168.3.192-1382642140917:blk_1294365296_254618150:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:57238,
remote=/192.168.3.30:50010, for file
/hbase/data/default/image/77ceb80d73c065f1a4ba2ad3b7cf04a2/cf/151c244177ac42b5996af0c9052660cc,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1294365296_254618150
2014-08-22 14:31:34,016 WARN
 [defaultRpcServer.handler=6,queue=0,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 14:31:34,018 WARN
 [defaultRpcServer.handler=6,queue=0,port=60020] hdfs.DFSClient: Connection
failure: Failed to connect to /192.168.3.212:50010 for file
/hbase/data/default/image/45578853a5c807919578043c77151efa/cf/40c00d49a7494929a319467d82b383da
for block BP-1157637685-192.168.3.192-1382642140917:blk_1294930505_255244337:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:33973,
remote=/192.168.3.212:50010, for file
/hbase/data/default/image/45578853a5c807919578043c77151efa/cf/40c00d49a7494929a319467d82b383da,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1294930505_255244337
2014-08-22 14:33:55,994 WARN
 [defaultRpcServer.handler=3,queue=0,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 14:33:55,995 WARN
 [defaultRpcServer.handler=3,queue=0,port=60020] hdfs.DFSClient: Connection
failure: Failed to connect to /192.168.3.52:50010 for file
/hbase/data/default/image/1f07e0160fbba29a2c26104dd596639f/cf/7e36136497fc4a5a9889797a1dfb5d3b
for block BP-1157637685-192.168.3.192-1382642140917:blk_1294915607_255226575:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:41017,
remote=/192.168.3.52:50010, for file
/hbase/data/default/image/1f07e0160fbba29a2c26104dd596639f/cf/7e36136497fc4a5a9889797a1dfb5d3b,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1294915607_255226575
2014-08-22 14:44:14,301 WARN
 [defaultRpcServer.handler=3,queue=0,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 14:44:14,302 WARN
 [defaultRpcServer.handler=3,queue=0,port=60020] hdfs.DFSClient: Connection
failure: Failed to connect to /192.168.3.154:50010 for file
/hbase/data/default/image/bb082a5f098d1d2a95365545349c77a4/cf/1af06c6fd36045bab888b4ec18e45f0f
for block BP-1157637685-192.168.3.192-1382642140917:blk_1294911152_255221397:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:44711,
remote=/192.168.3.154:50010, for file
/hbase/data/default/image/bb082a5f098d1d2a95365545349c77a4/cf/1af06c6fd36045bab888b4ec18e45f0f,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1294911152_255221397
2014-08-22 14:49:14,780 WARN
 [defaultRpcServer.handler=5,queue=2,port=60020] ipc.RpcServer:
(responseTooSlow):
{"processingtimems":15494,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiReques
     t)","client":"192.168.3.11:53884
","starttimems":1408711739283,"queuetimems":0,"class":"HRegionServer","responsesize":483492,"method":"Multi"}
2014-08-22 14:50:37,900 WARN
 [defaultRpcServer.handler=7,queue=1,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 14:50:37,901 WARN
 [defaultRpcServer.handler=7,queue=1,port=60020] hdfs.DFSClient: Connection
failure: Failed to connect to /192.168.3.64:50010 for file
/hbase/data/default/image/a50d67b81c44c864265f5030c7c39959/cf/840ff51e591946c487413273d5341a24
for block BP-1157637685-192.168.3.192-1382642140917:blk_1294309355_254555905:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:45043,
remote=/192.168.3.64:50010, for file
/hbase/data/default/image/a50d67b81c44c864265f5030c7c39959/cf/840ff51e591946c487413273d5341a24,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1294309355_254555905
2014-08-22 15:09:35,289 WARN
 [defaultRpcServer.handler=23,queue=2,port=60020] hdfs.BlockReaderFactory:
I/O error constructing remote block reader.
2014-08-22 15:09:35,289 WARN
 [defaultRpcServer.handler=23,queue=2,port=60020] hdfs.DFSClient:
Connection failure: Failed to connect to /192.168.3.63:50010 for file
/hbase/data/default/image/1db3051c8d41943892c0230cb75bc1f2/cf/75ae96ea8ca2477e8f83291d7e1fe7cb
for block BP-1157637685-192.168.3.192-1382642140917:blk_1293205473_253317737:java.io.IOException:
Got error for OP_READ_BLOCK, self=/192.168.3.179:56326,
remote=/192.168.3.63:50010, for file
/hbase/data/default/image/1db3051c8d41943892c0230cb75bc1f2/cf/75ae96ea8ca2477e8f83291d7e1fe7cb,
for pool BP-1157637685-192.168.3.192-1382642140917 block 1293205473_253317737
2014-08-22 15:18:38,891 WARN
 [defaultRpcServer.handler=5,queue=2,port=60020] ipc.RpcServer:
(responseTooSlow):
{"processingtimems":26092,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiReques
     t)","client":"192.168.3.91:40189
","starttimems":1408713492722,"queuetimems":0,"class":"HRegionServer","responsesize":483492,"method":"Multi"}

What's interesting is that not all of the handlers that show these errors
are among those missing after the freeze.
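Spelled out as plain set arithmetic on the numbers quoted above (nothing here
that is not already in this mail):

# Handler numbers taken verbatim from the /dump output and log excerpt above.
surviving = {27, 24, 21, 20, 18, 12, 11, 6}               # still present after the freeze
missing = set(range(30)) - surviving                      # gone after the freeze
errored = {6, 24, 0, 15, 28, 26, 7, 19, 21, 3, 5, 23}     # seen in the HDFS warnings

print("errored and missing:   %s" % sorted(errored & missing))
print("errored but surviving: %s" % sorted(errored & surviving))  # 6, 21, 24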

Does that substantiate your conjecture?


BR, Johannes


On Fri, Aug 22, 2014 at 10:42 PM, Stack <stack@duboce.net> wrote:

> Are we losing handler threads, the workers that take from the pool we are
> blocked on?
>
> The attached thread dump has ten with non-sequential numbers:
>
> Thread 97 (defaultRpcServer.handler=27,queue=0,port=60020):
> Thread 94 (defaultRpcServer.handler=24,queue=0,port=60020):
> Thread 91 (defaultRpcServer.handler=21,queue=0,port=60020):
> Thread 90 (defaultRpcServer.handler=20,queue=2,port=60020):
> Thread 88 (defaultRpcServer.handler=18,queue=0,port=60020):
> Thread 82 (defaultRpcServer.handler=12,queue=0,port=60020):
> Thread 81 (defaultRpcServer.handler=11,queue=2,port=60020):
> Thread 76 (defaultRpcServer.handler=6,queue=0,port=60020):
>
> Perhaps this is an artifact of how the thread dump is being taken via the
> UI servlet.
>
> If you jstack, do you see hbase.regionserver.handler.count instances of
> defaultRpcServer going from 0 up to hbase.regionserver.handler.count?
>
> If handlers are not taking from the call queue, yeah, it will fill.
>
> St.Ack
>
>
>
>
> On Fri, Aug 22, 2014 at 12:54 PM, Stack <stack@duboce.net> wrote:
>
> > nvm, misread. Trying to figure out why the scheduling queue is filled to the
> > brim such that no more calls can be added/dispatched...
> > St.Ack
> >
> >
> > On Fri, Aug 22, 2014 at 12:45 PM, Stack <stack@duboce.net> wrote:
> >
> >> Are you replicating?
> >> St.Ack
> >>
> >>
> >> On Fri, Aug 22, 2014 at 10:28 AM, Johannes Schaback <
> >> johannes.schaback@visual-meta.com> wrote:
> >>
> >>> Dear HBase-Pros,
> >>>
> >>> we have been facing a serious issue with our HBase production cluster
> >>> for two days now. Every couple of minutes, a random RegionServer gets
> >>> stuck and does not process any requests. This in turn causes the other
> >>> RegionServers to freeze within a minute, which brings down the entire
> >>> cluster. Stopping the affected RegionServer unblocks the cluster and
> >>> everything comes back to normal.
> >>>
> >>> We run 27 RegionServers, each with 31 GB of JVM memory. The HBase
> >>> version is 0.98.5 on Hadoop 2.4.1. We basically have two tables: the
> >>> first has about 4,500 Regions, holds 8 TB and serves 1000 requests per
> >>> second; the second has around 200 Regions and 800 GB worth of data,
> >>> serves about 50,000 to 120,000 requests per second over all Regions,
> >>> and has IN_MEMORY enabled.
> >>>
> >>> While investigating the problem, I found that every healthy
> >>> RegionServer has the following thread:
> >>>
> >>> Thread 12 (RpcServer.listener,port=60020):
> >>>   State: RUNNABLE
> >>>   Blocked count: 35
> >>>   Waited count: 0
> >>>   Stack:
> >>>     sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> >>>     sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
> >>>     sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
> >>>     sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
> >>>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
> >>>     sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
> >>>
> >>> org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:684)
> >>>
> >>>
> >>> When a RegionServer suddenly becomes blocked, this particular thread
> >>> looks like this:
> >>>
> >>> Thread 12 (RpcServer.listener,port=60020):
> >>>   State: BLOCKED
> >>>   Blocked count: 2889
> >>>   Waited count: 0
> >>>   Blocked on
> >>> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader@38cba1a1
> >>>   Blocked by 14 (RpcServer.reader=1,port=60020)
> >>>   Stack:
> >>>
> >>>
> >>>
> org.apache.hadoop.hbase.ipc.RpcServer$Listener$Reader.registerChannel(RpcServer.java:619)
> >>>
> >>>
> >>>
> org.apache.hadoop.hbase.ipc.RpcServer$Listener.doAccept(RpcServer.java:774)
> >>>
> >>> org.apache.hadoop.hbase.ipc.RpcServer$Listener.run(RpcServer.java:692)
> >>>
> >>>
> >>> Also, for an unhealthy RegionServer, JMX shows that
> >>>
> >>>    - "queueSize" grows quickly and constantly to values greater than
> >>>      60k, and
> >>>    - "numCallsInGeneralQueue" quickly reaches 300
> >>>
> >>> Both values are usually very small or 0 under normal circumstances,
> >>> but when a RS "gets stuck" they explode, which leads me to believe
> >>> that the IPC queue does not get processed properly, causing the
> >>> RegionServer to become "deaf".
> >>>
> >>> These two symptoms appear to bring down the entire cluster. When we
> >>> kill that RS, everything goes back to normal.
> >>>
> >>> I could not find any correlation between this phenomenon and
> >>> compactions, load or other factors. hbck says it is all fine as well.
> >>>
> >>> The servers are all Debian (3.2.0-4-amd64) with 12 cores and 96 GB of
> >>> RAM. Besides the RS and a DataNode, there isn't much else running on
> >>> the boxes, so the load (top) is usually around 5 to 10 and bandwidth
> >>> does not exceed 10 MB on average.
> >>>
> >>> We currently survive by constantly polling /jmx on all RegionServers
> >>> and restarting those that show the symptoms :(
> >>>
> >>> Do you have any idea what could be causing this?
> >>>
> >>> Thank you very much in advance!
> >>>
> >>> Johannes
> >>>
> >>
> >>
> >
>



-- 
LadenZeile.de <http://www.ladenzeile.de/>
powered by Visual Meta GmbH - www.visual-meta.com

Tel.: +49 30 / 609 84 88 20
Fax: +49 30 / 609 84 88 21
E-Mail: johannes.schaback@visual-meta.com

Visual Meta GmbH, Schützenstraße 25, 10117 Berlin
Geschäftsführer: Robert M. Maier, Johannes Schaback
Handelsregister HRB 115795 B, Amtsgericht Charlottenburg
USt-IdNr.: DE263760203
