hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Baranau <alex.barano...@gmail.com>
Subject RegionServer crashed with ZK session expired
Date Thu, 09 Dec 2010 17:19:47 GMT
Hello guys,

I got regionserver crash and trying to find out why. I found
* in regionserver log [1] : ZK session expired, and before that slow hlog
edits;
* nothing in DataNode log [2] and HMaster[4] log;
* some warns in ZK log [3] with EndOfStreamException.

I wonder can it be because of long GC and thus "stop the world" pause for
about a minute (seen in logs)? Unfortunately GC wasn't logged (this is an
old test box). If it can be caused by GC I'd restart the server with GC
logging, but if not, I'd like to figure out the cause...

HBase version is 0.89.20100924.

Thank you,

Alex Baranau
----
Sematext :: http://sematext.com/

[1] Regionserver log: ZK session expired, before that slow hlog edits

2010-12-08 06:43:15,403 INFO org.apache.hadoop.hbase.regionserver.HRegion:
Starting compaction on region agg2,system|tq|s\x00|in licensed License
project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c.
2010-12-08 06:43:15,404 INFO org.apache.hadoop.hbase.regionserver.Store:
Started compaction of 4 file(s) in a of agg2,system|tq|s\x00|in licensed
License project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c.,
hasReferences=true, into
hdfs://localhost/hbase/agg2/75a3871bc968e0438e763cde7e313c2c/.tmp,
sequenceid=1153119174
2010-12-08 06:43:24,825 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
started; Attempting to free 39.93 MB of total=339.31 MB
2010-12-08 06:43:24,829 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
completed; freed=39.95 MB, total=299.36 MB, single=147.93 MB, multi=154.82
MB, memory=33.28 MB
2010-12-08 06:43:50,871 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
started; Attempting to free 40 MB of total=339.39 MB
2010-12-08 06:43:50,875 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
completed; freed=40.01 MB, total=299.37 MB, single=147.86 MB, multi=154.96
MB, memory=33.28 MB
2010-12-08 06:43:57,095 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
IPC Server handler 4 on 60020 took 5338ms appending an edit to hlog;
editcount=8882
2010-12-08 06:44:25,086 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
IPC Server handler 24 on 60020 took 5658ms appending an edit to hlog;
editcount=13054
2010-12-08 06:44:55,815 WARN org.apache.hadoop.hbase.regionserver.wal.HLog:
IPC Server handler 5 on 60020 took 5884ms appending an edit to hlog;
editcount=13472
2010-12-08 06:45:33,003 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=337.5 MB,
free=61.68 MB, max=399.18 MB, blocks=9015, accesses=553728227,
hits=40478965, hitRatio=7.31%%, evictions=135536, evicted=86306453,
evictedPerRun=636.77880859375
2010-12-08 06:45:45,333 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of
raw3,,1291780273350.73d076a68191712c68d16c4805973561. because global
memstore limit of 798.4m exceeded; currently 812.8m and flushing till 698.6m
2010-12-08 06:45:45,363 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Started memstore flush for region
raw3,,1291780273350.73d076a68191712c68d16c4805973561.. Current region
memstore size 45.0m
2010-12-08 06:45:45,364 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
Finished snapshotting, commencing flushing stores
2010-12-08 06:45:58,107 INFO org.apache.hadoop.hbase.regionserver.Store:
Renaming flushed file at
hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/.tmp/6597521672231455831
to
hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097847060580
2010-12-08 06:45:58,164 INFO org.apache.hadoop.hbase.regionserver.Store:
Added
hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097847060580,
entries=8356, sequenceid=1153215388, memsize=1.7m, filesize=73.2k to
raw3,,1291780273350.73d076a68191712c68d16c4805973561.
2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 52481ms for sessionid
0x12c34f1783c0002, closing socket connection and attempting reconnect
2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client session
timed out, have not heard from server in 52481ms for sessionid
0x12c34f1783c00bf, closing socket connection and attempting reconnect
2010-12-08 06:46:44,200 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Disconnected, type: None, path: null
2010-12-08 06:46:44,200 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Disconnected, type: None, path: null
2010-12-08 06:46:44,201 WARN
org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: No longer
connected to ZooKeeper, current state: Disconnected
2010-12-08 06:46:51,218 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181
2010-12-08 06:46:51,219 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to ip-10-245-202-207.ec2.internal/10.245.202.207:2181,
initiating session
2010-12-08 06:46:51,220 INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server ip-10-245-202-207.ec2.internal/
10.245.202.207:2181, sessionid = 0x12c34f1783c00bf, negotiated timeout =
40000
2010-12-08 06:46:51,264 INFO org.apache.zookeeper.ZooKeeper: Session:
0x12c34f1783c00bf closed
2010-12-08 06:46:51,264 DEBUG
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper:
<localhost:/hbase,org.apache.hadoop.hbase.client.HConnectionManager>Closed
connection with ZooKeeper; /hbase/root-region-server
2010-12-08 06:46:58,375 INFO org.apache.zookeeper.ClientCnxn: Opening socket
connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181
2010-12-08 06:46:58,376 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to ip-10-245-202-207.ec2.internal/10.245.202.207:2181,
initiating session
2010-12-08 06:46:58,490 INFO org.apache.zookeeper.ClientCnxn: Unable to
reconnect to ZooKeeper service, session 0x12c34f1783c0002 has expired,
closing socket connection
2010-12-08 06:46:58,491 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Expired, type: None, path: null
2010-12-08 06:46:58,491 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
expired
2010-12-08 06:46:58,491 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region server
serverName=ip-10-245-202-207.ec2.internal,60020,1289379196235,
load=(requests=73, regions=649, usedHeap=1983, maxHeap=1995): ZooKeeper
session expired


[2] DataNode log: nothing, just noticeable pause for almost 1 minute

2010-12-08 06:45:58,120 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
127.0.0.1:50010, dest: /127.0.0.1:39194, bytes: 200, op: HDFS_READ, cliID:
DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
offset: 74752, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
blockid: blk_6449721912917715607_913675, duration: 3639142
2010-12-08 06:45:58,134 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
127.0.0.1:50010, dest: /127.0.0.1:39195, bytes: 716, op: HDFS_READ, cliID:
DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
offset: 74240, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
blockid: blk_6449721912917715607_913675, duration: 427441
2010-12-08 06:45:58,148 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
127.0.0.1:50010, dest: /127.0.0.1:39193, bytes: 1014972, op: HDFS_READ,
cliID:
DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
offset: 66101760, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
blockid: blk_825672696763040280_900371, duration: 6352102264
2010-12-08 06:46:44,192 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
blk_-8677121983399284148_913676 src: /127.0.0.1:39196 dest: /127.0.0.1:50010
2010-12-08 06:46:51,263 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
127.0.0.1:50010, dest: /127.0.0.1:53722, bytes: 264192, op: HDFS_READ,
cliID:
DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
offset: 17161216, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
blockid: blk_8449446729722678098_531164, duration: 401874426400
2010-12-08 06:46:51,333 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
127.0.0.1:50010, dest: /127.0.0.1:53737, bytes: 264192, op: HDFS_READ,
cliID:
DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
offset: 38253056, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
blockid: blk_-4815658898725325280_624191, duration: 401614939902
2010-12-08 06:46:58,266 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
recoverBlock(block=blk_7609568532460932683_912614, targets=[127.0.0.1:50010
])

[3] Zookeeper log: some warns

2010-12-08 02:50:46,100 - INFO  [ProcessThread:-1:PrepRequestProcessor@385]
- Processed session termination for sessionid: 0x12c34f1783d6eba
2010-12-08 02:50:46,104 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
client /127.0.0.1:48900 which had sessionid 0x12c34f1783d6eba
2010-12-08 06:46:44,092 - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@633] - EndOfStreamException: Unable to
read additional data from client sessionid 0x12c34f1783c00bf, likely client
has closed socket
2010-12-08 06:46:44,661 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
client /10.245.202.207:53464 which had sessionid 0x12c34f1783c00bf
2010-12-08 06:46:44,662 - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@633] - EndOfStreamException: Unable to
read additional data from client sessionid 0x12c34f1783c0002, likely client
has closed socket
2010-12-08 06:46:44,662 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
client /10.245.202.207:37888 which had sessionid 0x12c34f1783c0002
2010-12-08 06:46:51,219 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@250] - Accepted socket connection
from /10.245.202.207:33388
2010-12-08 06:46:51,219 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@769] - Client attempting to renew session
0x12c34f1783c00bf at /10.245.202.207:33388
2010-12-08 06:46:51,220 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1579] - Established session
0x12c34f1783c00bf with negotiated timeout 40000 for client /
10.245.202.207:33388
2010-12-08 06:46:51,220 - INFO  [ProcessThread:-1:PrepRequestProcessor@385]
- Processed session termination for sessionid: 0x12c34f1783c00bf
2010-12-08 06:46:51,223 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
client /10.245.202.207:33388 which had sessionid 0x12c34f1783c00bf
2010-12-08 06:46:58,001 - INFO  [SessionTracker:ZooKeeperServer@315] -
Expiring session 0x12c34f1783c0002, timeout of 40000ms exceeded
2010-12-08 06:46:58,002 - INFO  [ProcessThread:-1:PrepRequestProcessor@385]
- Processed session termination for sessionid: 0x12c34f1783c0002
2010-12-08 06:46:58,376 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@250] - Accepted socket connection
from /10.245.202.207:33394
2010-12-08 06:46:58,489 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@769] - Client attempting to renew session
0x12c34f1783c0002 at /10.245.202.207:33394
2010-12-08 06:46:58,489 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1572] - Invalid session 0x12c34f1783c0002
for client /10.245.202.207:33394, probably expired
2010-12-08 06:46:58,490 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
client /10.245.202.207:33394 which had sessionid 0x12c34f1783c0002

[4] HMaster log: nothing

2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scan of 648 row(s) of meta region {server:
10.245.202.207:60020, regionname: .META.,,1.1028785192, startKey: <>}
complete
2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner: All
1 .META. region(s) scanned
2010-12-08 06:45:25,277 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.245.202.207:60020,
regionname: -ROOT-,,0.70236052, startKey: <>}
2010-12-08 06:45:27,132 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete
2010-12-08 06:45:35,367 INFO org.apache.hadoop.hbase.master.ServerManager: 1
region servers, 0 dead, average load 649.0
2010-12-08 06:45:45,563 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.metaScanner scanning meta region {server: 10.245.202.207:60020,
regionname: .META.,,1.1028785192, startKey: <>}
2010-12-08 06:46:25,244 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scanning meta region {server: 10.245.202.207:60020,
regionname: -ROOT-,,0.70236052, startKey: <>}
2010-12-08 06:46:35,369 INFO org.apache.hadoop.hbase.master.ServerManager: 1
region servers, 0 dead, average load 649.0
2010-12-08 06:46:44,154 INFO org.apache.hadoop.hbase.master.BaseScanner:
RegionManager.rootScanner scan of 1 row(s) of meta region {server:
10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>} complete
2010-12-08 06:46:58,005 INFO org.apache.hadoop.hbase.master.ServerManager:
ip-10-245-202-207.ec2.internal,60020,1289379196235 znode expired
2010-12-08 06:46:58,005 DEBUG org.apache.hadoop.hbase.master.ServerManager:
Added=ip-10-245-202-207.ec2.internal,60020,1289379196235 to dead servers,
added shutdown processing operation
2010-12-08 06:46:58,006 INFO org.apache.hadoop.hbase.master.RegionManager:
-ROOT- region unset (but not set to be reassigned)
2010-12-08 06:46:58,006 INFO org.apache.hadoop.hbase.master.RegionManager:
META region whose startkey is  removed from onlineMetaRegions
2010-12-08 06:46:58,006 DEBUG
org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing todo:
ProcessServerShutdown of ip-10-245-202-207.ec2.internal,60020,1289379196235
2010-12-08 06:46:58,006 INFO
org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of
server ip-10-245-202-207.ec2.internal,60020,1289379196235: logSplit: false,
rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 0
2010-12-08 06:46:58,012 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
Splitting 34 hlog(s) in
hdfs://localhost/hbase/.logs/ip-10-245-202-207.ec2.internal,60020,1289379196235

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message