hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Alex Baranau <alex.barano...@gmail.com>
Subject Re: RegionServer crashed with ZK session expired
Date Fri, 10 Dec 2010 11:45:01 GMT
Hi J-D,

Thank you for response. Right, I see in logs the pause. I see it in both ZK
log and regionserver (I pasted respective parts of logs with previous
message). Just not sure about the cause of the pause.

Anyways, I restarted things with GC being logged this time. Will inspect the
issue if I face it again.

Alex Baranau
----
Sematext :: http://sematext.com/

On Thu, Dec 9, 2010 at 7:44 PM, Jean-Daniel Cryans <jdcryans@apache.org>wrote:

> Hey Alex,
>
> It's either the client or the server box that wasn't responding,
> either way they didn't talk for 52 seconds when your session timeout
> was set at 40 seconds. I suggest you also take a look at the ZK log,
> it should tell you exactly when the session was expired.
>
> J-D
>
> On Thu, Dec 9, 2010 at 9:19 AM, Alex Baranau <alex.baranov.v@gmail.com>
> wrote:
> > Hello guys,
> >
> > I got regionserver crash and trying to find out why. I found
> > * in regionserver log [1] : ZK session expired, and before that slow hlog
> > edits;
> > * nothing in DataNode log [2] and HMaster[4] log;
> > * some warns in ZK log [3] with EndOfStreamException.
> >
> > I wonder can it be because of long GC and thus "stop the world" pause for
> > about a minute (seen in logs)? Unfortunately GC wasn't logged (this is an
> > old test box). If it can be caused by GC I'd restart the server with GC
> > logging, but if not, I'd like to figure out the cause...
> >
> > HBase version is 0.89.20100924.
> >
> > Thank you,
> >
> > Alex Baranau
> > ----
> > Sematext :: http://sematext.com/
> >
> > [1] Regionserver log: ZK session expired, before that slow hlog edits
> >
> > 2010-12-08 06:43:15,403 INFO
> org.apache.hadoop.hbase.regionserver.HRegion:
> > Starting compaction on region agg2,system|tq|s\x00|in licensed License
> > project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c.
> > 2010-12-08 06:43:15,404 INFO org.apache.hadoop.hbase.regionserver.Store:
> > Started compaction of 4 file(s) in a of agg2,system|tq|s\x00|in licensed
> > License project\x00|,1291790014280.75a3871bc968e0438e763cde7e313c2c.,
> > hasReferences=true, into
> > hdfs://localhost/hbase/agg2/75a3871bc968e0438e763cde7e313c2c/.tmp,
> > sequenceid=1153119174
> > 2010-12-08 06:43:24,825 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> > started; Attempting to free 39.93 MB of total=339.31 MB
> > 2010-12-08 06:43:24,829 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> > completed; freed=39.95 MB, total=299.36 MB, single=147.93 MB,
> multi=154.82
> > MB, memory=33.28 MB
> > 2010-12-08 06:43:50,871 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> > started; Attempting to free 40 MB of total=339.39 MB
> > 2010-12-08 06:43:50,875 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: Block cache LRU eviction
> > completed; freed=40.01 MB, total=299.37 MB, single=147.86 MB,
> multi=154.96
> > MB, memory=33.28 MB
> > 2010-12-08 06:43:57,095 WARN
> org.apache.hadoop.hbase.regionserver.wal.HLog:
> > IPC Server handler 4 on 60020 took 5338ms appending an edit to hlog;
> > editcount=8882
> > 2010-12-08 06:44:25,086 WARN
> org.apache.hadoop.hbase.regionserver.wal.HLog:
> > IPC Server handler 24 on 60020 took 5658ms appending an edit to hlog;
> > editcount=13054
> > 2010-12-08 06:44:55,815 WARN
> org.apache.hadoop.hbase.regionserver.wal.HLog:
> > IPC Server handler 5 on 60020 took 5884ms appending an edit to hlog;
> > editcount=13472
> > 2010-12-08 06:45:33,003 DEBUG
> > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=337.5
> MB,
> > free=61.68 MB, max=399.18 MB, blocks=9015, accesses=553728227,
> > hits=40478965, hitRatio=7.31%%, evictions=135536, evicted=86306453,
> > evictedPerRun=636.77880859375
> > 2010-12-08 06:45:45,333 INFO
> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of
> > raw3,,1291780273350.73d076a68191712c68d16c4805973561. because global
> > memstore limit of 798.4m exceeded; currently 812.8m and flushing till
> 698.6m
> > 2010-12-08 06:45:45,363 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion:
> > Started memstore flush for region
> > raw3,,1291780273350.73d076a68191712c68d16c4805973561.. Current region
> > memstore size 45.0m
> > 2010-12-08 06:45:45,364 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion:
> > Finished snapshotting, commencing flushing stores
> > 2010-12-08 06:45:58,107 INFO org.apache.hadoop.hbase.regionserver.Store:
> > Renaming flushed file at
> >
> hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/.tmp/6597521672231455831
> > to
> >
> hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097847060580
> > 2010-12-08 06:45:58,164 INFO org.apache.hadoop.hbase.regionserver.Store:
> > Added
> >
> hdfs://localhost/hbase/raw3/73d076a68191712c68d16c4805973561/c/3157254097847060580,
> > entries=8356, sequenceid=1153215388, memsize=1.7m, filesize=73.2k to
> > raw3,,1291780273350.73d076a68191712c68d16c4805973561.
> > 2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client
> session
> > timed out, have not heard from server in 52481ms for sessionid
> > 0x12c34f1783c0002, closing socket connection and attempting reconnect
> > 2010-12-08 06:46:44,092 INFO org.apache.zookeeper.ClientCnxn: Client
> session
> > timed out, have not heard from server in 52481ms for sessionid
> > 0x12c34f1783c00bf, closing socket connection and attempting reconnect
> > 2010-12-08 06:46:44,200 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> > state: Disconnected, type: None, path: null
> > 2010-12-08 06:46:44,200 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> > state: Disconnected, type: None, path: null
> > 2010-12-08 06:46:44,201 WARN
> > org.apache.hadoop.hbase.client.HConnectionManager$ClientZKWatcher: No
> longer
> > connected to ZooKeeper, current state: Disconnected
> > 2010-12-08 06:46:51,218 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket
> > connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181
> > 2010-12-08 06:46:51,219 INFO org.apache.zookeeper.ClientCnxn: Socket
> > connection established to ip-10-245-202-207.ec2.internal/
> 10.245.202.207:2181,
> > initiating session
> > 2010-12-08 06:46:51,220 INFO org.apache.zookeeper.ClientCnxn: Session
> > establishment complete on server ip-10-245-202-207.ec2.internal/
> > 10.245.202.207:2181, sessionid = 0x12c34f1783c00bf, negotiated timeout =
> > 40000
> > 2010-12-08 06:46:51,264 INFO org.apache.zookeeper.ZooKeeper: Session:
> > 0x12c34f1783c00bf closed
> > 2010-12-08 06:46:51,264 DEBUG
> > org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper:
> >
> <localhost:/hbase,org.apache.hadoop.hbase.client.HConnectionManager>Closed
> > connection with ZooKeeper; /hbase/root-region-server
> > 2010-12-08 06:46:58,375 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket
> > connection to server ip-10-245-202-207.ec2.internal/10.245.202.207:2181
> > 2010-12-08 06:46:58,376 INFO org.apache.zookeeper.ClientCnxn: Socket
> > connection established to ip-10-245-202-207.ec2.internal/
> 10.245.202.207:2181,
> > initiating session
> > 2010-12-08 06:46:58,490 INFO org.apache.zookeeper.ClientCnxn: Unable to
> > reconnect to ZooKeeper service, session 0x12c34f1783c0002 has expired,
> > closing socket connection
> > 2010-12-08 06:46:58,491 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> > state: Expired, type: None, path: null
> > 2010-12-08 06:46:58,491 ERROR
> > org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session
> > expired
> > 2010-12-08 06:46:58,491 FATAL
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Aborting region
> server
> > serverName=ip-10-245-202-207.ec2.internal,60020,1289379196235,
> > load=(requests=73, regions=649, usedHeap=1983, maxHeap=1995): ZooKeeper
> > session expired
> >
> >
> > [2] DataNode log: nothing, just noticeable pause for almost 1 minute
> >
> > 2010-12-08 06:45:58,120 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > 127.0.0.1:50010, dest: /127.0.0.1:39194, bytes: 200, op: HDFS_READ,
> cliID:
> >
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> > offset: 74752, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> > blockid: blk_6449721912917715607_913675, duration: 3639142
> > 2010-12-08 06:45:58,134 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > 127.0.0.1:50010, dest: /127.0.0.1:39195, bytes: 716, op: HDFS_READ,
> cliID:
> >
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> > offset: 74240, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> > blockid: blk_6449721912917715607_913675, duration: 427441
> > 2010-12-08 06:45:58,148 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > 127.0.0.1:50010, dest: /127.0.0.1:39193, bytes: 1014972, op: HDFS_READ,
> > cliID:
> >
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> > offset: 66101760, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> > blockid: blk_825672696763040280_900371, duration: 6352102264
> > 2010-12-08 06:46:44,192 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> > blk_-8677121983399284148_913676 src: /127.0.0.1:39196 dest: /
> 127.0.0.1:50010
> > 2010-12-08 06:46:51,263 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > 127.0.0.1:50010, dest: /127.0.0.1:53722, bytes: 264192, op: HDFS_READ,
> > cliID:
> >
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> > offset: 17161216, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> > blockid: blk_8449446729722678098_531164, duration: 401874426400
> > 2010-12-08 06:46:51,333 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> > 127.0.0.1:50010, dest: /127.0.0.1:53737, bytes: 264192, op: HDFS_READ,
> > cliID:
> >
> DFSClient_hb_rs_ip-10-245-202-207.ec2.internal,60020,1289379196235_1289379196335,
> > offset: 38253056, srvID: DS-137381847-10.245.202.207-50010-1287129392787,
> > blockid: blk_-4815658898725325280_624191, duration: 401614939902
> > 2010-12-08 06:46:58,266 INFO
> > org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
> > recoverBlock(block=blk_7609568532460932683_912614, targets=[
> 127.0.0.1:50010
> > ])
> >
> > [3] Zookeeper log: some warns
> >
> > 2010-12-08 02:50:46,100 - INFO
>  [ProcessThread:-1:PrepRequestProcessor@385]
> > - Processed session termination for sessionid: 0x12c34f1783d6eba
> > 2010-12-08 02:50:46,104 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
> > client /127.0.0.1:48900 which had sessionid 0x12c34f1783d6eba
> > 2010-12-08 06:46:44,092 - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@633] - EndOfStreamException: Unable
> to
> > read additional data from client sessionid 0x12c34f1783c00bf, likely
> client
> > has closed socket
> > 2010-12-08 06:46:44,661 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
> > client /10.245.202.207:53464 which had sessionid 0x12c34f1783c00bf
> > 2010-12-08 06:46:44,662 - WARN  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@633] - EndOfStreamException: Unable
> to
> > read additional data from client sessionid 0x12c34f1783c0002, likely
> client
> > has closed socket
> > 2010-12-08 06:46:44,662 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
> > client /10.245.202.207:37888 which had sessionid 0x12c34f1783c0002
> > 2010-12-08 06:46:51,219 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@250] - Accepted socket
> connection
> > from /10.245.202.207:33388
> > 2010-12-08 06:46:51,219 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@769] - Client attempting to renew
> session
> > 0x12c34f1783c00bf at /10.245.202.207:33388
> > 2010-12-08 06:46:51,220 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1579] - Established session
> > 0x12c34f1783c00bf with negotiated timeout 40000 for client /
> > 10.245.202.207:33388
> > 2010-12-08 06:46:51,220 - INFO
>  [ProcessThread:-1:PrepRequestProcessor@385]
> > - Processed session termination for sessionid: 0x12c34f1783c00bf
> > 2010-12-08 06:46:51,223 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
> > client /10.245.202.207:33388 which had sessionid 0x12c34f1783c00bf
> > 2010-12-08 06:46:58,001 - INFO  [SessionTracker:ZooKeeperServer@315] -
> > Expiring session 0x12c34f1783c0002, timeout of 40000ms exceeded
> > 2010-12-08 06:46:58,002 - INFO
>  [ProcessThread:-1:PrepRequestProcessor@385]
> > - Processed session termination for sessionid: 0x12c34f1783c0002
> > 2010-12-08 06:46:58,376 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@250] - Accepted socket
> connection
> > from /10.245.202.207:33394
> > 2010-12-08 06:46:58,489 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@769] - Client attempting to renew
> session
> > 0x12c34f1783c0002 at /10.245.202.207:33394
> > 2010-12-08 06:46:58,489 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1572] - Invalid session
> 0x12c34f1783c0002
> > for client /10.245.202.207:33394, probably expired
> > 2010-12-08 06:46:58,490 - INFO  [NIOServerCxn.Factory:
> > 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1434] - Closed socket connection for
> > client /10.245.202.207:33394 which had sessionid 0x12c34f1783c0002
> >
> > [4] HMaster log: nothing
> >
> > 2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner:
> > RegionManager.metaScanner scan of 648 row(s) of meta region {server:
> > 10.245.202.207:60020, regionname: .META.,,1.1028785192, startKey: <>}
> > complete
> > 2010-12-08 06:45:02,246 INFO org.apache.hadoop.hbase.master.BaseScanner:
> All
> > 1 .META. region(s) scanned
> > 2010-12-08 06:45:25,277 INFO org.apache.hadoop.hbase.master.BaseScanner:
> > RegionManager.rootScanner scanning meta region {server:
> 10.245.202.207:60020,
> > regionname: -ROOT-,,0.70236052, startKey: <>}
> > 2010-12-08 06:45:27,132 INFO org.apache.hadoop.hbase.master.BaseScanner:
> > RegionManager.rootScanner scan of 1 row(s) of meta region {server:
> > 10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>}
> complete
> > 2010-12-08 06:45:35,367 INFO
> org.apache.hadoop.hbase.master.ServerManager: 1
> > region servers, 0 dead, average load 649.0
> > 2010-12-08 06:45:45,563 INFO org.apache.hadoop.hbase.master.BaseScanner:
> > RegionManager.metaScanner scanning meta region {server:
> 10.245.202.207:60020,
> > regionname: .META.,,1.1028785192, startKey: <>}
> > 2010-12-08 06:46:25,244 INFO org.apache.hadoop.hbase.master.BaseScanner:
> > RegionManager.rootScanner scanning meta region {server:
> 10.245.202.207:60020,
> > regionname: -ROOT-,,0.70236052, startKey: <>}
> > 2010-12-08 06:46:35,369 INFO
> org.apache.hadoop.hbase.master.ServerManager: 1
> > region servers, 0 dead, average load 649.0
> > 2010-12-08 06:46:44,154 INFO org.apache.hadoop.hbase.master.BaseScanner:
> > RegionManager.rootScanner scan of 1 row(s) of meta region {server:
> > 10.245.202.207:60020, regionname: -ROOT-,,0.70236052, startKey: <>}
> complete
> > 2010-12-08 06:46:58,005 INFO
> org.apache.hadoop.hbase.master.ServerManager:
> > ip-10-245-202-207.ec2.internal,60020,1289379196235 znode expired
> > 2010-12-08 06:46:58,005 DEBUG
> org.apache.hadoop.hbase.master.ServerManager:
> > Added=ip-10-245-202-207.ec2.internal,60020,1289379196235 to dead servers,
> > added shutdown processing operation
> > 2010-12-08 06:46:58,006 INFO
> org.apache.hadoop.hbase.master.RegionManager:
> > -ROOT- region unset (but not set to be reassigned)
> > 2010-12-08 06:46:58,006 INFO
> org.apache.hadoop.hbase.master.RegionManager:
> > META region whose startkey is  removed from onlineMetaRegions
> > 2010-12-08 06:46:58,006 DEBUG
> > org.apache.hadoop.hbase.master.RegionServerOperationQueue: Processing
> todo:
> > ProcessServerShutdown of
> ip-10-245-202-207.ec2.internal,60020,1289379196235
> > 2010-12-08 06:46:58,006 INFO
> > org.apache.hadoop.hbase.master.RegionServerOperation: Process shutdown of
> > server ip-10-245-202-207.ec2.internal,60020,1289379196235: logSplit:
> false,
> > rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 0
> > 2010-12-08 06:46:58,012 INFO
> org.apache.hadoop.hbase.regionserver.wal.HLog:
> > Splitting 34 hlog(s) in
> >
> hdfs://localhost/hbase/.logs/ip-10-245-202-207.ec2.internal,60020,1289379196235
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message