hbase-user mailing list archives

From Stack <st...@duboce.net>
Subject Re: debugging a possible long garbage collection
Date Fri, 29 Apr 2011 16:13:52 GMT
On Thu, Apr 28, 2011 at 9:48 PM, Garrett Wu <wugarrett@gmail.com> wrote:
>   2. The master started splitting logs for the hadoop11 region.

Splitting logs is a pretty intense operation.  It could have run up
your master heap and CPU such that it brought on a long GC pause.
Look for a big gap in your logging just before the zk session timeout.
(In 0.92.0, log splitting is distributed across the cluster; the
master does nought but orchestrate the process.)
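
If you want to confirm the theory, turn on GC logging so a long pause
shows up explicitly in a gc log rather than only as a gap in the
application log.  A minimal sketch, assuming you launch the daemons via
conf/hbase-env.sh and want the gc log alongside the other logs (adjust
the path to suit):

  # conf/hbase-env.sh: log each collection with timestamps so a long
  # stop-the-world pause is easy to line up against the master/RS logs
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCTimeStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"

Then look in gc-hbase.log for a collection whose pause time lines up
with the gap around 07:59.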

In your regionserver node log, there does seem to be a big pause just
before the zk session timeout.  Similarly in the master (at about the
point where you've inserted the 'oh no').

For more on what Ted and Michel are referring to, see
http://hbase.apache.org/book.html#jvm
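
That section is about GC tuning of the HBase daemons.  As a rough
illustration of the kind of settings it discusses (a sketch only; the
right flags depend on your heap size and JDK, and the 70 below is just
an example value, not a recommendation), you would add something like
this to conf/hbase-env.sh:

  # Use the concurrent (CMS) collector and start concurrent cycles
  # early, so the old generation does not fill up and force a lengthy
  # stop-the-world full GC of the sort suspected above
  export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC \
    -XX:CMSInitiatingOccupancyFraction=70"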

St.Ack


> Some snippets from the logs are pasted below.  Does anyone know what may
> have caused this?  Was the hang really a garbage collection?  How might I
> prevent this from happening again?
>
> Thanks,
> Garrett
>
>
> // From the master (hadoop01)
> 2011-04-28 07:59:24,004 INFO
> org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer
> ephemeral node deleted, processing expiration [hadoop11,60020,1302898865706]
> 2011-04-28 07:59:24,005 DEBUG org.apache.hadoop.hbase.master.ServerManager:
> Added=hadoop11,60020,1302898865706 to dead servers, submitted shutdown handler
> to be executed, root=false, meta=false
> 2011-04-28 07:59:24,005 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs
> for hadoop11,60020,1302898865706
> [ ... ]
> 2011-04-28 07:59:24,984 DEBUG
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer
> path=hdfs://hadoop01/hbase/processed/03819443e22488ec15418f22c67f02f5/recovered.edits/0000000000060056396
> region=03819443e22488ec15418f22c67f02f5
> 2011-04-28 07:59:25,101 INFO
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs
> -- HDFS-200
> 2011-04-28 07:59:25,101 DEBUG
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer
> path=hdfs://hadoop01/hbase/processed/c56e10565f9e97fd19fb0879f59fd8b0/recovered.edits/0000000000060060517
> region=c56e10565f9e97fd19fb0879f59fd8b0
> [ // oh no! a 3 minute gap! ]
> 2011-04-28 08:03:58,571 DEBUG
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Pushed=77384 entries
> from
> hdfs://hadoop01/hbase/.logs/hadoop11,60020,1302898865706/hadoop11%3A60020.1303174692580
> 2011-04-28 08:03:58,572 DEBUG
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 2 of
> 29:
> hdfs://hadoop01/hbase/.logs/hadoop11,60020,1302898865706/hadoop11%3A60020.1303174703672,
> length=63864464
> [ ... ]
> 2011-04-28 08:03:59,055 WARN org.apache.hadoop.hbase.util.Sleeper: We slept
> 273689ms instead of 1000ms, this is likely due to a long garbage collecting
> pause and it's usually bad, see
> http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> [ ... ]
> 2011-04-28 08:04:00,261 FATAL org.apache.hadoop.hbase.master.HMaster:
> master:60000-0x12f4c26d7652910 master:60000-0x12f4c26d7652910 received
> expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
>        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
>        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
>        at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
>        at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
> 2011-04-28 08:04:00,261 INFO org.apache.hadoop.hbase.master.HMaster:
> Aborting
>
> // From the regionserver (hadoop11)
> 2011-04-28 07:41:06,851 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=1013.02 MB,
> free=210.38 MB, max=1.19 GB, blocks=14843, accesses=5097608, hits=43675,
> hitRatio=0.85%%, cachingAccesses=1016414, cachingHits=43675,
> cachingHitsRatio=4.29%%, evictions=528, evicted=957896,
> evictedPerRun=1814.197021484375
> 2011-04-28 07:46:06,851 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=1013.02 MB,
> free=210.38 MB, max=1.19 GB, blocks=14843, accesses=5097608, hits=43675,
> hitRatio=0.85%%, cachingAccesses=1016414, cachingHits=43675,
> cachingHitsRatio=4.29%%, evictions=528, evicted=957896,
> evictedPerRun=1814.197021484375
> 2011-04-28 07:51:06,852 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=1013.02 MB,
> free=210.38 MB, max=1.19 GB, blocks=14843, accesses=5097608, hits=43675,
> hitRatio=0.85%%, cachingAccesses=1016414, cachingHits=43675,
> cachingHitsRatio=4.29%%, evictions=528, evicted=957896,
> evictedPerRun=1814.197021484375
> 2011-04-28 07:59:39,198 INFO org.apache.zookeeper.ClientCnxn: Client session
> timed out, have not heard from server in 285991ms for sessionid
> 0x12f4c26d7652917, closing socket connection and attempting reconnect
> 2011-04-28 07:59:40,600 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server hadoop01/172.30.160.15:2181
> 2011-04-28 07:59:40,601 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to hadoop01/172.30.160.15:2181, initiating session
> 2011-04-28 07:59:40,602 DEBUG
> org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms
> elapsed
> 2011-04-28 07:59:40,652 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server hadoop01/172.30.160.15:2181, sessionid =
> 0x12f4c26d7652917, negotiated timeout = 40000
> 2011-04-28 07:59:41,288 INFO org.apache.zookeeper.ClientCnxn: Client session
> timed out, have not heard from server in 285993ms for sessionid
> 0x12f4c26d7652920, closing socket connection and attempting reconnect
> [ ... ]
> 2011-04-28 07:59:23,174 INFO
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> Reconnected successfully. This disconnect could have been caused by a
> network partition or a long-running GC pause, either way it's recommended
> that you verify your environment.
> 2011-04-28 07:59:23,174 INFO org.apache.zookeeper.ClientCnxn: EventThread
> shut down
> 2011-04-28 07:59:24,004 INFO org.apache.zookeeper.ClientCnxn: Unable to read
> additional data from server sessionid 0x12f4c26d7652920, likely server has
> closed socket, closing socket connection and attempting reconnect
> 2011-04-28 07:59:25,773 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server hadoop01/172.30.160.15:2181
> 2011-04-28 07:59:25,774 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to hadoop01/172.30.160.15:2181, initiating session
> 2011-04-28 07:59:25,775 INFO org.apache.zookeeper.ClientCnxn: Unable to
> reconnect to ZooKeeper service, session 0x12f4c26d7652920 has expired,
> closing socket connection
> 2011-04-28 07:59:25,775 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=hadoop11.sdc2.svcmot.com,60020,1302898865706, load=(requests=0,
> regions=46, usedHeap=4130, maxHeap=6117):
> regionserver:60020-0x12f4c26d7652920-0x12f4c26d7652920
> regionserver:60020-0x12f4c26d7652920-0x12f4c26d7652920 received expired from
> ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
>        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
>        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
>        at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
>        at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)
>
