hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gaojinchao <gaojinc...@huawei.com>
Subject Re: Hmaster is OutOfMemory
Date Tue, 10 May 2011 05:17:36 GMT
Hbase version : 0.90.3RC0

It happened when creating table with Regions
I find master started needs so much memory when the cluster has 100K regions
It seems likes zkclientcnxn.

It seems master assigned region need improve.


top -c | grep 5834
5834 root      20   0 8875m 7.9g  11m S    2 50.5  33:53.19 /opt/jdk1.6.0_22/bin/java -Xmx8192m
-ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode

Attaching to process ID 5834, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 17.1-b03

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 8589934592 (8192.0MB)
   NewSize          = 21757952 (20.75MB)
   MaxNewSize       = 174456832 (166.375MB)
   OldSize          = 65404928 (62.375MB)
   NewRatio         = 7
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 88080384 (84.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 100335616 (95.6875MB)
   used     = 47094720 (44.91302490234375MB)
   free     = 53240896 (50.77447509765625MB)
   46.93719127612671% used
Eden Space:
   capacity = 89194496 (85.0625MB)
   used     = 35953600 (34.28802490234375MB)
   free     = 53240896 (50.77447509765625MB)
   40.30921369856723% used
From Space:
   capacity = 11141120 (10.625MB)
   used     = 11141120 (10.625MB)
   free     = 0 (0.0MB)
   100.0% used
To Space:
   capacity = 11141120 (10.625MB)
   used     = 0 (0.0MB)
   free     = 11141120 (10.625MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 8415477760 (8025.625MB)
   used     = 5107249280 (4870.6524658203125MB)
   free     = 3308228480 (3154.9725341796875MB)
   60.68876213155128% used
Perm Generation:
   capacity = 31199232 (29.75390625MB)
   used     = 18681784 (17.81633758544922MB)
   free     = 12517448 (11.937568664550781MB)
   59.87898676480241% used


-----邮件原件-----
发件人: jdcryans@gmail.com [mailto:jdcryans@gmail.com] 代表 Jean-Daniel Cryans
发送时间: 2011年5月10日 1:20
收件人: user@hbase.apache.org
主题: Re: Hmaster is OutOfMemory

It looks like the master entered a GC loop of death (since there are a
lot of "We slept 76166ms" messages) and finally died. Was it splitting
logs? Did you get a heap dump? Did you inspect it and can you tell
what was using all that space?

Thx,

J-D

2011/5/8 Gaojinchao <gaojinchao@huawei.com>:
> Hbase version 0.90.2:
> Hmaster has 8G memory,  It seems like not enough ? why it needs so much memory?(50K region)
>
> Other issue. Log is error:
> see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9 should be see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A8
>
> Hmaster logs:
>
> 2011-05-06 19:31:09,924 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x12fc3a17c070022
Creating (or updating) unassigned node for 2f19f33ae3f21ac4cb681f1662767d0c with OFFLINE state
> 2011-05-06 19:31:09,924 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 76166ms instead
of 60000ms, this is likely due to a long garbage collecting pause and it's usually bad, see
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> 2011-05-06 19:31:09,924 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 16697ms instead
of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> 2011-05-06 19:31:09,932 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing
OFFLINE; was=ufdr,211007,1304669377398.696f124cc6ff82302f735c8413c6ac0b. state=CLOSED, ts=1304681364406
> 2011-05-06 19:31:09,932 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x12fc3a17c070022
Creating (or updating) unassigned node for 696f124cc6ff82302f735c8413c6ac0b with OFFLINE state
> 2011-05-06 19:31:22,942 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
Opened region ufdr,071415,1304668656420.aa026fbb27a25b0fe54039c00108dad6. on 157-5-100-9,20020,1304678135900
> 2011-05-06 19:31:22,942 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
Handling OPENED event for 7a75bac2028fba1529075225a3755c4c; deleting unassigned node
> 2011-05-06 19:31:22,942 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x12fc3a17c070022
Deleting existing unassigned node for 7a75bac2028fba1529075225a3755c4c that is in expected
state RS_ZK_REGION_OPENED
> 2011-05-06 19:31:38,986 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 12948ms instead
of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> 2011-05-06 19:31:39,059 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent
CLOSE to serverName=158-1-91-101,20020,1304663941215, load=(requests=0, regions=5374, usedHeap=79,
maxHeap=8165) for region ufdr,082288,1304668716520.144efd9c886f5567abc9a229eb380c76.
> 2011-05-06 19:31:39,059 INFO org.apache.hadoop.hbase.master.HMaster: balance hri=ufdr,050142,1304668531330.2d56755e7138a93b243a7c5363d1ac22.,
src=158-1-91-101,20020,1304663941215, dest=157-5-100-9,20020,1304678135900
>
> 2011-05-06 19:36:16,570 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection
to server /157.5.100.11:2181
> 2011-05-06 19:36:16,570 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to 157-2011-05-06 19:31:54,741 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 15681ms
instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually
bad, see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> 2011-05-06 19:32:11,419 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 16574ms instead
of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> 2011-05-06 19:32:23,176 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 11756ms instead
of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> 2011-05-06 19:32:23,245 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional
data from server sessionid 0x12fc3a17c070054, likely server has closed socket, closing socket
connection and attempting reconnect
> 2011-05-06 19:32:36,902 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 13657ms instead
of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> 2011-05-06 19:35:44,084 WARN org.apache.hadoop.hbase.zookeeper.ZKUtil: hconnection-0x12fc3a17c070054
Unable to set watcher on znode /hbase/root-region-server
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
for /hbase/root-region-server
>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:819)
>        at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:232)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:72)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:308)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:319)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1333)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
>        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:541)
>        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:517)
> 2011-05-06 19:35:44,096 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 78032ms instead
of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> 2011-05-06 19:35:44,095 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable
while processing event RS_ZK_REGION_OPENED
> java.lang.OutOfMemoryError: Java heap space
>        at java.util.HashMap.addEntry(HashMap.java:753)
>        at java.util.HashMap.put(HashMap.java:385)
>        at java.util.HashMap.putAll(HashMap.java:524)
>        at org.apache.hadoop.hbase.master.AssignmentManager.updateTimers(AssignmentManager.java:630)
>        at org.apache.hadoop.hbase.master.AssignmentManager.regionOnline(AssignmentManager.java:607)
>        at org.apache.hadoop.hbase.master.handler.OpenedRegionHandler.process(OpenedRegionHandler.java:97)
>        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
> 2011-05-06 19:35:59,981 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
Handling OPENED event for accac1bddaa4bf102bcfa6506b19839d; deleting unassigned node
> 2011-05-06 19:35:44,093 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting
unassignment of region ufdr,050142,1304668531330.2d56755e7138a93b243a7c5363d1ac22. (offlining)
> 2011-05-06 19:35:59,981 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection
to server /157.5.100.11:2181
> 2011-05-06 19:35:59,981 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 15884ms instead
of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> 2011-05-06 19:35:44,097 ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher: hconnection-0x12fc3a17c070054
Received unexpected KeeperException, re-throwing exception
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
for /hbase/root-region-server
>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:819)
>        at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:232)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:72)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:308)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:319)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1333)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
>        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:541)
>        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:517)5-100-11/157.5.100.11:2181,
initiating session
> 2011-05-06 19:36:16,570 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 16488ms instead
of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9
> 2011-05-06 19:36:16,570 INFO org.apache.zookeeper.ClientCnxn: Socket connection established
to 157-5-100-11/157.5.100.11:2181, initiating session
> 2011-05-06 19:36:16,571 INFO org.apache.zookeeper.ClientCnxn: Client session timed out,
have not heard from server in 16489ms for sessionid 0x12fc3a17c070054, closing socket connection
and attempting reconnect
> 2011-05-06 19:36:00,082 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:20000-0x12fc3a17c070022
Deleting existing unassigned node for accac1bddaa4bf102bcfa6506b19839d that is in expected
state RS_ZK_REGION_OPENED
> 2011-05-06 19:36:16,570 FATAL org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Unexpected exception during initialization, aborting
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
for /hbase/root-region-server
>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>        at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>        at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:819)
>        at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:232)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:72)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:308)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:319)
>        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1333)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328)
>        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246)
>        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:541)
>        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:517)
> 2011-05-06 19:36:16,572 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Reconnected successfully. This disconnect could have been caused by a network partition or
a long-running GC pause, either way it's recommended that you verify your environment.
> 2011-05-06 19:36:16,572 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
> 2011-05-06 19:36:16,572 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan
of catalog table
> org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hbase.UnknownScannerException:
Name: 3135625757230134514
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1794)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1782)
>        at sun.reflect.GeneratedMethodAccessor31.invoke(Unknown Source)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>
>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:771)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>        at $Proxy6.next(Unknown Source)
>        at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:264)
>        at org.apache.hadoop.hbase.catalog.MetaReader.fullScan(MetaReader.java:237)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:116)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:85)
>        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>
Mime
View raw message