hbase-user mailing list archives

From 茅旭峰 <m9s...@gmail.com>
Subject Re: One of the regionserver aborted, then the master shut down itself
Date Tue, 15 Mar 2011 15:32:10 GMT
Thanks J-D for your reply.

It looks like HBASE-3617 will be included in 0.92; when will 0.92 be
released?

Yes, you're right: we launched tens of threads, putting values of 4MB on
average, endlessly. Is the region server meant to die because of OOM? I
thought it is the region server's responsibility to flush memstores into
HDFS, so the limit on endless insertion should be the size of HDFS rather
than the Java heap (we set a 4GB Java heap for the region server).
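
For reference, our load test is essentially the sketch below (the table,
family and qualifier names are placeholders, not our exact schema); each
thread builds ~4MB values and puts them in an endless loop:

====
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BigCellLoadTest {
  // Placeholder names; our real table and family differ.
  private static final String TABLE = "table2";
  private static final byte[] FAMILY = Bytes.toBytes("f");
  private static final byte[] QUALIFIER = Bytes.toBytes("q");
  private static final int VALUE_SIZE = 4 * 1024 * 1024; // ~4MB cells
  private static final int THREADS = 20;                 // "tens of threads"

  public static void main(String[] args) {
    final Configuration conf = HBaseConfiguration.create();
    for (int i = 0; i < THREADS; i++) {
      new Thread(new Runnable() {
        public void run() {
          try {
            // One HTable per thread; HTable is not thread-safe.
            HTable table = new HTable(conf, TABLE);
            Random rnd = new Random();
            byte[] value = new byte[VALUE_SIZE];
            while (true) {                    // insert endlessly
              rnd.nextBytes(value);
              Put put = new Put(Bytes.toBytes(rnd.nextLong()));
              put.add(FAMILY, QUALIFIER, value);
              table.put(put);
            }
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      }).start();
    }
  }
}
====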

Today we cleaned up HDFS and reran the stress tests, again inserting
endlessly. With Java memory monitoring tools like jconsole, we can see
that the master's Java heap also keeps growing, so another OOM looks
likely, though it has not happened so far. Is the master meant to die in
this case as well?
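
What we watch in jconsole is simply the master JVM's heap usage. For the
record, an equivalent check over JMX would look roughly like the sketch
below (it assumes remote JMX is enabled on the master via the usual
com.sun.management.jmxremote options in hbase-env.sh; the host and port
are just examples):

====
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class MasterHeapWatcher {
  public static void main(String[] args) throws Exception {
    // Example endpoint; use whatever host/port JMX is exposed on.
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://cloud135:10101/jmxrmi");
    JMXConnector jmxc = JMXConnectorFactory.connect(url);
    MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
    MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
        mbsc, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
    while (true) {
      // The same number jconsole plots on its Memory tab.
      long usedMb = memory.getHeapMemoryUsage().getUsed() >> 20;
      System.out.println("master heap used: " + usedMb + " MB");
      Thread.sleep(5000);
    }
  }
}
====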

Our keys are SHA-1 hashed, which should spread them uniformly. But from
the web page (master:60010) we can see that most requests are handled by
only one region server, and the master log shows lots of region splits;
eventually the regions end up spread uniformly among the region servers.
Is this workflow correct?
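
In case it matters, the row keys are derived roughly as in the sketch
below (simplified; the helper name is just for illustration). The
28-character base64 strings such as SXGdRIcvExFco4zZIPwhcKjjYlo= seen in
the region names are these SHA-1 digests:

====
import java.security.MessageDigest;

import org.apache.commons.codec.binary.Base64;

public class RowKeys {
  // Row key = base64(SHA-1(source id)); SHA-1 output should be uniformly
  // distributed across the key space.
  public static byte[] rowKey(byte[] sourceId) throws Exception {
    MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
    return Base64.encodeBase64(sha1.digest(sourceId));
  }
}
====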

Thanks again for your time, J-D.

Mao Xu-Feng

On Tue, Mar 15, 2011 at 3:16 AM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> For why the Master died, this looks like
> https://issues.apache.org/jira/browse/HBASE-3617
>
> The region server died on OutOfMemoryError: Java heap space, and I can
> see that it's at 4GB on 4GB so you must be inserting really large
> values from a lot of clients at the same time. Is that right?
>
> J-D
>
> On Mon, Mar 14, 2011 at 6:43 AM, 茅旭峰 <m9suns@gmail.com> wrote:
> > Hi,
> >
> > We are trying out a small HBase deployment: 2 master nodes, 4 region
> > server nodes, and a 3-node ZooKeeper cluster.
> > Now we're doing stress tests using larger cells (4MB - 15MB). What we
> > ran into was:
> >
> > 1. one of the region servers (10.241.67.33) aborted due to OOM, with
> > this log
> > ====
> > 2011-03-14 20:35:28,235 INFO
> org.apache.hadoop.hbase.regionserver.HRegion:
> > Starting compaction on region
> >
> table2,SXGdRIcvExFco4zZIPwhcKjjYlo=,1300096043004.8b00c3cb40e3e827f46aaaeeafce044d.
> > 2011-03-14 20:35:28,236 INFO
> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke
> up
> > with memory above low water.
> > 2011-03-14 20:35:28,236 ERROR
> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flusher
> failed
> > for entry
> >
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$WakeupFlushThread@2ca74909
> >
> > ......
> >
> > 2011-03-14 20:36:40,396 INFO
> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke
> up
> > with memory above low water.
> > 2011-03-14 20:36:40,396 ERROR
> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Cache flusher
> failed
> > for entry
> >
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher$WakeupFlushThread@3b97477e
> > ......
> > 2011-03-14 20:41:36,541 INFO
> org.apache.hadoop.hbase.regionserver.HRegion:
> > aborted compaction on region
> >
> table2,Ig92Q4M9wK3GagLmUMDhHvlQBr0=,1300097018887.62a5a22f09f37e099679b2f6a12490b6.
> > after 1mins, 39sec
> > 2011-03-14 20:41:36,541 WARN
> > org.apache.hadoop.hbase.regionserver.HRegionServer: Received close for
> > region we are not serving; 944c234d8f64eccd0833d86349623309
> > 2011-03-14 20:41:36,544 FATAL
> > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
> server
> > serverName=cloud140,60020,1299747751421, load=(requests=0, regions=255,
> > usedHeap=3982, maxHeap=3983): Uncaught exception in service thread
> > regionserver60020.compactor
> > java.lang.OutOfMemoryError: Java heap space
> >        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
> >        at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
> >        at
> > org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1093)
> >        at
> > org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
> >        at
> >
> org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.next(HFile.java:1276)
> >        at
> >
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:87)
> >        at
> >
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:82)
> >        at
> >
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:262)
> >        at
> >
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:326)
> >        at
> > org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:927)
> >        at
> > org.apache.hadoop.hbase.regionserver.Store.compact(Store.java:733)
> >        at
> >
> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:769)
> >        at
> >
> org.apache.hadoop.hbase.regionserver.HRegion.compactStores(HRegion.java:714)
> >        at
> >
> org.apache.hadoop.hbase.regionserver.CompactSplitThread.run(CompactSplitThread.java:81)
> > ......
> > 2011-03-14 20:41:38,067 INFO org.apache.hadoop.hbase.regionserver.Leases:
> > regionserver60020 closing leases
> > 2011-03-14 20:41:38,067 INFO org.apache.hadoop.hbase.regionserver.Leases:
> > regionserver60020 closed leases
> > 2011-03-14 20:41:38,085 INFO
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> > Closed zookeeper sessionid=0x32e9a2f2678000a
> > 2011-03-14 20:41:38,093 INFO org.apache.zookeeper.ZooKeeper: Session:
> > 0x32e9a2f2678000a closed
> > 2011-03-14 20:41:38,093 INFO org.apache.zookeeper.ClientCnxn: EventThread
> > shut down
> > 2011-03-14 20:41:38,100 INFO org.apache.zookeeper.ZooKeeper: Session:
> > 0x12e9a2e21be0011 closed
> > 2011-03-14 20:41:38,101 INFO org.apache.zookeeper.ClientCnxn: EventThread
> > shut down
> > 2011-03-14 20:41:38,102 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: regionserver60020
> > exiting
> > 2011-03-14 20:41:38,117 INFO
> > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> starting;
> > hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-14,5,main]
> > 2011-03-14 20:41:38,117 INFO
> > org.apache.hadoop.hbase.regionserver.HRegionServer: STOPPED: Shutdown
> hook
> > 2011-03-14 20:41:38,117 INFO
> > org.apache.hadoop.hbase.regionserver.ShutdownHook: Starting fs shutdown
> hook
> > thread.
> > 2011-03-14 20:41:38,220 INFO
> > org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook
> finished.
> >
> > ====
> >
> > 2. the master shut itself down, with this log
> >
> > ====
> > 2011-03-14 20:41:03,991 DEBUG
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> > Cache hit for row <> in tableName .META.: location server cloud134:60020,
> > location region name .META.,,1.1028785192
> > 2011-03-14 20:41:35,415 FATAL org.apache.hadoop.hbase.master.HMaster:
> Remote
> > unexpected exception
> > java.io.IOException: Call to /10.241.67.33:60020 failed on local
> exception:
> > java.io.EOFException
> >        at
> >
> org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:806)
> >        at
> > org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:775)
> >        at
> > org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
> >        at $Proxy6.closeRegion(Unknown Source)
> >        at
> >
> org.apache.hadoop.hbase.master.ServerManager.sendRegionClose(ServerManager.java:589)
> >        at
> >
> org.apache.hadoop.hbase.master.AssignmentManager.unassign(AssignmentManager.java:1093)
> >        at
> >
> org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor.chore(AssignmentManager.java:1672)
> >        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
> > Caused by: java.io.EOFException
> >        at java.io.DataInputStream.readInt(DataInputStream.java:375)
> >        at
> >
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:539)
> >        at
> >
> org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:477)
> > 2011-03-14 20:41:35,417 INFO org.apache.hadoop.hbase.master.HMaster:
> > Aborting
> > 2011-03-14 20:41:35,421 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to
> > serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0,
> > usedHeap=0, maxHeap=0), message=Connection refused,
> > region=1bd40fb0466d9b0b52f93f9af41b3a8e
> > 2011-03-14 20:41:35,424 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to
> > serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0,
> > usedHeap=0, maxHeap=0), message=Connection refused,
> > region=1c564f7c72355b78369aa1467d9852e5
> > 2011-03-14 20:41:35,425 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to
> > serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0,
> > usedHeap=0, maxHeap=0), message=Connection refused,
> > region=1cabffd7a9213938ac1181f05c19f2d7
> > .....
> > 2011-03-14 20:41:36,013 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to
> > serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0,
> > usedHeap=0, maxHeap=0), message=Connection refused,
> > region=f694004eeb2dea68a65ec3b378764805
> > 2011-03-14 20:41:36,017 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager: Failed connect to
> > serverName=cloud140,60020,1299747751421, load=(requests=0, regions=0,
> > usedHeap=0, maxHeap=0), message=Connection refused,
> > region=fa3f2e8c175d794aa8b18e3e8995d382
> > 2011-03-14 20:41:36,025 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor:
> > cloud135:60000.timeoutMonitor exiting
> > 2011-03-14 20:41:36,093 DEBUG org.apache.hadoop.hbase.master.HMaster:
> > Stopping service threads
> > 2011-03-14 20:41:36,093 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> > server on 60000
> > 2011-03-14 20:41:36,093 INFO
> org.apache.hadoop.hbase.master.CatalogJanitor:
> > cloud135:60000-CatalogJanitor exiting
> > 2011-03-14 20:41:36,094 INFO org.apache.hadoop.hbase.master.HMaster$1:
> > cloud135:60000-BalancerChore exiting
> > 2011-03-14 20:41:36,094 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> IPC
> > Server listener on 60000
> > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server
> > handler 0 on 60000: exiting
> > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server
> > handler 4 on 60000: exiting
> > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: Stopping
> IPC
> > Server Responder
> > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server
> > handler 3 on 60000: exiting
> > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server
> > handler 1 on 60000: exiting
> > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server
> > handler 8 on 60000: exiting
> > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server
> > handler 7 on 60000: exiting
> > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server
> > handler 5 on 60000: exiting
> > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.hbase.master.HMaster:
> > Stopping infoServer
> > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server
> > handler 2 on 60000: exiting
> > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server
> > handler 6 on 60000: exiting
> > 2011-03-14 20:41:36,096 INFO org.apache.hadoop.ipc.HBaseServer: IPC
> Server
> > handler 9 on 60000: exiting
> > 2011-03-14 20:41:36,095 INFO org.apache.hadoop.hbase.master.LogCleaner:
> > master-cloud135:60000.oldLogCleaner exiting
> > 2011-03-14 20:41:36,096 INFO org.mortbay.log: Stopped
> > SelectChannelConnector@0.0.0.0:60010
> > 2011-03-14 20:41:36,222 DEBUG
> > org.apache.hadoop.hbase.catalog.CatalogTracker: Stopping catalog tracker
> > org.apache.hadoop.hbase.catalog.CatalogTracker@62770d2e
> > 2011-03-14 20:41:36,323 INFO
> >
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
> > Closed zookeeper sessionid=0x12e9a2e21be0038
> > 2011-03-14 20:41:36,328 INFO org.apache.zookeeper.ClientCnxn: EventThread
> > shut down
> > 2011-03-14 20:41:36,328 INFO org.apache.zookeeper.ZooKeeper: Session:
> > 0x12e9a2e21be0038 closed
> > 2011-03-14 20:41:36,340 INFO org.apache.zookeeper.ClientCnxn: EventThread
> > shut down
> > 2011-03-14 20:41:36,340 INFO org.apache.zookeeper.ZooKeeper: Session:
> > 0x22e9a2e22390033 closed
> > 2011-03-14 20:41:36,340 INFO org.apache.hadoop.hbase.master.HMaster:
> HMaster
> > main thread exiting
> >
> > ====
> >
> > The region server that died is cloud140 (10.241.67.33). What confuses
> > us is why the death of a region server would cause the master to shut
> > itself down.
> > On the other hand, the standby master started, but we still could not
> > manipulate the table because there were too many WALs to recover.
> >
> > I guess it relates to the previous thread 'HMaster startup is very
> > slow, and always run into out-of-memory issue' I submitted.
> >
> > I guess both issues come down to GC tuning, but so far I have only
> > found general guidelines like using the CMS collector. Any others,
> > particularly for bigger cells?
> >
> > Thanks and regards,
> >
> > Mao Xu-Feng
> >
>
