hbase-user mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: HBase 0.19.0: Xcievers / shutdown / JVM GC settings
Date Mon, 23 Feb 2009 15:40:39 GMT
Jean-Adrien,

Regarding #2, I googled the error and stumbled upon
https://issues.apache.org/jira/browse/HADOOP-4866, which blames the client
software. In our case, that would mean HBase is doing something wrong. This is
a bug; could you file a Jira?

Regarding #3, we have been tackling the same problem here at
openplaces.com, and we managed to get our memory usage just a bit lower
than the maximum memory on our nodes. This is very useful info; could you add
an entry on the Troubleshooting page about it?
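
(To make the sizing concrete, a rough, hypothetical example: on a 4 GB node
running a datanode and a region server, something like a 1 GB datanode heap
plus a 2 GB region server heap still leaves headroom for the OS and other
processes. The point is simply that the sum of the configured heap sizes
should stay comfortably below physical RAM so the JVMs never touch swap.)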

Thank you for taking the time to post all that good info!

J-D

On Mon, Feb 23, 2009 at 4:56 AM, Jean-Adrien <adv1@jeanjean.ch> wrote:

>
> Hello,
>
> I recently upgraded to HBase / Hadoop 0.19.0 and I have some notes to
> share:
>
> == 1. Xcievers. ==================
>
> First of all, thank you for this new version.
> Regarding the Hadoop setting for xciever threads,
> as noticed by stack
> (http://www.nabble.com/Datanode-Xceivers-td21372227.html#a21567289),
> the settings for xcievers and dfs.datanode.socket.write.timeout can be
> enabled without the problems noticed with 0.18.0.
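>
> A sketch of the corresponding hadoop-site.xml entry (the property name and
> value here are illustrative, not a recommendation), with
> dfs.datanode.socket.write.timeout simply left at its default:
>
> ### hadoop-site.xml (sketch)
>   <property>
>     <name>dfs.datanode.max.xcievers</name>
>     <value>2047</value>
>   </property>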
>
> == 2. Stopping HBase ==============
> I noticed a new problem:
> If, for any reason, one of my datanodes becomes unavailable for a while,
> my DFS ends up in an unstable state and I have to restart the cluster.
> But when I want to stop HBase, one of the region servers never stops,
> because it keeps retrying to reach an 'unavailable' block:
>
> ## region server log
> 2009-02-23 09:50:09,864 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 failed  because recovery
> from primary datanode 192.168.1.10:50010 failed 6 times. Marking primary
> datanode as bad.
> 2009-02-23 09:50:10,864 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 bad datanode[1]
> 192.168.1.10:50010
> 2009-02-23 09:50:10,864 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 in pipeline
> 192.168.1.13:50010, 192.168.1.10:50010: bad datanode 192.168.1.10:50010
> 2009-02-23 09:50:10,869 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 failed  because recovery
> from primary datanode 192.168.1.13:50010 failed 1 times. Will retry...
> 2009-02-23 09:50:11,869 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 bad datanode[1]
> 192.168.1.10:50010
> 2009-02-23 09:50:11,869 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 in pipeline
> 192.168.1.13:50010, 192.168.1.10:50010: bad datanode 192.168.1.10:50010
> 2009-02-23 09:50:11,875 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_7545556036225037274_1820952 failed  because recovery
> from primary datanode 192.168.1.13:50010 failed 2 times. Will retry...
>
> This is logged in an endless loop.
> The datanode says the block is 'already committed'? The corresponding error
> in the datanode log:
>
> ### datanode log
> 2009-02-23 09:50:43,532 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Client calls
> recoverBlock(block=blk_7545556036225037274_1820952,
> targets=[192.168.1.10:50010])
> 2009-02-23 09:50:43,534 INFO org.apache.hadoop.ipc.Server: IPC Server
> handler 5 on 50020, call recoverBlock(blk_7545556036225037274_1820952,
> false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;@1879f77) from
> 192.168.1.13:48587: error: org.apache.hadoop.ipc.RemoteException:
> java.io.IOException: blk_7545556036225037274_1820952 is already commited,
> storedBlock == null.
>        at
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4536)
>        at
>
> org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:402)
>        at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
>        at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
>
> org.apache.hadoop.ipc.RemoteException: java.io.IOException:
> blk_7545556036225037274_1820952 is already commited, storedBlock == null.
>        at
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4536)
>        at
>
> org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:402)
>        at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
>        at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
>
>        at org.apache.hadoop.ipc.Client.call(Client.java:696)
>        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>        at $Proxy4.nextGenerationStamp(Unknown Source)
>        at
>
> org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1466)
>        at
>
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1440)
>        at
>
> org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1506)
>        at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
>        at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
>
> A kill (-TERM) on the region server launches the SIGTERM handler thread,
> but the loop that tries to reach the block does not stop.
> Here is a thread dump of the region server at that point:
>
> ### thread dump of the region server
> Full thread dump Java HotSpot(TM) Server VM (10.0-b23 mixed mode):
>
> "Thread-16" prio=10 tid=0x081c3c00 nid=0x6e81 in Object.wait()
> [0x8f74f000..0x8f74fec0]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        at java.lang.Thread.join(Thread.java:1143)
>        - locked <0x93d0d060> (a java.lang.Thread)
>        at java.lang.Thread.join(Thread.java:1196)
>        at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:78)
>        at org.apache.hadoop.hbase.util.Threads.shutdown(Threads.java:66)
>        at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer$ShutdownThread.run(HRegionServer.java:814)
>
> "SIGTERM handler" daemon prio=10 tid=0x08dbcc00 nid=0x6e80 in Object.wait()
> [0x8edaf000..0x8edaff40]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        at java.lang.Thread.join(Thread.java:1143)
>        - locked <0x93d0f3c0> (a
> org.apache.hadoop.hbase.regionserver.HRegionServer$ShutdownThread)
>        at java.lang.Thread.join(Thread.java:1196)
>        at
> java.lang.ApplicationShutdownHooks.run(ApplicationShutdownHooks.java:79)
>        at java.lang.Shutdown.runHooks(Shutdown.java:89)
>        at java.lang.Shutdown.sequence(Shutdown.java:133)
>        at java.lang.Shutdown.exit(Shutdown.java:178)
>        - locked <0xb0d3ed60> (a java.lang.Class for java.lang.Shutdown)
>        at java.lang.Terminator$1.handle(Terminator.java:35)
>        at sun.misc.Signal$1.run(Signal.java:195)
>        at java.lang.Thread.run(Thread.java:619)
>
> "Attach Listener" daemon prio=10 tid=0x081e8800 nid=0x6e50 waiting on
> condition [0x00000000..0x00000000]
>   java.lang.Thread.State: RUNNABLE
>
> "IPC Client (47) connection to /192.168.1.13:50020 from an unknown user"
> daemon prio=10 tid=0x8e97a400 nid=0x39d1 in Object.wait()
> [0x8ec81000..0x8ec81fc0]
>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        at
> org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:395)
>        - locked <0xa44a36a0> (a org.apache.hadoop.ipc.Client$Connection)
>        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:437)
>
> "IPC Client (47) connection to /192.168.1.10:50020 from an unknown user"
> daemon prio=10 tid=0x8e5f1c00 nid=0x523f in Object.wait()
> [0x8ecd3000..0x8ecd3140]
>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        at
> org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:395)
>        - locked <0xa448ca50> (a org.apache.hadoop.ipc.Client$Connection)
>        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:437)
>
> "BlockFSInputStreamReferenceQueueChecker" daemon prio=10 tid=0x085a9400
> nid=0x8c8 waiting on condition [0x8ee6c000..0x8ee6cdc0]
>   java.lang.Thread.State: WAITING (parking)
>        at sun.misc.Unsafe.park(Native Method)
>        - parking to wait for  <0x93e76068> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>        at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>        at java.util.concurrent.DelayQueue.take(DelayQueue.java:160)
>        at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:582)
>        at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:575)
>        at
>
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:946)
>        at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:906)
>        at java.lang.Thread.run(Thread.java:619)
>
> "IPC Server handler 9 on 60020" daemon prio=10 tid=0x0882e000 nid=0x889
> waiting on condition [0x8eebd000..0x8eebdfc0]
>   java.lang.Thread.State: WAITING (parking)
>        at sun.misc.Unsafe.park(Native Method)
>        - parking to wait for  <0x93d0e9c8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>        at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>        at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881)
>
> "IPC Server handler 8 on 60020" daemon prio=10 tid=0x0882cc00 nid=0x888
> waiting on condition [0x8ef0e000..0x8ef0f040]
>   java.lang.Thread.State: WAITING (parking)
>        at sun.misc.Unsafe.park(Native Method)
>        - parking to wait for  <0x93d0e9c8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>        at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>        at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881)
>
> "IPC Server handler 7 on 60020" daemon prio=10 tid=0x0882b800 nid=0x887
> waiting on condition [0x8ef5f000..0x8ef5fec0]
>   java.lang.Thread.State: WAITING (parking)
>        at sun.misc.Unsafe.park(Native Method)
>        - parking to wait for  <0x93d0e9c8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>        at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>        at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881)
>
> "IPC Server handler 6 on 60020" daemon prio=10 tid=0x0882a400 nid=0x886
> waiting on condition [0x8efb0000..0x8efb0f40]
>   java.lang.Thread.State: WAITING (parking)
>        at sun.misc.Unsafe.park(Native Method)
>        - parking to wait for  <0x93d0e9c8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>        at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>        at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881)
>
> "IPC Server handler 5 on 60020" daemon prio=10 tid=0x08829000 nid=0x885
> waiting on condition [0x8f001000..0x8f001dc0]
>   java.lang.Thread.State: WAITING (parking)
>        at sun.misc.Unsafe.park(Native Method)
>        - parking to wait for  <0x93d0e9c8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>        at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>        at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881)
>
> "IPC Server handler 4 on 60020" daemon prio=10 tid=0x083bcc00 nid=0x884
> waiting on condition [0x8f052000..0x8f052e40]
>   java.lang.Thread.State: WAITING (parking)
>        at sun.misc.Unsafe.park(Native Method)
>        - parking to wait for  <0x93d0e9c8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>        at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>        at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881)
>
> "IPC Server handler 3 on 60020" daemon prio=10 tid=0x083bb800 nid=0x883
> waiting on condition [0x8f0a3000..0x8f0a40c0]
>   java.lang.Thread.State: WAITING (parking)
>        at sun.misc.Unsafe.park(Native Method)
>        - parking to wait for  <0x93d0e9c8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>        at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>        at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881)
>
> "IPC Server handler 2 on 60020" daemon prio=10 tid=0x083ba400 nid=0x882
> waiting on condition [0x8f0f4000..0x8f0f5140]
>   java.lang.Thread.State: WAITING (parking)
>        at sun.misc.Unsafe.park(Native Method)
>        - parking to wait for  <0x93d0e9c8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>        at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>        at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881)
>
> "IPC Server handler 1 on 60020" daemon prio=10 tid=0x083b9400 nid=0x881
> waiting on condition [0x8f145000..0x8f145fc0]
>   java.lang.Thread.State: WAITING (parking)
>        at sun.misc.Unsafe.park(Native Method)
>        - parking to wait for  <0x93d0e9c8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>        at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>        at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881)
>
> "IPC Server handler 0 on 60020" daemon prio=10 tid=0x083b8400 nid=0x880
> waiting on condition [0x8f196000..0x8f197040]
>   java.lang.Thread.State: WAITING (parking)
>        at sun.misc.Unsafe.park(Native Method)
>        - parking to wait for  <0x93d0e9c8> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>        at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
>        at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:881)
>
> "IPC Server listener on 60020" daemon prio=10 tid=0x083bfc00 nid=0x87f
> runnable [0x8f1e7000..0x8f1e7ec0]
>   java.lang.Thread.State: RUNNABLE
>        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>        - locked <0x93d0c090> (a sun.nio.ch.Util$1)
>        - locked <0x93d0c120> (a java.util.Collections$UnmodifiableSet)
>        - locked <0x93d0c0b0> (a sun.nio.ch.EPollSelectorImpl)
>        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Listener.run(HBaseServer.java:299)
>
> "IPC Server Responder" daemon prio=10 tid=0x083bec00 nid=0x87e runnable
> [0x8f238000..0x8f238f40]
>   java.lang.Thread.State: RUNNABLE
>        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
>        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
>        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
>        - locked <0x93d0c0f0> (a sun.nio.ch.Util$1)
>        - locked <0x93d0c100> (a java.util.Collections$UnmodifiableSet)
>        - locked <0x93d0be30> (a sun.nio.ch.EPollSelectorImpl)
>        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
>        at
> org.apache.hadoop.hbase.ipc.HBaseServer$Responder.run(HBaseServer.java:458)
>
> "SocketListener0-1" prio=10 tid=0x088d5000 nid=0x87c in Object.wait()
> [0x8f2da000..0x8f2dae40]
>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
>        - locked <0x93e39bc8> (a org.mortbay.util.ThreadPool$PoolThread)
>
> "SocketListener0-0" prio=10 tid=0x08ab5000 nid=0x87b in Object.wait()
> [0x8f32b000..0x8f32c0c0]
>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:522)
>        - locked <0x93e39920> (a org.mortbay.util.ThreadPool$PoolThread)
>
> "Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=60030]"
> prio=10
> tid=0x084f6800 nid=0x87a runnable [0x8f37c000..0x8f37d140]
>   java.lang.Thread.State: RUNNABLE
>        at java.net.PlainSocketImpl.socketAccept(Native Method)
>        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:384)
>        - locked <0x93e382a8> (a java.net.SocksSocketImpl)
>        at java.net.ServerSocket.implAccept(ServerSocket.java:453)
>        at java.net.ServerSocket.accept(ServerSocket.java:421)
>        at
> org.mortbay.util.ThreadedServer.acceptSocket(ThreadedServer.java:432)
>        at
> org.mortbay.util.ThreadedServer$Acceptor.run(ThreadedServer.java:631)
>
> "SessionScavenger" daemon prio=10 tid=0x088f1400 nid=0x879 waiting on
> condition [0x8f3cd000..0x8f3cdfc0]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>        at java.lang.Thread.sleep(Native Method)
>        at
>
> org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger.run(AbstractSessionManager.java:587)
>
> "SessionScavenger" daemon prio=10 tid=0x08ab5c00 nid=0x878 waiting on
> condition [0x8f41e000..0x8f41f040]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>        at java.lang.Thread.sleep(Native Method)
>        at
>
> org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger.run(AbstractSessionManager.java:587)
>
> "regionserver/0.0.0.0:60020.leaseChecker" prio=10 tid=0x086d5800 nid=0x877
> waiting on condition [0x8f476000..0x8f476ec0]
>   java.lang.Thread.State: TIMED_WAITING (parking)
>        at sun.misc.Unsafe.park(Native Method)
>        - parking to wait for  <0x93d0e980> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>        at
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
>        at
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
>        at java.util.concurrent.DelayQueue.poll(DelayQueue.java:201)
>        at org.apache.hadoop.hbase.Leases.run(Leases.java:78)
>
> "regionserver/0.0.0.0:60020.majorCompactionChecker" daemon prio=10
> tid=0x086d4000 nid=0x876 waiting on condition [0x8f4c7000..0x8f4c7f40]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>        at java.lang.Thread.sleep(Native Method)
>        at org.apache.hadoop.hbase.util.Sleeper.sleep(Sleeper.java:74)
>        at org.apache.hadoop.hbase.Chore.run(Chore.java:72)
>
> "LeaseChecker" daemon prio=10 tid=0x0837cc00 nid=0x870 waiting on condition
> [0x8f6ad000..0x8f6ae040]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>        at java.lang.Thread.sleep(Native Method)
>        at
> org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:979)
>        at java.lang.Thread.run(Thread.java:619)
>
> "DataStreamer for file
> /hbase/log_192.168.1.13_1235129194745_60020/hlog.dat.1235129195649 block
> blk_7545556036225037274_1820952" daemon prio=10 tid=0x08375400 nid=0x86f in
> Object.wait() [0x8f6fe000..0x8f6feec0]
>   java.lang.Thread.State: TIMED_WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        at
>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2166)
>        - locked <0x93d0ea10> (a java.util.LinkedList)
>
> "DestroyJavaVM" prio=10 tid=0x0805a000 nid=0x855 waiting on condition
> [0x00000000..0xb7dd2090]
>   java.lang.Thread.State: RUNNABLE
>
> "regionserver/0.0.0.0:60020" prio=10 tid=0x085b2400 nid=0x863 in
> Object.wait() [0x8f7f7000..0x8f7f7f40]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        at java.lang.Object.wait(Object.java:485)
>        at
>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.flushInternal(DFSClient.java:3015)
>        - locked <0x93d0ea10> (a java.util.LinkedList)
>        - locked <0x93d0ec58> (a
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
>        at
>
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3104)
>        - locked <0x93d0ec58> (a
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream)
>        at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053)
>        at
>
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
>        at
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
>        at
> org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:959)
>        - locked <0x93d50d20> (a org.apache.hadoop.io.SequenceFile$Writer)
>        at org.apache.hadoop.hbase.regionserver.HLog.close(HLog.java:421)
>        - locked <0x93d4ce00> (a java.lang.Integer)
>        at
> org.apache.hadoop.hbase.regionserver.HLog.closeAndDelete(HLog.java:404)
>        at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:373)
>        at java.lang.Thread.run(Thread.java:619)
>
> "Low Memory Detector" daemon prio=10 tid=0x080dd800 nid=0x85e runnable
> [0x00000000..0x00000000]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread1" daemon prio=10 tid=0x080dc400 nid=0x85d waiting on
> condition [0x00000000..0x8fbc1588]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread0" daemon prio=10 tid=0x080d9c00 nid=0x85c waiting on
> condition [0x00000000..0x8fc42608]
>   java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x080d8800 nid=0x85b runnable
> [0x00000000..0x8fc93e80]
>   java.lang.Thread.State: RUNNABLE
>
> "Surrogate Locker Thread (CMS)" daemon prio=10 tid=0x080d7800 nid=0x85a
> waiting on condition [0x00000000..0x8fce522c]
>   java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x080bbc00 nid=0x859 in Object.wait()
> [0x8fd7b000..0x8fd7bec0]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>        - locked <0x93d0bdc0> (a java.lang.ref.ReferenceQueue$Lock)
>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=10 tid=0x080ba800 nid=0x858 in
> Object.wait()
> [0x8fdcc000..0x8fdccf40]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        at java.lang.Object.wait(Object.java:485)
>        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>        - locked <0x93d0c110> (a java.lang.ref.Reference$Lock)
>
> "VM Thread" prio=10 tid=0x080b7800 nid=0x857 runnable
>
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x0806c800 nid=0x856 runnable
> "VM Periodic Task Thread" prio=10 tid=0x080df000 nid=0x85f waiting on
> condition
>
> JNI global references: 922
>
> This forces me to kill (-KILL) the region server and to restart HDFS (if I
> don't restart HDFS, some regions are not available because the block is
> still not available).
>
> Note that the HBase master does not exit either, and I have to kill (-KILL)
> it as well. Here is the thread dump of the HMaster process:
>
> ### Master thread dump:
>
> Full thread dump Java HotSpot(TM) Server VM (10.0-b23 mixed mode):
>
> "Attach Listener" daemon prio=10 tid=0x0805a000 nid=0x2326 waiting on
> condition [0x00000000..0x00000000]
>   java.lang.Thread.State: RUNNABLE
>
> "SIGTERM handler" daemon prio=10 tid=0x089f0c00 nid=0x2303 waiting for
> monitor entry [0x8a6a5000..0x8a6a5ec0]
>   java.lang.Thread.State: BLOCKED (on object monitor)
>        at java.lang.Shutdown.exit(Shutdown.java:178)
>        - waiting to lock <0xb0beed60> (a java.lang.Class for
> java.lang.Shutdown)
>        at java.lang.Terminator$1.handle(Terminator.java:35)
>        at sun.misc.Signal$1.run(Signal.java:195)
>        at java.lang.Thread.run(Thread.java:619)
>
> "Thread-1" prio=10 tid=0x8f666000 nid=0x653c waiting on condition
> [0x8f1fe000..0x8f1fef40]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>        at java.lang.Thread.sleep(Native Method)
>        at org.apache.hadoop.ipc.Client.stop(Client.java:667)
>        at org.apache.hadoop.ipc.RPC$ClientCache.stopClient(RPC.java:189)
>        at org.apache.hadoop.ipc.RPC$ClientCache.access$400(RPC.java:138)
>        at org.apache.hadoop.ipc.RPC$Invoker.close(RPC.java:229)
>        - locked <0x938a1b18> (a org.apache.hadoop.ipc.RPC$Invoker)
>        at org.apache.hadoop.ipc.RPC$Invoker.access$500(RPC.java:196)
>        at org.apache.hadoop.ipc.RPC.stopProxy(RPC.java:353)
>        at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:213)
>        - locked <0x93878470> (a org.apache.hadoop.hdfs.DFSClient)
>        at
>
> org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243)
>        at
> org.apache.hadoop.fs.FileSystem$Cache.closeAll(FileSystem.java:1413)
>        - locked <0x937d3c00> (a org.apache.hadoop.fs.FileSystem$Cache)
>        at org.apache.hadoop.fs.FileSystem.closeAll(FileSystem.java:236)
>        at
> org.apache.hadoop.fs.FileSystem$ClientFinalizer.run(FileSystem.java:221)
>        - locked <0x937b8a78> (a
> org.apache.hadoop.fs.FileSystem$ClientFinalizer)
>
> "DestroyJavaVM" prio=10 tid=0x8f66c800 nid=0x36e9 in Object.wait()
> [0xb7d7d000..0xb7d7e0e0]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        at java.lang.Thread.join(Thread.java:1143)
>        - locked <0x937b8a78> (a
> org.apache.hadoop.fs.FileSystem$ClientFinalizer)
>        at java.lang.Thread.join(Thread.java:1196)
>        at
> java.lang.ApplicationShutdownHooks.run(ApplicationShutdownHooks.java:79)
>        at java.lang.Shutdown.runHooks(Shutdown.java:89)
>        at java.lang.Shutdown.sequence(Shutdown.java:133)
>        at java.lang.Shutdown.shutdown(Shutdown.java:200)
>        - locked <0xb0beed60> (a java.lang.Class for java.lang.Shutdown)
>
> "Low Memory Detector" daemon prio=10 tid=0x8fe03800 nid=0x36f4 runnable
> [0x00000000..0x00000000]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread1" daemon prio=10 tid=0x8fe01c00 nid=0x36f3 waiting on
> condition [0x00000000..0x8f9ee488]
>   java.lang.Thread.State: RUNNABLE
>
> "CompilerThread0" daemon prio=10 tid=0x8fe00800 nid=0x36f2 waiting on
> condition [0x00000000..0x8fa6f508]
>   java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x0813e800 nid=0x36f1 runnable
> [0x00000000..0x8fac0d80]
>   java.lang.Thread.State: RUNNABLE
>
> "Surrogate Locker Thread (CMS)" daemon prio=10 tid=0x0813d800 nid=0x36f0
> waiting on condition [0x00000000..0x00000000]
>   java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x08125000 nid=0x36ef in Object.wait()
> [0x8fba1000..0x8fba1dc0]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>        - locked <0x937bafd8> (a java.lang.ref.ReferenceQueue$Lock)
>        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
>
> "Reference Handler" daemon prio=10 tid=0x08120c00 nid=0x36ee in
> Object.wait() [0x8fbf2000..0x8fbf2e40]
>   java.lang.Thread.State: WAITING (on object monitor)
>        at java.lang.Object.wait(Native Method)
>        at java.lang.Object.wait(Object.java:485)
>        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
>        - locked <0x937baf18> (a java.lang.ref.Reference$Lock)
>
> "VM Thread" prio=10 tid=0x0811d800 nid=0x36ed runnable
>
> "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x0805e000 nid=0x36ea
> runnable
>
> "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x0805f000 nid=0x36eb
> runnable
>
> "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x080cd000 nid=0x36ec
> runnable
> "VM Periodic Task Thread" prio=10 tid=0x8fe05000 nid=0x36f5 waiting on
> condition
>
> JNI global references: 1044
>
> When I restart DFS, the log shows me that the block in question is marked as
> invalid, and it is therefore deleted:
>
> #### Namenode log
> 2009-02-23 10:33:49,245 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.processReport: block blk_7545556036225037274_1820952 on
> 192.168.1.10:50010 size 27895296 does not
>  belong to any file.
> 2009-02-23 10:33:49,245 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_7545556036225037274 is added to invalidSet
> of 192.168.1.10:50010
>
> Finally, when I restart HBase, I see some of these messages in the region
> server log, and I have lost some data.
>
> #### Region server log:
> 2009-02-23 10:38:55,831 WARN org.apache.hadoop.hbase.regionserver.HStore:
> Exception processing reconstruction log
> hdfs://lab5-2:9000/hbase/proxy-D-0.3/510178748/oldlogfile.log opening
> [B@ff65bf -- continuing.  Probably lack-of-HADOOP-1700 causing DATA LOSS!
> java.io.EOFException
>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>        at
>
> org.apache.hadoop.hbase.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:70)
>        at
>
> org.apache.hadoop.hbase.io.DataOutputBuffer.write(DataOutputBuffer.java:116)
>        at
> org.apache.hadoop.hbase.io.SequenceFile$Reader.next(SequenceFile.java:1944)
>        at
> org.apache.hadoop.hbase.io.SequenceFile$Reader.next(SequenceFile.java:1844)
>        at
> org.apache.hadoop.hbase.io.SequenceFile$Reader.next(SequenceFile.java:1890)
>        at
>
> org.apache.hadoop.hbase.regionserver.HStore.doReconstructionLog(HStore.java:352)
>        at
>
> org.apache.hadoop.hbase.regionserver.HStore.runReconstructionLog(HStore.java:297)
>        at
> org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:237)
>        at
>
> org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:1764)
>        at
> org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:276)
>        at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateRegion(HRegionServer.java:1367)
>        at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer.openRegion(HRegionServer.java:1338)
>        at
>
> org.apache.hadoop.hbase.regionserver.HRegionServer$Worker.run(HRegionServer.java:1253)
>        at java.lang.Thread.run(Thread.java:619)
>
> I have no idea exactly what happened to cause this problem, but it is
> certainly because my datanode became unresponsive for a while.
>
>
> ==== 3. RS leases / JVM settings (troubleshooting section 7) =====
>
> A useful note for people who suffer from swapping and have region servers
> that exit when they miss a lease to contact the master. In my case the delay
> can be up to 15 minutes when my system becomes overloaded by I/O during a
> heavy swapping phase. But even if the swap slows down the whole system, that
> is not by itself a reason for the timers not to fire as soon as they get a
> bit of CPU; the real reason is that the swapping happens while the region
> server JVM is collecting garbage.
> The default garbage collector of the JVM fills its entire space until it
> starts a 'full' GC in the old generation (see the GC article linked below
> for details), and during that GC it blocks all threads, timers included.
> In my case much of that memory space sits on the swap partition, so the GC
> leads to intensive swapping and takes 10-15 minutes, during which the region
> server is 'paused' and the leases are missed. The problem is obviously that
> my weak hardware does not meet the minimal requirements for HBase;
> nevertheless, one can enable another GC implementation in the HotSpot JVM.
> For example, the concurrent garbage collector (-XX:+UseConcMarkSweepGC)
> helps a lot, since it does not stop the threads while it runs.
> Unfortunately, even with this setting a 'full' GC still happens sometimes
> if memory becomes too fragmented, which again pauses the JVM, swaps, and
> eventually makes the RS miss its lease. Anyway, it happens much less
> frequently.
>
> See http://java.sun.com/docs/hotspot/gc1.4.2/faq.html for more info about
> JVM GC and how to minimize 'full' GC events.
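>
> For what it's worth, the kind of change I mean is just an extra JVM flag for
> the region servers, e.g. via HBASE_OPTS in conf/hbase-env.sh (a sketch with
> illustrative flags, not a tuned recommendation):
>
> ### conf/hbase-env.sh (sketch)
> # Use the concurrent collector to avoid long stop-the-world full GCs.
> export HBASE_OPTS="-XX:+UseConcMarkSweepGC"
> # Optionally log GC activity so the long pauses show up in the logs.
> # export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails"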
>
> Have a nice day
> -- Jean-Adrien
>
> --
> View this message in context:
> http://www.nabble.com/HBase-0.19.0%3A-Xcievers---shutdown---JVM-GC-settings-tp22158436p22158436.html
> Sent from the HBase User mailing list archive at Nabble.com.
>
>
