hbase-issues mailing list archives

From "gaojinchao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-3722) A lot of data is lost when name node crashed
Date Tue, 12 Apr 2011 10:53:05 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018791#comment-13018791 ]

gaojinchao commented on HBASE-3722:
-----------------------------------

My cluster:
1. The HDFS cluster uses HA NameNodes (ANN and BNN).
2. HBase version 0.90.1:
  Active HMaster: C4C1
  Backup HMaster: C4C2
  Region servers: C4C3, C4C4, C4C5, ...

Sequence of events:
1. The ANN crashed and the BNN became active (this takes some time).
2. Some region servers that the HBase client was putting data into crashed (e.g. C4C3, which hosted the meta table); other region servers stayed up.
3. The HMaster failed to split the HLog and skipped it.
4. The BNN became active and the HMaster finished processing the shutdown event.
5. A lot of the data on the crashed region servers was lost.


Logs:
14:57:58 C4C3 shut itself down because the ANN crashed.
The HMaster skipped the log split and reassigned the meta table.

2011-04-12 14:57:58,782 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
Splitting logs for C4C3.site,60020,1302590910433
2011-04-12 14:57:59,790 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000.
Already tried 0 time(s).
....
2011-04-12 14:58:08,793 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000.
Already tried 9 time(s).
2011-04-12 14:58:08,795 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting
hdfs://C4C1:9000/hbase/.logs/C4C3.site,60020,1302590910433
java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on connection exception: java.net.ConnectException:
Connection refused
2011-04-12 14:58:08,805 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Unsetting
ROOT region location in ZooKeeper
2011-04-12 14:58:08,880 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification
of .META.,,1 at address=C4C3.site:60020; java.net.ConnectException: Connection refused
2011-04-12 14:58:08,880 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Current cached
META location is not valid, resetting

The HMaster finished processing the shutdown event once the BNN became active and the meta table was reassigned:

2011-04-12 15:00:31,681 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000.
Already tried 0 time(s).
2011-04-12 15:00:32,682 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: C4C1/157.5.100.1:9000.
Already tried 1 time(s).
2011-04-12 15:00:40,698 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in
transition timed out:  .META.,,1.1028785192 state=OPENING, ts=1302591600701
2011-04-12 15:00:40,699 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has
been OPENING for too long, reassigning region=.META.,,1.1028785192
2011-04-12 15:00:40,709 INFO org.apache.hadoop.hbase.master.AssignmentManager: Successfully
transitioned region=.META.,,1.1028785192 into OFFLINE and forcing a new assignment
2011-04-12 15:00:40,712 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in
transition timed out:  -ROOT-,,0.70236052 state=OPENING, ts=1302591600718
2011-04-12 15:00:40,712 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has
been OPENING for too long, reassigning region=-ROOT-,,0.70236052
2011-04-12 15:00:40,725 INFO org.apache.hadoop.hbase.master.AssignmentManager: Successfully
transitioned region=-ROOT-,,0.70236052 into OFFLINE and forcing a new assignment
2011-04-12 15:00:40,892 INFO org.apache.hadoop.hbase.zookeeper.MetaNodeTracker: Detected completed
assignment of META, notifying catalog tracker
2011-04-12 15:00:45,870 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
Reassigning 0 region(s) that C4C3.site,60020,1302590910433 was carrying (skipping 0 regions(s)
that are already in transition)
2011-04-12 15:00:45,870 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
Finished processing of shutdown of C4C3.site,60020,1302590910433



The data in the skipped HLog stays lost unless the HMaster is restarted after the NN recovers.
So I think the HMaster should shut itself down when the NN crashes,
just as a region server shuts itself down when it catches any IOException while rolling the HLog.
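
To make the proposal concrete, here is a rough sketch of the behaviour I have in mind. It is not the attached patch and not the real HBase code: LogSplitter, Abortable and splitLogOrAbort are hypothetical stand-ins defined only for this illustration. The idea is that a failed split aborts the master instead of being logged and skipped, so the regions are not reassigned without their edits.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.ArrayList;
import java.util.List;

public class SplitOrAbortSketch {

    /** Hypothetical stand-in for the log-splitting step (not the HBase API). */
    interface LogSplitter {
        void splitLog(String serverName) throws IOException;
    }

    /** Hypothetical stand-in for whatever stops the master process. */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    /**
     * Tries to split the dead server's log. Returns true only when the split
     * succeeded, meaning it is safe to go on and reassign the regions.
     * On any IOException the master is asked to abort and false is returned,
     * so the caller must not continue with reassignment.
     */
    static boolean splitLogOrAbort(String serverName, LogSplitter splitter, Abortable master) {
        try {
            splitter.splitLog(serverName);
            return true;
        } catch (IOException e) {
            // Do not skip the log: stop the master so that a restart (or the
            // backup master) can split it once the NameNode is reachable again.
            master.abort("Failed splitting log for " + serverName, e);
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> aborts = new ArrayList<>();
        Abortable master = (why, cause) -> aborts.add(why);

        // Simulates the NameNode being down while the split is attempted.
        LogSplitter nameNodeDown = server -> {
            throw new ConnectException("Connection refused");
        };

        boolean safeToReassign =
            splitLogOrAbort("C4C3.site,60020,1302590910433", nameNodeDown, master);
        System.out.println("safeToReassign=" + safeToReassign + ", aborts=" + aborts);
    }
}
```

Whether aborting is better than retrying the split until the NN is back is open for discussion; the sketch only shows the abort variant suggested above.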

>  A lot of data is lost when name node crashed
> ---------------------------------------------
>
>                 Key: HBASE-3722
>                 URL: https://issues.apache.org/jira/browse/HBASE-3722
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.1
>            Reporter: gaojinchao
>         Attachments: HmasterFilesystem_PatchV1.patch
>
>
> I'm not sure exactly what caused it. The logs contain some failed log splits.
> The master should shut itself down when HDFS has crashed.
>  The logs are:
>  2011-03-22 13:21:55,056 WARN 
>  org.apache.hadoop.hbase.master.LogCleaner: Error while cleaning the 
>  logs
>  java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on connection exception:
java.net.ConnectException: Connection refused
>          at org.apache.hadoop.ipc.Client.wrapException(Client.java:844)
>          at org.apache.hadoop.ipc.Client.call(Client.java:820)
>          at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221)
>          at $Proxy5.getListing(Unknown Source)
>          at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:597)
>          at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>          at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>          at $Proxy5.getListing(Unknown Source)
>          at org.apache.hadoop.hdfs.DFSClient.listPaths(DFSClient.java:614)
>          at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:252)
>          at org.apache.hadoop.hbase.master.LogCleaner.chore(LogCleaner.java:121)
>          at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>          at 
>  org.apache.hadoop.hbase.master.LogCleaner.run(LogCleaner.java:154)
>  Caused by: java.net.ConnectException: Connection refused
>          at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>          at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>          at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>          at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
>          at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:332)
>          at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:202)
>          at org.apache.hadoop.ipc.Client.getConnection(Client.java:943)
>          at org.apache.hadoop.ipc.Client.call(Client.java:788)
>          ... 13 more
>  2011-03-22 13:21:56,056 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 0 time(s).
>  2011-03-22 13:21:57,057 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 1 time(s).
>  2011-03-22 13:21:58,057 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 2 time(s).
>  2011-03-22 13:21:59,057 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 3 time(s).
>  2011-03-22 13:22:00,058 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 4 time(s).
>  2011-03-22 13:22:01,058 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 5 time(s).
>  2011-03-22 13:22:02,059 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 6 time(s).
>  2011-03-22 13:22:03,059 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 7 time(s).
>  2011-03-22 13:22:04,059 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 8 time(s).
>  2011-03-22 13:22:05,060 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 9 time(s).
>  2011-03-22 13:22:05,060 ERROR 
>  org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting 
>  hdfs://C4C1:9000/hbase/.logs/C4C9.site,60020,1300767633398
>  java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on connection exception:
java.net.ConnectException: Connection refused
>          at org.apache.hadoop.ipc.Client.wrapException(Client.java:844)
>          at org.apache.hadoop.ipc.Client.call(Client.java:820)
>          at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221)
>          at $Proxy5.getFileInfo(Unknown Source)
>          at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
>          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>          at java.lang.reflect.Method.invoke(Method.java:597)
>          at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>          at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>          at $Proxy5.getFileInfo(Unknown Source)
>          at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:623)
>          at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:461)
>          at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:690)
>          at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:177)
>          at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:196)
>          at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:95)
>          at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:151)
>          at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>          at java.lang.Thread.run(Thread.java:662)
>  Caused by: java.net.ConnectException: Connection refused
>          at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>          at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>          at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>          at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:408)
>          at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:332)
>          at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:202)
>          at org.apache.hadoop.ipc.Client.getConnection(Client.java:943)
>          at org.apache.hadoop.ipc.Client.call(Client.java:788)
>          ... 18 more
>  2011-03-22 13:22:45,600 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 0 time(s).
>  2011-03-22 13:22:46,600 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 1 time(s).
>  2011-03-22 13:22:47,601 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 2 time(s).
>  2011-03-22 13:22:48,601 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 3 time(s).
>  2011-03-22 13:22:49,601 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 4 time(s).
>  2011-03-22 13:22:50,602 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 5 time(s).
>  2011-03-22 13:22:51,602 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 6 time(s).
>  2011-03-22 13:22:52,602 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 7 time(s).
>  2011-03-22 13:22:53,603 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 8 time(s).
>  2011-03-22 13:22:54,603 INFO org.apache.hadoop.ipc.Client: Retrying connect to server:
C4C1/157.5.100.1:9000. Already tried 9 time(s).
>  2011-03-22 13:22:54,603 WARN 
>  org.apache.hadoop.hbase.master.LogCleaner: Error while cleaning the 
>  logs
>  java.net.ConnectException: Call to C4C1/157.5.100.1:9000 failed on connection exception:
java.net.ConnectException: Connection refused
>          at org.apache.hadoop.ipc.Client.wrapException(Client.java:844)
>          at org.apache.hadoop.ipc.Client.call(Client.java:820)
>          at org.apache.hadoop.ipc.RPC$Invok

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
