hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (HBASE-642) Splitting log in a hostile environment -- bad hdfs -- we drop write-ahead-log edits
Date Wed, 28 May 2008 18:15:45 GMT

     [ https://issues.apache.org/jira/browse/HBASE-642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-642.
-------------------------

    Resolution: Fixed
      Assignee: stack

Committed to branch and trunk.

Changed the try/catches around the reconstruction-log replay so that we only catch, and continue
past, EOFExceptions. Any other IOException is just reported and the region is left to fail on
each open attempt until human intervention; an IOE while playing the reconstruction log means
something is badly wrong, probably in hdfs.

{code}
Index: src/java/org/apache/hadoop/hbase/HStore.java
===================================================================
--- src/java/org/apache/hadoop/hbase/HStore.java        (revision 660681)
+++ src/java/org/apache/hadoop/hbase/HStore.java        (working copy)
@@ -825,13 +825,21 @@
     
     try {
       doReconstructionLog(reconstructionLog, maxSeqId, reporter);
+    } catch (EOFException e) {
+      // Presume we got here because of lack of HADOOP-1700; for now keep going
+      // but this is probably not what we want long term.  If we got here there
+      // has been data-loss
+      LOG.warn("Exception processing reconstruction log " + reconstructionLog +
+        " opening " + this.storeName +
+        " -- continuing.  Probably lack-of-HADOOP-1700 causing DATA LOSS!", e);
     } catch (IOException e) {
-      // Presume we got here because of some HDFS issue or because of a lack of
-      // HADOOP-1700; for now keep going but this is probably not what we want
-      // long term.  If we got here there has been data-loss
+      // Presume we got here because of some HDFS issue. Don't just keep going.
+      // Fail to open the HStore.  Probably means we'll fail over and over
+      // again until human intervention but alternative has us skipping logs
+      // and losing edits: HBASE-642.
       LOG.warn("Exception processing reconstruction log " + reconstructionLog +
-        " opening " + this.storeName +
-        " -- continuing.  Probably DATA LOSS!", e);
+        " opening " + this.storeName, e);
+      throw e;
     }
{code}
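
As a standalone illustration (not HStore code or anything from the patch), EOFException is what a truncated record stream throws when it runs out of bytes mid-record; the log just stops early. That is a different situation from a general hdfs I/O failure, which is why the two are now handled differently. A hypothetical sketch of the distinction:

{code}
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;

public class EofVsIoe {
  public static void main(String[] args) throws IOException {
    // Only 3 bytes, but readInt() needs 4: the stream ends mid-record,
    // the same shape as a truncated reconstruction log.
    DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(new byte[] {0, 0, 1}));
    try {
      in.readInt();
    } catch (EOFException e) {
      // Log stopped early: warn about possible missing tail edits but keep going.
      System.out.println("EOF: tolerate and continue opening the store");
    } catch (IOException e) {
      // Anything else means the filesystem itself is misbehaving: rethrow so
      // the open fails and keeps failing until a human looks at hdfs.
      throw e;
    } finally {
      in.close();
    }
  }
}
{code}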


> Splitting log in a hostile environment -- bad hdfs -- we drop write-ahead-log edits
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-642
>                 URL: https://issues.apache.org/jira/browse/HBASE-642
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>             Fix For: 0.1.3, 0.2.0
>
>
> The master has noticed that the regionserver that was carrying the .META. region among others has died and it goes to split its logs:
> {code}
> 2008-05-18 19:58:01,292 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 2: hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.017
> 2008-05-18 19:58:01,408 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/1060231198/oldlogfile.log and region categories,2864153,1211005494348
> 2008-05-18 19:58:01,573 DEBUG org.apache.hadoop.hbase.HLog: Creating new log file writer for path hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/297165731/oldlogfile.log and region categories,5992242,1211005494349
> {code}
> The master can't write to hdfs for some reason, so it can't do the log split:
> {code}
> 2008-05-18 19:59:15,265 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: Read timed out
> 2008-05-18 19:59:15,266 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_7852777250062244002
> 2008-05-18 19:59:15,268 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 10.252.219.207:50010
> 2008-05-18 19:59:39,945 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.254.30.79:60020. Already tried 6 time(s).
> 2008-05-18 20:00:21,274 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.254.30.79:50010
> 2008-05-18 20:00:21,275 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_7007215478628265924
> 2008-05-18 20:00:21,277 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 10.252.219.207:50010
> 2008-05-18 20:00:40,955 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.254.30.79:60020. Already tried 7 time(s).
> 2008-05-18 20:01:31,178 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 10.254.30.79:50010
> 2008-05-18 20:01:31,178 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_2374125514769088471
> 2008-05-18 20:01:31,180 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 10.252.219.207:50010
> 2008-05-18 20:01:40,145 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException: Read timed out
> 2008-05-18 20:01:40,145 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_-621042589816139684
> 2008-05-18 20:01:40,148 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node: 10.252.219.207:50010
> ..
> {code}
> Weirdly, the above is complaining that it can't connect to the datanode running on the same host as the master.
> Eventually the split fails with:
> {code}
> 2008-05-18 20:24:28,393 WARN org.apache.hadoop.hbase.HMaster: Processing pending operations: ProcessServerShutdown of 10.254.30.79:60020
> java.io.IOException: java.io.IOException: Could not complete write to file /hbase/categories/1060231198/oldlogfile.log by DFSClient_520078809
>     at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:343)
>     at sun.reflect.GeneratedMethodAccessor44.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:409)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:901)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:82)
>     at org.apache.hadoop.hbase.HMaster.run(HMaster.java:1116)
> 2008-05-18 20:24:28,394 DEBUG org.apache.hadoop.hbase.HMaster: Main processing loop: ProcessServerShutdown of 10.254.30.79:60020
> 2008-05-18 20:24:28,394 INFO org.apache.hadoop.hbase.HMaster: process shutdown of server 10.254.30.79:60020: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2008-05-18 20:24:28,395 INFO org.apache.hadoop.hbase.HLog: splitting 1 log(s) in hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020
> 2008-05-18 20:24:28,395 DEBUG org.apache.hadoop.hbase.HLog: Splitting 0 of 1: hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.018
> 2008-05-18 20:24:28,399 WARN org.apache.hadoop.hbase.HLog: Old log file hdfs://domU-12-31-38-00-D4-21:9000/hbase/categories/297165731/oldlogfile.log already exists. Copying existing file to new file
> 2008-05-18 20:25:03,105 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.254.30.79:60020. Already tried 9 time(s).
> 2008-05-18 20:25:28,401 WARN org.apache.hadoop.hbase.HLog: Exception processing hdfs://domU-12-31-38-00-D4-21:9000/hbase/log_10.254.30.79_1210899434766_60020/hlog.dat.018 -- continuing. Possible DATA LOSS!
> java.net.SocketTimeoutException: timed out waiting for rpc response
> ...
> {code}
> And we just move on to the next log -- we notice the earlier attempt at distributing the edits and try to pick them up (though in this case, the file is likely empty) -- but the split of the new log also fails.
> Though the log says 'Possible DATA LOSS!', we keep going. The .META. and other regions are reassigned and deployed even though they are likely missing edits.
> In this catastrophic case, I'd say the master should not move on and, if it has to, should go down rather than reassign regions and try to keep going.
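
A rough sketch of what that fail-fast policy could look like; the class and method names below are hypothetical, not the actual HMaster code:

{code}
import java.io.IOException;

// Hypothetical sketch only: if splitting a dead server's logs fails, stop the
// master rather than reassigning regions that would come up missing edits.
public class FailFastShutdownHandler {
  void splitDeadServerLogs(String server) throws IOException { /* split the WALs */ }
  void reassignRegions(String server) { /* hand regions to live servers */ }

  void processServerShutdown(String server) {
    try {
      splitDeadServerLogs(server);
    } catch (IOException e) {
      // Do not redeploy regions whose edits may still be stranded in an
      // unsplit log; go down and wait for human intervention instead.
      System.err.println("Log split failed for " + server + ", aborting: " + e);
      System.exit(1);
      return;
    }
    // Only reassign once the write-ahead-log edits are safely split out.
    reassignRegions(server);
  }
}
{code}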

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

