hbase-issues mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HBASE-7385) Do not abort regionserver if StoreFlusher.flushCache() fails
Date Wed, 19 Dec 2012 00:36:14 GMT
Enis Soztutar created HBASE-7385:
------------------------------------

             Summary: Do not abort regionserver if StoreFlusher.flushCache() fails
                 Key: HBASE-7385
                 URL: https://issues.apache.org/jira/browse/HBASE-7385
             Project: HBase
          Issue Type: Improvement
          Components: io, regionserver
            Reporter: Enis Soztutar
            Assignee: Enis Soztutar


A rare NN failover may cause an RS abort through the following sequence of events:
 - RS tries to flush the memstore.
 - It creates a file, starts a block, and acquires a lease.
 - The block is completed and the lease removed, but the NN is killed before it sends the RPC response back to the client.
 - A new NN comes up and the client retries the block complete; the new NN throws a lease expired exception, since the block was already complete.
 - RS receives the exception and aborts.
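
The sequence above can be reproduced in a toy model. This is only a sketch: ToyNameNode, LostResponseDemo, and the path and holder strings are hypothetical names, not HDFS classes, and a single object stands in for both the old and the new NN on the assumption that the new NN sees the same namespace state.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Toy model only, not the HDFS NameNode: maps each open file to its lease holder. */
class ToyNameNode {
    private final Map<String, String> leases = new HashMap<>();

    /** Opens a file for writing and records the lease holder. */
    void create(String path, String holder) {
        leases.put(path, holder);
    }

    /** Completes the file and releases the lease; fails if the holder has no lease. */
    void complete(String path, String holder) throws IOException {
        if (!holder.equals(leases.get(path))) {
            throw new IOException("No lease on " + path + ": file is not open for writing");
        }
        leases.remove(path);
    }
}

class LostResponseDemo {
    /** Returns true if the retried complete() is rejected, as in the report. */
    static boolean retryIsRejected() {
        ToyNameNode nn = new ToyNameNode();
        nn.create("/tmp/flushfile", "rs-client");
        try {
            // First attempt succeeds on the NN; assume the response never reaches the client.
            nn.complete("/tmp/flushfile", "rs-client");
        } catch (IOException e) {
            return false;
        }
        try {
            // The client retries the same call; the lease is already gone.
            nn.complete("/tmp/flushfile", "rs-client");
            return false;
        } catch (IOException e) {
            return true;
        }
    }
}
```

The point of the sketch is that complete() is not idempotent in this model, so a retry after a lost response looks like an error to the caller even though the file was closed successfully.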

This is actually a NN + DFSClient issue: the DFS client in the RS never receives the RPC
response for the block close, and when it retries against the new NN it gets the exception,
since the file was already closed. Although the root cause is DFS client specific, we can also
make the RS more resilient by not aborting it on an exception from flushCache(). We can change
StoreFlusher so that:

 - StoreFlusher.prepare() becomes idempotent (and so does Memstore.snapshot()).
 - StoreFlusher.flushCache() throws IOException on a DFS exception; we catch the IOException and abort just the flush request (not the RS).
 - StoreFlusher.commit() still causes an RS abort on exception. This is also debatable: if DFS is alive and we can undo the flush changes, then we should not abort.
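
A minimal sketch of the proposed control flow follows. It is not the HBase code: SketchFlusher, FlushOutcome, FlushSketch, and FlakyFlusher are hypothetical stand-ins that only mirror the prepare / flushCache / commit phases named above, and the in-memory flusher exists purely to simulate one failed write.

```java
import java.io.IOException;

/** Hypothetical stand-in for the three flush phases; not the HBase StoreFlusher interface. */
interface SketchFlusher {
    void prepare();                       // snapshot the memstore; must be safe to call again
    void flushCache() throws IOException; // write the snapshot to a temporary file
    boolean commit() throws IOException;  // move the file into place and drop the snapshot
}

enum FlushOutcome { FLUSHED, FLUSH_ABORTED, SERVER_ABORT }

class FlushSketch {
    static FlushOutcome flush(SketchFlusher flusher) {
        flusher.prepare();
        try {
            flusher.flushCache();
        } catch (IOException e) {
            // Proposed change: abandon only this flush request. The snapshot is kept,
            // so a later flush can retry because prepare() is idempotent.
            return FlushOutcome.FLUSH_ABORTED;
        }
        try {
            flusher.commit();
            return FlushOutcome.FLUSHED;
        } catch (IOException e) {
            // Unchanged in the proposal: a failed commit still aborts the server.
            return FlushOutcome.SERVER_ABORT;
        }
    }
}

/** Toy flusher whose first flushCache() call fails, to exercise the retry path. */
class FlakyFlusher implements SketchFlusher {
    int snapshotsTaken = 0;
    private boolean hasSnapshot = false;
    private boolean failNext = true;

    public void prepare() {
        // Idempotent: reuse the existing snapshot instead of taking a second one.
        if (!hasSnapshot) {
            hasSnapshot = true;
            snapshotsTaken++;
        }
    }

    public void flushCache() throws IOException {
        if (failNext) {
            failNext = false;
            throw new IOException("simulated lease expired error");
        }
    }

    public boolean commit() {
        hasSnapshot = false;
        return true;
    }
}
```

With this toy flusher, the first flush(...) call returns FLUSH_ABORTED and the second returns FLUSHED, while only one snapshot is ever taken. That is the behavior the idempotent prepare() is meant to guarantee.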

logs: 
{code}
org.apache.hadoop.hbase.DroppedSnapshotException: region: loadtest_ha,e6666658,1355820729877.298bcbd550b80507a379fe67eefbe5ea.
	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1485)
	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1364)
	at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:896)
	at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:845)
	at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:119)
	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException:
No lease on /apps/hbase/data/loadtest_ha/298bcbd550b80507a379fe67eefbe5ea/.tmp/5cf8951ee12449ce8e4e6dd0bf1645c2
File is not open for writing. [Lease.  Holder: DFSClient_hb_rs_hrt23n28.cc1.ygridcore.net,60020,1355813552066_203591774_25,
pendingcreates: 1]
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1724)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1707)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:1762)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:1750)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.complete(NameNode.java:779)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:578)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1393)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1389)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1136)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1387)

	at org.apache.hadoop.ipc.Client.call(Client.java:1107)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
	at $Proxy10.complete(Unknown Source)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
	at $Proxy10.complete(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:4087)
	at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3988)
	at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
	at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
	at org.apache.hadoop.hbase.io.hfile.AbstractHFileWriter.finishClose(AbstractHFileWriter.java:255)
	at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.close(HFileWriterV2.java:432)
	at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.close(StoreFile.java:1214)
	at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:762)
	at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:674)
	at org.apache.hadoop.hbase.regionserver.Store.access$400(Store.java:109)
	at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:2286)
	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1460)
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
