hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thomas Jungblut (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-11681) HBase gives up retries after being throttled by Azure storage
Date Fri, 06 Mar 2015 17:13:39 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-11681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350588#comment-14350588
] 

Thomas Jungblut commented on HADOOP-11681:
------------------------------------------

[~onpduo] you are right- I was mistaking it with deletes of containers vs. blobs. 

Can you please shed some light on why we archive old WALs? I would assume we can just queue
them up for deletion and delete them at a rate that doesn't cause throttling while it was
splitting logs. 

I know this might be a very "localized" solution to HBase, do you see any better fix than
just changing the retry backoffs?

> HBase gives up retries after being throttled by Azure storage
> -------------------------------------------------------------
>
>                 Key: HADOOP-11681
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11681
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools
>            Reporter: Duo Xu
>            Assignee: Duo Xu
>         Attachments: HADOOP-11681.01.patch
>
>
> One of our customers' production HBase clusters was periodically throttled by Azure storage,
when HBase was archiving old WALs. HMaster aborted the region server and tried to restart
it.
> However, since the cluster was still being throttled by Azure storage, the upcoming distributed
log splitting also failed. Sometimes hbase:meta table was on this region server and finally
showed offline, which cause the whole cluster in bad state.
> {code}
> 2015-03-01 18:36:45,623 ERROR org.apache.hadoop.hbase.master.HMaster: Region server workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044
reported a fatal error:
> ABORTING region server workernode4.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845421044:
IOE in log roller
> Cause:
> org.apache.hadoop.fs.azure.AzureException: com.microsoft.windowsazure.storage.StorageException:
The server is busy.
> 	at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
> 	at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2367)
> 	at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1960)
> 	at org.apache.hadoop.hbase.util.FSUtils.renameAndSetModifyTime(FSUtils.java:1719)
> 	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.archiveLogFile(FSHLog.java:798)
> 	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.cleanOldLogs(FSHLog.java:656)
> 	at org.apache.hadoop.hbase.regionserver.wal.FSHLog.rollWriter(FSHLog.java:593)
> 	at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:97)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is busy.
> 	at com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
> 	at com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
> 	at com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
> 	at com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
> 	at org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
> 	at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
> 	... 8 more
> 2015-03-01 18:43:29,072 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable
while processing event M_META_SERVER_SHUTDOWN
> java.io.IOException: failed log splitting for workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1424845307901,
will retry
> 	at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:71)
> 	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.fs.azure.AzureException: com.microsoft.windowsazure.storage.StorageException:
The server is busy.
> 	at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2446)
> 	at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem$FolderRenamePending.execute(NativeAzureFileSystem.java:393)
> 	at org.apache.hadoop.fs.azurenative.NativeAzureFileSystem.rename(NativeAzureFileSystem.java:1973)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.getLogDirs(MasterFileSystem.java:319)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:406)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:302)
> 	at org.apache.hadoop.hbase.master.MasterFileSystem.splitMetaLog(MasterFileSystem.java:293)
> 	at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:64)
> 	... 4 more
> Caused by: com.microsoft.windowsazure.storage.StorageException: The server is busy.
> 	at com.microsoft.windowsazure.storage.StorageException.translateException(StorageException.java:163)
> 	at com.microsoft.windowsazure.storage.core.StorageRequest.materializeException(StorageRequest.java:306)
> 	at com.microsoft.windowsazure.storage.core.ExecutionEngine.executeWithRetry(ExecutionEngine.java:229)
> 	at com.microsoft.windowsazure.storage.blob.CloudBlob.startCopyFromBlob(CloudBlob.java:762)
> 	at org.apache.hadoop.fs.azurenative.StorageInterfaceImpl$CloudBlobWrapperImpl.startCopyFromBlob(StorageInterfaceImpl.java:350)
> 	at org.apache.hadoop.fs.azurenative.AzureNativeFileSystemStore.rename(AzureNativeFileSystemStore.java:2439)
> 	... 11 more
> Sun Mar 01 18:59:51 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@aa93ac7,
org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException:
Region hbase:meta,,1 is not online on workernode13.hbaseproddb4001.f5.internal.cloudapp.net,60020,1425235081338
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:2676)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:4095)
> 	at org.apache.hadoop.hbase.regionserver.HRegionServer.scan(HRegionServer.java:3076)
> 	at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:28861)
> 	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2008)
> 	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:92)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
> 	at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> When archiving old WALs, WASB will do rename operation by copying src blob to destination
blob and deleting the src blob. Copy blob is very costly in Azure storage and during Azure
storage gc, it will be highly likely throttled. The throttling by Azure storage usually ends
within 15mins. Current WASB retry policy is exponential retry, but only last at most for 2min.
Short term fix will be adding a more intensive exponential retry when copy blob is throttled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message