hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "stack (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-6309) [MTTR] Do NN operations outside of the ZK EventThread in SplitLogManager
Date Wed, 04 Jul 2012 07:54:35 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13406345#comment-13406345
] 

stack commented on HBASE-6309:
------------------------------

What about the logic in moveRecoveredEditsFromTemp?  It flags corrupted logs and does some
other cleanup.  Also seems to find recovered.edits files with a .corrupt ending: see ZKSlitLog.isCorruptFlagFile....
That'd need refactoring and a rename from moveRecoveredEditsFromTemp to 'completeLogSplit'
or 'finish'?

Otherwise, looking through HLogSplitting and trying to recall issues we've run into w/ recovered.edits,
I think doing it in place can work.

Would suggest you look at the region open and replay of recovered.edits stuff too to see if
you see any possible issues there (I only went through HLogSplitting).

(That renaming stuff is pretty heavy duty stuff....but I'd have done the same to cordon off
a distributed operation)

Good stuff Chunhui.
                
> [MTTR] Do NN operations outside of the ZK EventThread in SplitLogManager
> ------------------------------------------------------------------------
>
>                 Key: HBASE-6309
>                 URL: https://issues.apache.org/jira/browse/HBASE-6309
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 0.92.1, 0.94.0, 0.96.0
>            Reporter: Jean-Daniel Cryans
>            Priority: Critical
>             Fix For: 0.96.0
>
>
> We found this issue during the leap second cataclysm which prompted a distributed splitting
of all our logs.
> I saw that none of the RS were splitting after some time while the master was showing
that it wasn't even 30% done. jstack'ing I saw this:
> {noformat}
> "main-EventThread" daemon prio=10 tid=0x00007f6ce46d8800 nid=0x5376 in
> Object.wait() [0x00007f6ce2ecb000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:485)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1093)
>         - locked <0x00000005fdd661a0> (a org.apache.hadoop.ipc.Client$Call)
>         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
>         at $Proxy9.rename(Unknown Source)
>         at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>         at $Proxy9.rename(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSClient.rename(DFSClient.java:759)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.rename(DistributedFileSystem.java:253)
>         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.moveRecoveredEditsFromTemp(HLogSplitter.java:553)
>         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.moveRecoveredEditsFromTemp(HLogSplitter.java:519)
>         at org.apache.hadoop.hbase.master.SplitLogManager$1.finish(SplitLogManager.java:138)
>         at org.apache.hadoop.hbase.master.SplitLogManager.getDataSetWatchSuccess(SplitLogManager.java:431)
>         at org.apache.hadoop.hbase.master.SplitLogManager.access$1200(SplitLogManager.java:95)
>         at org.apache.hadoop.hbase.master.SplitLogManager$GetDataAsyncCallback.processResult(SplitLogManager.java:1011)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:571)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
> {noformat}
> We are effectively bottlenecking on doing NN operations and whatever else is happening
in GetDataAsyncCallback. It was so bad that on our 100 offline cluster it took a few hours
for the master to process all the incoming ZK events while the actual splitting took a fraction
of that time.
> I'm marking this as critical and against 0.96 but depending on how involved the fix is
we might want to backport.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message