hbase-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-6738) Too aggressive task resubmission from the distributed log manager
Date Tue, 09 Apr 2013 07:34:28 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13626349#comment-13626349 ]

Hudson commented on HBASE-6738:

Integrated in HBase-0.94-security #133 (See [https://builds.apache.org/job/HBase-0.94-security/133/])
    HBASE-8276 Backport hbase-6738 to 0.94 "Too aggressive task resubmission from the distributed log manager" (Jeffrey) (Revision 1465161)

     Result = SUCCESS
> Too aggressive task resubmission from the distributed log manager
> -----------------------------------------------------------------
>                 Key: HBASE-6738
>                 URL: https://issues.apache.org/jira/browse/HBASE-6738
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver
>    Affects Versions: 0.94.1, 0.95.2
>         Environment: 3-node test cluster, but it can occur on a much bigger one as well. It's all luck!
>            Reporter: Nicolas Liochon
>            Assignee: Nicolas Liochon
>            Priority: Critical
>             Fix For: 0.95.0
>         Attachments: 6738.v1.patch
> With the default settings: "hbase.splitlog.manager.timeout" => 25s and "hbase.splitlog.max.resubmit" => 3.
> In the tests mentioned in HBASE-5843, I see variations around this scenario, with 0.94 + HDFS
> The regionserver in charge of the split does not answer in less than 25s, so it gets interrupted but actually continues. Sometimes we run out of retries and sometimes we don't; sometimes we are out of retries but, as the interrupts were ignored, we finish nicely. In the meantime, the same single task is executed in parallel by multiple nodes, increasing the probability of race conditions.
> Details:
> t0: unplug a box with DN+RS
> t + x: other boxes are already connected, so their connections start to die. Nevertheless, they don't consider this node as suspect.
> t + 180s: ZooKeeper -> master detects the node as dead. Recovery starts. It can be less than 180s; sometimes it is around 150s.
> t + 180s: distributed split starts. There is only 1 task, and it's immediately acquired by one RS.
> t + 205s: the RS has multiple errors when splitting, because a datanode is missing as well. The master decides to give the task to someone else. But often the task continues in the first RS. Interrupts are often ignored, as is well stated in the code ("// TODO interrupt often gets swallowed, do what else?")
> {code}
>    2012-09-04 18:27:30,404 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
> {code}
> t + 211s: two regionservers are processing the same task. They fight for the leases:
> {code}
> 2012-09-04 18:27:32,004 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: Lease mismatch on
>    /hbase/TABLE/4d1c1a4695b1df8c58d13382b834332e/recovered.edits/0000000000000000037.temp owned by DFSClient_hb_rs_BOX2,60020,1346775882980 but is accessed by DFSClient_hb_rs_BOX1,60020,1346775719125
> {code}
>      They can fight like this for many files, until the tasks finally get interrupted or finished.
>      The task on the second box can be cancelled as well. In this case, the task is created again for a new box.
>      The master seems to stop after 3 attempts. It can also give up on splitting the files. Sometimes the tasks were not cancelled on the RS side, so the split finishes despite what the master thinks and logs. In this case, the assignment starts. Otherwise, it's "we've got a problem".
> {code}
> 2012-09-04 18:43:52,724 INFO org.apache.hadoop.hbase.master.SplitLogManager: Skipping resubmissions of task /hbase/splitlog/hdfs%3A%2F%2FBOX1%3A9000%2Fhbase%2F.logs%2FBOX0%2C60020%2C1346776587640-splitting%2FBOX0%252C60020%252C1346776587640.1346776587832 because threshold 3 reached
> {code}
> t + 300s: split is finished. Assignment starts.
> t + 330s: assignment is finished, regions are available again.
> Many subcases are possible, depending on the number of log files, the number of regionservers and so on.
> The issues are:
> 1) It's difficult to interrupt a task, especially in HBase but not only there. The pattern is often:
> {code}
>  void f() throws IOException {
>    try {
>      // whatever throws InterruptedException
>    } catch (InterruptedException e) {
>      throw new InterruptedIOException();
>    }
>  }
>
>  boolean g() {
>    int nbRetry = 0;
>    for (;;) {
>      try {
>        f();
>        return true;
>      } catch (IOException e) {
>        // InterruptedIOException is an IOException: the interrupt is treated as a plain failure
>        nbRetry++;
>        if (nbRetry > maxRetry) return false;
>      }
>    }
>  }
> {code}
> This typically swallows the interrupt. There are other variations, but this one seems to be the standard.
> Even if we fix this in HBase, we need the other layers to be interruptible as well. That's not proven.
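
For illustration only, here is a minimal sketch of how the g() pattern above could stop swallowing the interrupt. It is not taken from the HBase code base or from the attached patch: the class InterruptAwareRetry, the IoTask interface and runWithRetries are hypothetical names. The sketch assumes that an interrupt surfaces as an InterruptedIOException, as in f() above, and treats SocketTimeoutException (a subclass of InterruptedIOException) as an ordinary, retryable failure.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;

/** Hypothetical sketch: a retry loop that stops as soon as the thread is interrupted. */
final class InterruptAwareRetry {

  /** Hypothetical stand-in for the operation called f() in the pattern above. */
  interface IoTask {
    void run() throws IOException;
  }

  /**
   * Runs the task until it succeeds or more than maxRetry failures occurred.
   * Unlike g() above, an interrupt is propagated instead of being counted
   * as one more failed attempt.
   */
  static boolean runWithRetries(IoTask task, int maxRetry) throws InterruptedIOException {
    int nbRetry = 0;
    for (;;) {
      try {
        task.run();
        return true;
      } catch (IOException e) {
        // SocketTimeoutException extends InterruptedIOException but is a plain timeout.
        boolean interrupted =
            e instanceof InterruptedIOException && !(e instanceof SocketTimeoutException);
        if (interrupted) {
          // Restore the interrupt flag for the caller and stop retrying.
          Thread.currentThread().interrupt();
          throw (InterruptedIOException) e;
        }
        nbRetry++;
        if (nbRetry > maxRetry) {
          return false;
        }
      }
    }
  }
}
```

This only helps if every layer underneath also reports the interrupt, which is exactly the point raised above as unproven.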
> 2) 25s is very aggressive, considering that we have a default timeout of 180s for ZooKeeper. In other words, we give a regionserver 180s before acting, but when it comes to the split, it's only 25s. There may be reasons for this, but it seems dangerous, as during a failure the cluster is less available than during normal operations. We could do something about this, for example:
> => Obvious option: increase the timeout at each try. Something like *2.
> => Also possible: increase the initial timeout
> => check for an update instead of blindly cancelling + resubmitting.
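
As a rough illustration of the first option (doubling the timeout at each try), the sketch below computes the timeout for a given attempt. It is an assumption-laden example, not the fix that was committed: ResubmitTimeout and timeoutForAttempt are hypothetical names, and capping the value at the 180s ZooKeeper timeout is an arbitrary choice made for the example.

```java
/** Hypothetical sketch: resubmission timeout that doubles at each attempt, up to a cap. */
final class ResubmitTimeout {

  /**
   * Returns the timeout to use for the given attempt (0 = first submission).
   * The value doubles at each attempt and never exceeds capMillis.
   */
  static long timeoutForAttempt(long initialMillis, int attempt, long capMillis) {
    long timeout = Math.min(initialMillis, capMillis);
    for (int i = 0; i < attempt && timeout < capMillis; i++) {
      // Compare against cap / 2 first so that doubling can never overflow.
      timeout = (timeout > capMillis / 2) ? capMillis : timeout * 2;
    }
    return timeout;
  }

  public static void main(String[] args) {
    // 25s initial timeout as in the default setting, capped at 180s (assumed cap).
    for (int attempt = 0; attempt <= 3; attempt++) {
      System.out.println("attempt " + attempt + ": "
          + timeoutForAttempt(25_000L, attempt, 180_000L) + " ms");
    }
  }
}
```

With the default 25s and 3 resubmissions, the successive timeouts would be 25s, 50s, 100s and 180s, which gives a slow but live worker much more room before the task is handed to another node.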
> 3) Globally, it seems that this retry mechanism duplicates the failure detection already
in place with ZK. Would it not make sense to just hook into this existing detection mechanism,
and resubmit a task if and only if we detect that the regionserver in charge died? During
a failure scenario we should be much more gentle than during normal operation, not the opposite.
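
The idea in point 3 could be sketched as a simple decision rule: resubmit a task only when the regionserver that owns it is no longer in the set of live servers. The names below (ResubmitPolicy, shouldResubmit, liveServers) are hypothetical; in particular, how the set of live servers would be obtained from ZooKeeper is not shown and is assumed to exist elsewhere.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: resubmit a split task only if its owner is known to be dead. */
final class ResubmitPolicy {

  /**
   * @param owner       name of the regionserver currently in charge of the task
   * @param liveServers servers currently considered alive (assumed to mirror ZK state)
   * @return true if and only if the owner is no longer alive
   */
  static boolean shouldResubmit(String owner, Set<String> liveServers) {
    return !liveServers.contains(owner);
  }

  public static void main(String[] args) {
    Set<String> live = new HashSet<>(Arrays.asList(
        "BOX1,60020,1346775719125", "BOX2,60020,1346775882980"));
    // The owner is slow but alive: leave the task where it is.
    System.out.println(shouldResubmit("BOX1,60020,1346775719125", live));
    // The owner is gone: hand the task to someone else.
    System.out.println(shouldResubmit("BOX0,60020,1346776587640", live));
  }
}
```

Under this rule a slow worker is never raced by a second one, at the price of relying entirely on the existing failure detection to notice a dead worker.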

