hbase-issues mailing list archives

From "Nicolas Liochon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8449) Refactor recoverLease retries and pauses informed by findings over in hbase-8389
Date Thu, 23 May 2013 20:23:21 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13665613#comment-13665613 ]

Nicolas Liochon commented on HBASE-8449:
----------------------------------------

Increase the hbase.lease.recovery.timeout default to 15 minutes, i.e. more than a standard
HDFS recovery.
hbase.lease.recovery.dfs.timeout: it should not be less than 10s imho. It's not only a question
of the dfs timeout; the NN also seems not to like multiple calls to recoverLease. I tested
multiple calls again, and the datanode logs were complaining about a
"situation that should never occurs". OK, that was with calls at an interval of 1 second,
but whether it works at all seems to be down to luck.

+   * 1. Call recoverLease.
+   * 2. If it returns true, break.
+   * 3. If it returns false, wait a few seconds and then call it again.
+   * 4. If it returns true, break.
+   * 5. If it returns false, wait for what we think the datanode socket timeout is
+   * (configurable) and then try again.
+   * 6. If it returns true, break.
+   * 7. If it returns false, repeat starting at step 5. above.
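
For reference, steps 1-7 above could be sketched roughly as below. This is not the code
from the patch: StagedLeaseRecovery, LeaseRecoverer and Sleeper are hypothetical names,
LeaseRecoverer stands in for a call such as DistributedFileSystem.recoverLease(Path), and
the overall budget (in the spirit of hbase.lease.recovery.timeout) is an assumption that
is tracked by summing the pauses rather than by reading a clock.

```java
/** Sketch of the staged retry in steps 1-7. All names here are hypothetical. */
final class StagedLeaseRecovery {

  /** Stand-in for a call such as DistributedFileSystem.recoverLease(Path). */
  interface LeaseRecoverer {
    boolean recoverLease() throws Exception;
  }

  /** Stand-in for Thread.sleep, so that the pauses can be observed in a test. */
  interface Sleeper {
    void sleep(long millis) throws InterruptedException;
  }

  /**
   * Returns the number of recoverLease calls made before one returned true,
   * or -1 if the next pause would exceed the overall budget (an assumption,
   * the steps themselves describe no upper bound).
   */
  static int recover(LeaseRecoverer fs, Sleeper sleeper, long firstPauseMs,
      long dfsTimeoutMs, long overallTimeoutMs) throws Exception {
    long waited = 0;
    int calls = 0;
    while (true) {
      calls++;
      if (fs.recoverLease()) {
        return calls; // steps 2, 4 and 6: lease recovered, stop
      }
      // Step 3 uses a short pause after the first call; step 5 onwards waits
      // for what we think the datanode socket timeout is.
      long pause = (calls == 1) ? firstPauseMs : dfsTimeoutMs;
      if (waited + pause > overallTimeoutMs) {
        return -1;
      }
      sleeper.sleep(pause);
      waited += pause;
    }
  }
}
```

A caller would pass Thread::sleep as the Sleeper and wrap the real filesystem call in the
LeaseRecoverer; the pause values are whatever the configuration says, nothing in this
sketch fixes them.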


I would propose:
the master:
   - if HDFS-4754 is there, the master marks the node as stale as the first step of the recovery.
   - the master calls recoverLease as a part of the distributed split. We can enhance it
in another jira to give higher priority to closed wals vs. wals being recovered.

the region server:
    - calls isFileClosed, if it's there. If true, returns
    - calls recoverLease. If true, returns
    - if isFileClosed is available, loops on it with a 1s sleep
    - else loops on recoverLease with a 70s (configurable) sleep
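
A minimal sketch of the region server side of this proposal, under stated assumptions:
RegionServerLeaseRecovery, Wal and Sleeper are hypothetical names, Wal stands in for the
filesystem calls on one wal file (recoverLease, and isFileClosed where the HDFS version
has it), and the overall timeout that ends the loops is an assumption, since the proposal
only gives the sleeps.

```java
/** Sketch of the proposed region server flow. All names here are hypothetical. */
final class RegionServerLeaseRecovery {

  /** Stand-in for the filesystem calls on one wal file. */
  interface Wal {
    boolean recoverLease() throws Exception;
    /** False when the running HDFS version has no isFileClosed. */
    boolean supportsIsFileClosed();
    boolean isFileClosed() throws Exception;
  }

  /** Stand-in for Thread.sleep, so that the pauses can be observed in a test. */
  interface Sleeper {
    void sleep(long millis) throws InterruptedException;
  }

  /** Returns true once the file is closed or the lease is recovered, false on timeout. */
  static boolean recover(Wal wal, Sleeper sleeper, long pollMs,
      long recoverLeasePauseMs, long overallTimeoutMs) throws Exception {
    boolean canPoll = wal.supportsIsFileClosed();
    // Call isFileClosed first, if it's there.
    if (canPoll && wal.isFileClosed()) {
      return true;
    }
    // Then a single recoverLease call.
    if (wal.recoverLease()) {
      return true;
    }
    // Then either poll isFileClosed with a short sleep, or retry recoverLease
    // with a long sleep. The budget is the sum of the sleeps, not wall-clock time.
    long pause = canPoll ? pollMs : recoverLeasePauseMs;
    long waited = 0;
    while (waited + pause <= overallTimeoutMs) {
      sleeper.sleep(pause);
      waited += pause;
      if (canPoll ? wal.isFileClosed() : wal.recoverLease()) {
        return true;
      }
    }
    return false;
  }
}
```

With isFileClosed available, recoverLease is called only once and the NN then sees only
cheap isFileClosed calls, which is the point of preferring that path.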




                
> Refactor recoverLease retries and pauses informed by findings over in hbase-8389
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-8449
>                 URL: https://issues.apache.org/jira/browse/HBASE-8449
>             Project: HBase
>          Issue Type: Bug
>          Components: Filesystem Integration
>    Affects Versions: 0.94.7, 0.95.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.95.1
>
>         Attachments: 8449.txt, 8449v2.txt, 8449v3.txt, 8449v4.txt
>
>
> HBASE-8359 is an interesting issue that roams near and far.  This issue is about making
> use of the findings handily summarized on the end of hbase-8359 which have it that trunk needs
> a refactor around how it does its recoverLease handling (and that the patch committed against
> HBASE-8359 is not what we want going forward).
> This issue is about making a patch that adds a lag between recoverLease invocations where
> the lag is related to dfs timeouts -- the hdfs-side dfs timeout -- and optionally makes use
> of the isFileClosed API if it is available (a facility that is not yet committed to a branch
> near you and unlikely to be within your locality for a good while to come).

