hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Nicolas Liochon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10000) Initiate lease recovery for outstanding WAL files at the very beginning of recovery
Date Mon, 09 Dec 2013 09:39:09 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843012#comment-13843012
] 

Nicolas Liochon commented on HBASE-10000:
-----------------------------------------

I reviewed the 0.96 version:

I found only one serious issue for a very specific config. Everything else if nit. Still,
the code is very complex, I spent more than 1 hour reviewing these 10 lines, but I'm not sure
I covered all cases...
bq.         if (nbAttempt == 0 && isFileClosedMeth == null && firstPause >
0) {
if the master initiated the recovery more than 4 seconds ago AND there is not isFileClosed
on the region server or namenode, we will do the second the recoverLease after 1 minute: our
recovery time will always be > 1 minute with this implementation. It's serious, because
may be the file is actually recovered but we don't even check.
With the previous implementation, we would have done the second attempt after 4 seconds.

Nits:

bq.       if (recovered) break;
Would be in the loop above. Basically; we don't really need the 'recovered' variable, we always
return immediately when it's true.

{code}
            if (isFileClosedMeth != null && isFileClosed(dfs, isFileClosedMeth, p))
{
              recovered = true;
              break;
            }
{code}

could be
{code}
            if (isFileClosedMeth != null && isFileClosed(dfs, isFileClosedMeth, p))
{
              return true;
            }
{code}

bq.  // On the first time through wait the short 'firstPause'
The comment does not match the code (isFileClosedMeth and master recovery time are actually
taken into account). We actually do a firstPause only if isFileClosed is not available, and
then we calculate the firstPause time by taking into account the time when the recovery was
started, master included.

I wonder if we should not max the firstPause, for example if the master time is not synchronized
with the region server. Considering the criticallity of this code it would be safer imho.




> Initiate lease recovery for outstanding WAL files at the very beginning of recovery
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-10000
>                 URL: https://issues.apache.org/jira/browse/HBASE-10000
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>             Fix For: 0.98.1
>
>         Attachments: 10000-0.96-v5.txt, 10000-recover-ts-with-pb-2.txt, 10000-recover-ts-with-pb-3.txt,
10000-recover-ts-with-pb-4.txt, 10000-recover-ts-with-pb-5.txt, 10000-v4.txt, 10000-v5.txt,
10000-v6.txt
>
>
> At the beginning of recovery, master can send lease recovery requests concurrently for
outstanding WAL files using a thread pool.
> Each split worker would first check whether the WAL file it processes is closed.
> Thanks to Nicolas Liochon and Jeffery discussion with whom gave rise to this idea. 



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

Mime
View raw message