hbase-issues mailing list archives

From "ramkrishna.s.vasudevan (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4855) SplitLogManager hangs on cluster restart.
Date Thu, 24 Nov 2011 16:55:40 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156820#comment-13156820 ]

ramkrishna.s.vasudevan commented on HBASE-4855:
-----------------------------------------------

When the master restarts and finds splitlog nodes that have not been processed, the
SplitLogManager runs handleUnassignedTasks(), which calls
{code}
Task task = findOrCreateOrphanTask(path);
{code}
Inside findOrCreateOrphanTask()
{code}
task = tasks.putIfAbsent(path, orphanTask);
{code}
the orphan task is added to the tasks map.  Later, in splitLogDistributed(), we try to installTask().

There we create the task if it is absent
{code}
Task oldtask = createTaskIfAbsent(path, batch);
{code}
Inside createTaskIfAbsent()
{code}
    oldtask = tasks.putIfAbsent(path, new Task(batch));
    if (oldtask != null && oldtask.isOrphan()) {
        LOG.info("Previously orphan task " + path +
            " is now being waited upon");
        oldtask.setBatch(batch);
        return (null);
    }
{code}
putIfAbsent() returns the task that was already added, so oldtask is not null.
But by then the new Task(batch) argument has already been constructed
{code}
   Task(TaskBatch tb) {
      incarnation = 0;
      last_version = -1;
      deleted = false;
      setBatch(tb);
      setUnassigned();
    }

    public void setBatch(TaskBatch batch) {
      if (batch != null && this.batch != null) {
        LOG.fatal("logic error - batch being overwritten");
      }
      this.batch = batch;
      if (batch != null) {
        batch.installed++;
      }
    }
{code}
The constructor calls setBatch(tb), so batch.installed++ happens once.  Since oldtask is not
null, we then call oldtask.setBatch(batch), which increments batch.installed a second time.

batch.installed therefore ends up one higher than the number of tasks actually installed,
so batch.done can never catch up with it and the while loop keeps looping.
{code}
while ((batch.done + batch.error) != batch.installed) {
{code}
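
To make the double count concrete, here is a minimal stand-alone sketch of the sequence described above. It is not the HBase code: TaskBatch and Task are stripped-down stand-ins, the orphan task is assumed to be a Task with a null batch, isOrphan() is assumed to mean "batch is null", and the path string is made up for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class DoubleInstallDemo {
    // Simplified stand-in for the batch counters discussed above.
    static class TaskBatch {
        int installed = 0;
        int done = 0;
        int error = 0;
    }

    // Simplified stand-in for Task: the constructor calls setBatch(),
    // and setBatch() bumps batch.installed whenever batch is non-null.
    static class Task {
        TaskBatch batch;

        Task(TaskBatch tb) {
            setBatch(tb);
        }

        void setBatch(TaskBatch b) {
            this.batch = b;
            if (b != null) {
                b.installed++;
            }
        }

        // Assumption: an orphan is a task that has no batch yet.
        boolean isOrphan() {
            return batch == null;
        }
    }

    // Replays the restart sequence for one log path and returns batch.installed.
    static int installedAfterRestart() {
        ConcurrentMap<String, Task> tasks = new ConcurrentHashMap<>();
        TaskBatch batch = new TaskBatch();
        String path = "/hbase/splitlog/example-log"; // hypothetical path

        // Step 1: on restart the orphan task is registered first.
        tasks.putIfAbsent(path, new Task(null));

        // Step 2: createTaskIfAbsent(). new Task(batch) is evaluated before
        // putIfAbsent() runs, so installed is incremented even though the
        // new task is thrown away.
        Task oldtask = tasks.putIfAbsent(path, new Task(batch));
        if (oldtask != null && oldtask.isOrphan()) {
            // Step 3: the orphan adopts the batch, incrementing installed again.
            oldtask.setBatch(batch);
        }
        return batch.installed;
    }

    public static void main(String[] args) {
        System.out.println("installed = " + installedAfterRestart());
    }
}
```

With a single log path the sketch ends with installed at 2, while done + error can reach at most 1, which matches the loop condition never becoming false.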

Please correct me if my analysis is wrong.  I am uploading a patch which solved the problem.
Kindly validate the fix.
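
For discussion, one possible way to avoid the double count is sketched below. This is only an illustration under assumptions and is not necessarily what the attached patch does: the Task here has no counting constructor, the batch field is assigned directly, and batch.installed is incremented exactly once per path regardless of whether the task is new or a previously orphaned one. Synchronization is ignored for brevity.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class SingleInstallSketch {
    // Simplified stand-in for the batch counter.
    static class TaskBatch {
        int installed = 0;
    }

    // Simplified stand-in for Task; a null batch marks an orphan (assumption).
    static class Task {
        TaskBatch batch;

        boolean isOrphan() {
            return batch == null;
        }
    }

    // Hypothetical variant of createTaskIfAbsent(): building the candidate
    // task has no side effect on the batch, so each path is counted once.
    static Task createTaskIfAbsent(ConcurrentMap<String, Task> tasks,
                                   String path, TaskBatch batch) {
        Task newtask = new Task();
        newtask.batch = batch;
        Task oldtask = tasks.putIfAbsent(path, newtask);
        if (oldtask == null) {
            // Brand-new task was installed.
            batch.installed++;
            return null;
        }
        if (oldtask.isOrphan()) {
            // Previously orphaned task is now being waited upon.
            oldtask.batch = batch;
            batch.installed++;
            return null;
        }
        // Some other batch already owns this task.
        return oldtask;
    }

    // Replays the restart sequence for one log path and returns batch.installed.
    static int installedAfterRestart() {
        ConcurrentMap<String, Task> tasks = new ConcurrentHashMap<>();
        TaskBatch batch = new TaskBatch();
        String path = "/hbase/splitlog/example-log"; // hypothetical path
        tasks.putIfAbsent(path, new Task()); // orphan registered on restart
        createTaskIfAbsent(tasks, path, batch);
        return batch.installed;
    }

    public static void main(String[] args) {
        System.out.println("installed = " + installedAfterRestart());
    }
}
```

In this variant the same restart sequence leaves installed at 1, so a single completed task is enough for the wait loop to exit.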

                
> SplitLogManager hangs on cluster restart. 
> ------------------------------------------
>
>                 Key: HBASE-4855
>                 URL: https://issues.apache.org/jira/browse/HBASE-4855
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.0
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>
> Start a master and RS
> RS goes down (kill -9)
> Wait for ServerShutDownHandler to create the splitlog nodes. As no RS is there it cannot
> be processed.
> Restart both master and bring up an RS.
> The master hangs in SplitLogManager.waitforTasks().
> I feel that batch.done is not getting incremented properly.  I have not yet dug into it fully.
> This may be the reason for occasional failure of TestDistributedLogSplitting.testWorkerAbort().


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
