hbase-issues mailing list archives

From "Allan Yang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HBASE-18144) Forward-port the old exclusive row lock; there are scenarios where it performs better
Date Fri, 16 Jun 2017 11:53:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-18144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051701#comment-16051701 ]

Allan Yang edited comment on HBASE-18144 at 6/16/17 11:52 AM:
--------------------------------------------------------------

Hi [~stack], after a lot of debugging and logging, I finally figured out why a disordered batch
causes this situation.
For example (the UT in DisorderedBatchAndIncrementUT.patch):
*handler 1* is doing a batch put of rows (1,2,3,4,5,6,7,8,9). At the same time, *handler 4*
is doing a batch put of the same rows but in reversed order (9,8,7,6,5,4,3,2,1).
1. *handler 1* has got the read locks for rows 1,2,3,4,5,6,7 and is about to try row 8's read lock
2. *handler 4* has got the read locks for rows 9,8,7,6,5,4,3 and is about to try row 2's read lock
3. At the same time, *handler 0* is serving a request to increment row 2. It needs row 2's
write lock, but it has to wait since *handler 1* already holds row 2's read lock
(*handler 0* blocked)
4. Since *handler 0* is waiting for row 2's write lock, *handler 4*'s attempt to get row 2's
read lock has to wait as well (*handler 4* blocked; see the sketch below the code snippet)
5. At the same time, *handler 3* is serving a request to increment row 8. It needs row 8's
write lock, but it has to wait since *handler 4* already holds row 8's read lock
(*handler 3* blocked)
6. Since *handler 3* is waiting for row 8's write lock, *handler 1*'s attempt to get row 8's
read lock has to wait as well (*handler 1* blocked)

At this point, handlers 0, 1, 3 and 4 are all blocked, until one thread times out after rowLockWaitDuration:
{code}
if (!result.getLock().tryLock(this.rowLockWaitDuration, TimeUnit.MILLISECONDS)) {
  if (traceScope != null) {
    traceScope.getSpan().addTimelineAnnotation("Failed to get row lock");
  }
  result = null;
  // Clean up the counts just in case this was the thing keeping the context alive.
  rowLockContext.cleanUp();
  throw new IOException("Timed out waiting for lock for row: " + rowKey);
}
{code}
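
The reason steps 4 and 6 block at all is that once a writer is queued on a row's lock, a later
read-lock attempt on that row normally parks behind it instead of sharing the read lock that is
already held. Below is a minimal standalone sketch (not HBase code; the class and thread names
are illustrative) that reproduces the *handler 0* / *handler 4* interaction on row 2, with a
plain java.util.concurrent ReentrantReadWriteLock standing in for the per-row lock:
{code}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative demo only: row2Lock stands in for row 2's per-row lock.
public class QueuedWriterBlocksReaderDemo {
  public static void main(String[] args) throws Exception {
    final ReentrantReadWriteLock row2Lock = new ReentrantReadWriteLock();

    // "handler 1": already holds row 2's read lock from its batch.
    row2Lock.readLock().lock();

    // "handler 0": the increment queues for row 2's write lock and parks,
    // because a read lock is still held.
    Thread handler0 = new Thread(() -> row2Lock.writeLock().lock(), "handler-0");
    handler0.start();
    Thread.sleep(200); // give the writer time to enqueue

    // "handler 4": its reversed batch now tries row 2's read lock with a timeout,
    // like tryLock(rowLockWaitDuration) above; it parks behind the queued writer
    // (the JDK blocks new readers when a writer heads the wait queue) and times out.
    boolean acquired = row2Lock.readLock().tryLock(1, TimeUnit.SECONDS);
    System.out.println("handler-4 got row 2's read lock: " + acquired); // prints false

    row2Lock.readLock().unlock(); // "handler 1" releases; the chain unwinds
    handler0.join();
  }
}
{code}
With four handlers and two contended rows, this turns into the circular wait above: each
read-lock attempt is parked behind a writer that is itself waiting on a read lock held by the
other batch, so nothing makes progress until the tryLock times out.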

So, if all batches are sorted, every handler acquires row locks in the same order and there will be no such problem!
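
Sorting imposes a single global lock-acquisition order, which is the classic way to prevent this
kind of circular wait. A rough sketch of the idea (not the actual HBASE-17924 patch; the class
and helper name are made up):
{code}
import java.util.Arrays;
import java.util.Comparator;
import org.apache.hadoop.hbase.client.Mutation;
import org.apache.hadoop.hbase.util.Bytes;

public class SortBatchSketch {
  /** Sort the batch by row key so every handler takes row locks in the same order. */
  static void sortByRow(Mutation[] batch) {
    Arrays.sort(batch, new Comparator<Mutation>() {
      @Override
      public int compare(Mutation a, Mutation b) {
        return Bytes.compareTo(a.getRow(), b.getRow());
      }
    });
  }
}
{code}
With both handlers walking rows 1..9 in the same order, *handler 4* can never hold row 8's read
lock while waiting on row 2's, so the cycle cannot form.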

**Why branch-1.1 doesn't have this kind of problem**
Because it simply doesn't wait for the lock!
{code}
// If we haven't got any rows in our batch, we should block to
// get the next one.
boolean shouldBlock = numReadyToWrite == 0;
RowLock rowLock = null;
try {
  rowLock = getRowLockInternal(mutation.getRow(), shouldBlock);
} catch (IOException ioe) {
  LOG.warn("Failed getting lock in batch put, row="
    + Bytes.toStringBinary(mutation.getRow()), ioe);
}
{code}
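
In other words, only the first row of a mini-batch may block; once a handler already holds some
row locks, a failed attempt simply defers that row to a later mini-batch instead of parking the
handler. A rough sketch of that acquisition policy, with an invented helper name (not the actual
branch-1.1 getRowLockInternal):
{code}
import java.util.concurrent.locks.Lock;

public class NonBlockingRowLockSketch {
  /** Returns true if locked; false means "skip this row for now and retry in the next mini-batch". */
  static boolean acquire(Lock rowLock, boolean shouldBlock) throws InterruptedException {
    if (shouldBlock) {
      rowLock.lockInterruptibly(); // nothing locked yet, so it is safe to wait
      return true;
    }
    // Already holding other rows' locks: never park here, or we could end up in
    // the circular wait described above. A failed tryLock just defers the row.
    return rowLock.tryLock();
  }
}
{code}
Because a handler never waits while it already holds other rows' locks, a cycle of waiting
handlers cannot form; the cost is more mini-batches under contention.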

**Conclusion**
1. Commit patch HBASE-17924 to branch-1.2
2. We shouldn't wait for the lock in doMiniBatchMutation (like branch-1.1 did); I will open another
issue to discuss this.



> Forward-port the old exclusive row lock; there are scenarios where it performs better
> -------------------------------------------------------------------------------------
>
>                 Key: HBASE-18144
>                 URL: https://issues.apache.org/jira/browse/HBASE-18144
>             Project: HBase
>          Issue Type: Bug
>          Components: Increment
>    Affects Versions: 1.2.5
>            Reporter: stack
>            Assignee: stack
>             Fix For: 2.0.0, 1.3.2, 1.2.7
>
>         Attachments: DisorderedBatchAndIncrementUT.patch, HBASE-18144.master.001.patch
>
>
> Description to follow.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
