hbase-issues mailing list archives

From "Jonathan Hsieh (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-10079) Increments lost after flush
Date Thu, 05 Dec 2013 05:10:39 GMT

    [ https://issues.apache.org/jira/browse/HBASE-10079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839827#comment-13839827 ]

Jonathan Hsieh commented on HBASE-10079:
----------------------------------------

{code}
  /**
   * Check that the object does not exist already. There are two reasons for creating the objects
   * only once:
   * 1) With 100K regions, the table names take ~20MB.
   * 2) Equals becomes much faster as it's resolved with a reference and an int comparison.
   */
01  private static TableName createTableNameIfNecessary(ByteBuffer bns, ByteBuffer qns) {
02    for (TableName tn : tableCache) {
03      if (Bytes.equals(tn.getQualifier(), qns) && Bytes.equals(tn.getNamespace(), bns)) {
04        return tn;
05      }
06    }
07
08    TableName newTable = new TableName(bns, qns);
09    if (tableCache.add(newTable)) {  // Adds the specified element if it is not already present
10      return newTable;
11    } else {
12      // Someone else added it. Let's find it.
13      for (TableName tn : tableCache) {
14        if (Bytes.equals(tn.getQualifier(), qns) && Bytes.equals(tn.getNamespace(), bns)) {
15          return tn;
16        }
17      }
18    }
19
20    throw new IllegalStateException(newTable + " was supposed to be in the cache");
21  }
{code}

Here's the race:

We have two concurrent calls to createTableNameIfNecessary for the same namespace (which gets
wrapped and becomes bns) and table qualifier (which gets wrapped and becomes qns): ns=default
and tn=test in my rig's case.

Thread one executes to line 08.  bns and qns are consumed by the gets in the TableName(BB,BB)
constructor.
Thread two executes to line 08.  bns and qns are consumed by the gets in the TableName(BB,BB)
constructor.
Thread two's tableCache.add returns true at line 09, and it returns newTable at line 10.
Thread one's tableCache.add returns false at line 09 since Thread two's TableName made it in.
It jumps to the else branch and continues executing at line 12.
At line 14, Thread one's first Bytes.equals call compares the byte[] from tn.getQualifier
against qns (which is a consumed BB, and thus has no more data on get).  This will essentially
always fail.
Thread one loops through, falls out, and then throws the IllegalStateException.

So anytime we get to line 14, we'll fail.  
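
To make the "consumed BB" point concrete, here is a minimal sketch using only java.nio. It is not the HBase code: drain and equalsRemaining are hypothetical stand-ins for the constructor's extraction and for the comparison at line 14, and the class name is made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ConsumedBufferDemo {
    // Hypothetical extraction step: copies the remaining bytes with a relative
    // get, which advances the buffer's position up to its limit.
    static byte[] drain(ByteBuffer bb) {
        byte[] out = new byte[bb.remaining()];
        bb.get(out);
        return out;
    }

    // Hypothetical comparison of a byte[] against whatever is still readable in bb.
    // ByteBuffer.equals compares only the remaining elements of both buffers.
    static boolean equalsRemaining(byte[] a, ByteBuffer bb) {
        return ByteBuffer.wrap(a).equals(bb);
    }

    public static void main(String[] args) {
        byte[] cached = "test".getBytes(StandardCharsets.UTF_8);
        ByteBuffer qns = ByteBuffer.wrap("test".getBytes(StandardCharsets.UTF_8));

        System.out.println(equalsRemaining(cached, qns)); // true: nothing consumed yet
        drain(qns);                                       // what the constructor's gets do
        System.out.println(qns.remaining());              // 0: the buffer is consumed
        System.out.println(equalsRemaining(cached, qns)); // false: nothing left to compare
    }
}
```

Once the buffer has been drained, any later comparison against it sees zero remaining bytes, which is why the second lookup loop cannot find the entry.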

The solution is to make sure the constructor dups bns and qns before extracting the byte[]s.
Patch coming.
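
A rough sketch of the dup-before-extract idea, assuming the constructor copies bytes out with relative gets. This is not the attached patch: the Name class and copyOf helper are hypothetical, and only ByteBuffer.duplicate() (which shares content but keeps an independent position and limit) is a real JDK call.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class DupBeforeExtract {
    // Hypothetical stand-in for an object built from two caller-owned buffers.
    static final class Name {
        final byte[] namespace;
        final byte[] qualifier;

        Name(ByteBuffer bns, ByteBuffer qns) {
            this.namespace = copyOf(bns);
            this.qualifier = copyOf(qns);
        }
    }

    // Reads from a duplicate so the caller's position is left untouched.
    static byte[] copyOf(ByteBuffer bb) {
        ByteBuffer dup = bb.duplicate();
        byte[] out = new byte[dup.remaining()];
        dup.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer bns = ByteBuffer.wrap("default".getBytes(StandardCharsets.UTF_8));
        ByteBuffer qns = ByteBuffer.wrap("test".getBytes(StandardCharsets.UTF_8));

        Name n = new Name(bns, qns);

        // The caller's buffers are still readable after construction,
        // so a later comparison against qns or bns can still succeed.
        System.out.println(new String(n.qualifier, StandardCharsets.UTF_8));
        System.out.println(qns.remaining());
        System.out.println(bns.remaining());
    }
}
```

With this shape, the lookup loop that runs after a failed add would still have unconsumed buffers to compare against.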



> Increments lost after flush 
> ----------------------------
>
>                 Key: HBASE-10079
>                 URL: https://issues.apache.org/jira/browse/HBASE-10079
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.96.1
>            Reporter: Jonathan Hsieh
>            Assignee: Jonathan Hsieh
>            Priority: Blocker
>             Fix For: 0.98.0, 0.96.1, 0.99.0
>
>         Attachments: 10079.v1.patch
>
>
> Testing 0.96.1rc1.
> With one process incrementing a row in a table, we increment a single col.  We flush or do kills/kill -9 and data is lost.  Flush and kill are likely the same problem (kill would flush); kill -9 may or may not have the same root cause.
> 5 nodes
> hadoop 2.1.0 (a pre cdh5b1 hdfs).
> hbase 0.96.1 rc1
> Test: 250000 increments on a single row and a single col with various numbers of client threads (IncrementBlaster).  Verify we have a count of 250000 after the run (IncrementVerifier).
> Run 1: No fault injection.  5 runs.  count = 250000 on multiple runs.  Correctness verified.  1638 inc/s throughput.
> Run 2: flushes table with incrementing row.  count = 246875 != 250000.  Correctness failed.  1517 inc/s throughput.
> Run 3: kill of rs hosting incremented row.  count = 243750 != 250000.  Correctness failed.  1451 inc/s throughput.
> Run 4: one kill -9 of rs hosting incremented row.  count = 246878 != 250000.  Correctness failed.  1395 inc/s (including recovery).



--
This message was sent by Atlassian JIRA
(v6.1#6144)
