hbase-issues mailing list archives

From "Ashu Pachauri (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HBASE-15001) Thread Safety issues in ReplicationSinkManager and HBaseInterClusterReplicationEndpoint
Date Fri, 18 Dec 2015 19:18:46 GMT

     [ https://issues.apache.org/jira/browse/HBASE-15001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashu Pachauri updated HBASE-15001:
----------------------------------
    Attachment: Test.java
                repro_stuck_replication.diff

Repro_stuck_replication: This is a randomized reproduction of the bug; I have no idea how
to reproduce it deterministically. I let it run in a loop for a few hours last night and
found this in the logs this morning. Replication is stuck on one of the nodes, and it stays
stuck because, once this happens, the sink list is never refreshed:
{code}
2015-12-18 05:08:15,923 WARN  [main-EventThread.replicationSource,testInterClusterReplication.replicationSource.ashu-mbp.dhcp.thefacebook.com%2C53383%2C1450465675871.regiongroup-0,testInterClusterReplication]
regionserver.ReplicationSource$ReplicationSourceWorkerThread(1020): org.apache.hadoop.hbase.replication.TestReplicationEndpoint$InterClusterReplicationEndpointForTest
threw unknown exception:java.lang.IllegalArgumentException: Illegal Capacity: -1
        at java.util.ArrayList.<init>(ArrayList.java:156)
        at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:196)
        at org.apache.hadoop.hbase.replication.TestReplicationEndpoint$InterClusterReplicationEndpointForTest.replicate(TestReplicationEndpoint.java:330)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.shipEdits(ReplicationSource.java:983)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:653)
{code}

Anyway, it hardly needs arguing that operations on a non-thread-safe container must be
synchronized. But, just to be sure, I ran a multithreaded write test on an ArrayList (attached
Test.java) to confirm that it can report a negative size. Here is the output after a few minutes
of running:
{code}
List not empty, it's size is:  -1
List not empty, it's size is:  -1
{code}
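
The attached Test.java is not reproduced in this message. As a rough illustration of that kind of test, the sketch below hammers a shared list from several threads and records the smallest size() any thread observes. The class name, method name, thread count and iteration count are all assumptions made for this example, not taken from the attachment. On a plain ArrayList the result is timing-dependent and may or may not go negative on a given run; on a Collections.synchronizedList wrapper it cannot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; not the Test.java attached to the issue.
class ArrayListRaceSketch {

    /** Adds and removes from several threads; returns the smallest size() observed. */
    static int smallestObservedSize(List<Integer> list, int threads, int iterations) {
        AtomicInteger smallest = new AtomicInteger(Integer.MAX_VALUE);
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                try {
                    start.await(); // release all workers at once to maximize contention
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < iterations; i++) {
                    try {
                        list.add(i);
                        list.remove(0); // remove by index, not by value
                    } catch (RuntimeException e) {
                        // An unsynchronized list may throw here once its state is corrupted.
                    }
                    smallest.accumulateAndGet(list.size(), Math::min);
                }
            });
            workers[t].start();
        }
        start.countDown();
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return smallest.get();
    }

    public static void main(String[] args) {
        List<Integer> unsafe = new ArrayList<>();
        List<Integer> safe = Collections.synchronizedList(new ArrayList<>());
        // The first number varies from run to run; the second is never negative.
        System.out.println("ArrayList:        " + smallestObservedSize(unsafe, 4, 200_000));
        System.out.println("synchronizedList: " + smallestObservedSize(safe, 4, 200_000));
    }
}
```

Because every worker adds before it removes, the synchronized variant always ends empty with a non-negative size, which makes it a convenient control for the unsafe run.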

> Thread Safety issues in ReplicationSinkManager and HBaseInterClusterReplicationEndpoint
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-15001
>                 URL: https://issues.apache.org/jira/browse/HBASE-15001
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.0.0, 1.2.0, 1.3.0, 1.2.1
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>            Priority: Blocker
>             Fix For: 2.0.0, 1.2.0, 1.3.0
>
>         Attachments: HBASE-15001-V0.patch, Test.java, repro_stuck_replication.diff
>
>
> ReplicationSinkManager is not thread-safe. This can cause problems in
> HBaseInterClusterReplicationEndpoint when the WAL provider is multiwal.
> For example:
> 1. When multiple threads report bad sinks, the sink list can be non-empty but report
> a negative size, because the ArrayList itself is not thread-safe.
> 2. HBaseInterClusterReplicationEndpoint depends on the number of sinks to batch edits
> for shipping. However, the following code can end up assuming that there are no batches
> to process: the sink count is non-zero at the first check but has dropped to zero by the
> time we reach the "batching" part.
> {code}
> if (replicationSinkMgr.getSinks().size() == 0) {
>     return false;
> }
> ...
> int n = Math.min(Math.min(this.maxThreads, entries.size()/100+1),
>                replicationSinkMgr.getSinks().size());
> {code}
> This is very dangerous: having (incorrectly) concluded from the value of n that there are
> no batches to process, we report that we replicated successfully when we actually did not
> replicate anything.
> The idea is to make all operations in ReplicationSinkManager thread-safe and to verify
> the size of the replicated edits before we report success.
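
One way to picture the idea in the quoted description is the sketch below: read the sink count exactly once, treat a zero or negative count as "no sinks", and report success only if the number of shipped edits matches the number of entries. This is not the HBASE-15001 patch. SinkCounter, getNumSinks, ship and the use of String for an edit are hypothetical stand-ins introduced for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "read once, verify before success" idea; not HBase code.
class ReplicateSketch {

    /** Stand-in for the sink manager; only the sink count matters here. */
    interface SinkCounter {
        int getNumSinks();
    }

    static boolean replicate(List<String> entries, SinkCounter sinks, int maxThreads) {
        // Read the count once so the emptiness check and the batching agree.
        int numSinks = sinks.getNumSinks();
        int n = Math.min(Math.min(maxThreads, entries.size() / 100 + 1), numSinks);
        if (n <= 0) {
            return false; // no usable sinks (or a corrupted count): do not claim success
        }
        List<List<String>> batches = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            batches.add(new ArrayList<>());
        }
        for (int i = 0; i < entries.size(); i++) {
            batches.get(i % n).add(entries.get(i));
        }
        int shipped = 0;
        for (List<String> batch : batches) {
            shipped += ship(batch);
        }
        // Verify that everything handed in was actually shipped.
        return shipped == entries.size();
    }

    /** Placeholder for the real network call; returns the number of edits shipped. */
    static int ship(List<String> batch) {
        return batch.size();
    }
}
```

With this shape, a sink count that changes after the first read can no longer produce an empty batch list that is then reported as a successful replication.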



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
