hbase-issues mailing list archives

From "Ashu Pachauri (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-15001) Thread Safety issues in ReplicationSinkManager and HBaseInterClusterReplicationEndpoint
Date Fri, 18 Dec 2015 01:22:46 GMT

    [ https://issues.apache.org/jira/browse/HBASE-15001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063238#comment-15063238 ]

Ashu Pachauri commented on HBASE-15001:
---------------------------------------

[~tedyu] I am sorry, I did not notice that you were asking about the race condition in point 2 of
my description. I had not reproduced it in unit tests. Now that I try to reproduce it, it is not
as bad as I thought: it just errors out on
{code}
entryLists.get(Math.abs(Bytes.hashCode(e.getKey().getEncodedRegionName())%n)).add(e);
{code}
with a division-by-zero error, because n is zero by that point.
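
To make the failure mode concrete, here is a minimal, self-contained sketch. It is not HBase code: the class BatchIndexSketch and its methods are hypothetical, and a plain int stands in for Bytes.hashCode(...). It only shows that when the sink count used to compute n is zero, the modulo in the line above throws ArithmeticException.

```java
final class BatchIndexSketch {
    // Hypothetical stand-in for the batch-count computation quoted in the
    // issue description: min(min(maxThreads, entries/100 + 1), sinkCount).
    static int batchCount(int maxThreads, int entryCount, int sinkCount) {
        return Math.min(Math.min(maxThreads, entryCount / 100 + 1), sinkCount);
    }

    // Same shape as the failing expression: abs(hash % n).
    // Integer remainder by zero throws ArithmeticException in Java.
    static int bucketFor(int hash, int n) {
        return Math.abs(hash % n);
    }

    public static void main(String[] args) {
        // Sink count was non-zero at the earlier check but is zero here.
        int n = batchCount(10, 250, 0);
        try {
            bucketFor(12345, n);
        } catch (ArithmeticException e) {
            System.out.println("n == " + n + ", batching failed: " + e);
        }
    }
}
```

In this sketch the exception is at least visible to the caller; the silent case described in point 2 of the description is the one where no batches are formed and success is still reported.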

The other race conditions are in ReplicationSinkPeer and cannot be reproduced deterministically
(as I don't have control over race conditions in an ArrayList). I just introduced a randomly
failing sink and ran it with enough WALs and edits to replicate, and
I was able to reproduce the ArrayList reporting its size as negative, which then stays that way
indefinitely. I need to run somewhere right now, but I will post the code for this reproduction
later tonight.

> Thread Safety issues in ReplicationSinkManager and HBaseInterClusterReplicationEndpoint
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-15001
>                 URL: https://issues.apache.org/jira/browse/HBASE-15001
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 2.0.0, 1.2.0, 1.3.0, 1.2.1
>            Reporter: Ashu Pachauri
>            Assignee: Ashu Pachauri
>            Priority: Blocker
>             Fix For: 2.0.0, 1.2.0, 1.3.0
>
>         Attachments: HBASE-15001-V0.patch
>
>
> ReplicationSinkManager is not thread-safe. This can cause problems in HBaseInterClusterReplicationEndpoint
> when the WAL provider is multiwal.
> For example: 
> 1. When multiple threads report bad sinks, the sink list can be non-empty but report
> a negative size because the ArrayList itself is not thread-safe.
> 2. HBaseInterClusterReplicationEndpoint depends on the number of sinks to batch edits
> for shipping. However, it's quite possible that the following code makes it assume that there
> are no batches to process (the sink size is non-zero at the check, but by the time we reach
> the "batching" part, the sink size has become zero):
> {code}
> if (replicationSinkMgr.getSinks().size() == 0) {
>     return false;
> }
> ...
> int n = Math.min(Math.min(this.maxThreads, entries.size()/100+1),
>                replicationSinkMgr.getSinks().size());
> {code}
> This is very dangerous because, (incorrectly) assuming there are no batches to process based on
> the value of n, we can report that we replicated successfully while we actually did not
> replicate anything.
> The idea is to make all operations in ReplicationSinkManager thread-safe and to verify
> the size of the replicated edits before we report success.
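
The idea stated in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch: the class SinkSnapshotSketch, its methods, the use of CopyOnWriteArrayList, and the String stand-ins for sinks and WAL entries are all hypothetical. It reads the sink count once so the check and the batching agree, and it reports success only if the number of shipped entries matches the number requested.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class SinkSnapshotSketch {
    // Thread-safe list standing in for the sink list (an assumption; the
    // real fix may use synchronization inside ReplicationSinkManager instead).
    private final List<String> sinks = new CopyOnWriteArrayList<>();

    void addSink(String sink) { sinks.add(sink); }

    void reportBadSink(String sink) { sinks.remove(sink); }

    /** Returns true only if every entry was shipped. */
    boolean replicate(List<String> entries, int maxThreads) {
        // Read the size once; later steps use this snapshot, so a sink
        // removed concurrently cannot turn n into zero after the check.
        int sinkCount = sinks.size();
        if (sinkCount == 0) {
            return false;
        }
        int n = Math.min(Math.min(maxThreads, entries.size() / 100 + 1), sinkCount);

        int shipped = 0;
        for (int batch = 0; batch < n; batch++) {
            // Placeholder for shipping one batch: entries batch, batch+n, ...
            for (int i = batch; i < entries.size(); i += n) {
                shipped++;
            }
        }
        // Verify the replicated count before reporting success.
        return shipped == entries.size();
    }
}
```

The final comparison is the important part of the sketch: even if n were computed as zero for some other reason, no entries would be counted as shipped and the method would return false instead of reporting a successful replication.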



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
