hbase-dev mailing list archives

From Devaraj Das <d...@hortonworks.com>
Subject Re: Problem with IntegrationTestRegionReplicaReplication
Date Fri, 16 Jun 2017 05:09:16 GMT
Peter, do have a look at IntegrationTestRegionReplicaReplication.java. The ways to specify
the options are documented at the top of the file. You need to add something like -DIntegrationTestRegionReplicaReplication.read_delay_ms
From: Josh Elser <josh.elser@gmail.com>
Sent: Thursday, June 15, 2017 10:40 AM
To: dev@hbase.apache.org
Subject: Re: Problem with IntegrationTestRegionReplicaReplication

I'd start by trying read_delay_ms=60000, region_replication=2,
num_keys_per_server=5000, num_regions_per_server=5, with maybe tens of
reader and writer threads.

Again, this can be quite dependent on the kind of hardware you have.
You'll definitely have to tweak ;)
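
For reference, here is a minimal Java sketch of how the starting values above might be turned into -D options. It assumes that every option takes the same "IntegrationTestRegionReplicaReplication." prefix as the read_delay_ms example in this thread; that is an assumption, so check the documentation at the top of the test file. The class and method names (ReplicaTestOptions, toSystemProperties, suggestedStartingPoint) are hypothetical, and the thread counts are left out because their option names are not given in this thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of HBase.
class ReplicaTestOptions {
  // Prefix copied from the -D example in this thread; assumed to apply to all options.
  static final String PREFIX = "IntegrationTestRegionReplicaReplication.";

  /** Renders each option as -D<prefix><name>=<value>, separated by spaces. */
  static String toSystemProperties(Map<String, String> options) {
    StringJoiner joined = new StringJoiner(" ");
    for (Map.Entry<String, String> option : options.entrySet()) {
      joined.add("-D" + PREFIX + option.getKey() + "=" + option.getValue());
    }
    return joined.toString();
  }

  /** The starting values suggested in this thread; expect to tune them per cluster. */
  static Map<String, String> suggestedStartingPoint() {
    Map<String, String> options = new LinkedHashMap<>();
    options.put("read_delay_ms", "60000");
    options.put("region_replication", "2");
    options.put("num_keys_per_server", "5000");
    options.put("num_regions_per_server", "5");
    return options;
  }

  public static void main(String[] args) {
    System.out.println(toSystemProperties(suggestedStartingPoint()));
  }
}
```

The printed string is only the option list; how it is passed to the test runner depends on how the integration test is launched.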

On 6/15/17 4:44 AM, Peter Somogyi wrote:
> Thanks Josh and Devaraj!
> I will try to increase the timeouts. Devaraj, could you share the
> parameters you used for this test which worked?
> On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das <ddas@hortonworks.com> wrote:
>> That sounds about right, Josh. Peter, in our internal testing we have seen
>> this test fail, and increasing the timeouts (look at the timeout-related
>> options in the test code) helped quite a bit.
>> ________________________________________
>> From: Josh Elser <josh.elser@gmail.com>
>> Sent: Wednesday, June 14, 2017 3:17 PM
>> To: dev@hbase.apache.org
>> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
>> On 6/14/17 3:53 AM, Peter Somogyi wrote:
>>> Hi,
>>> As one of my first tasks with HBase I started to look into
>>> why IntegrationTestRegionReplicaReplication fails. I would like to get some
>>> suggestions from you.
>>> I noticed that when I run the test using a normal cluster or a minicluster
>>> I get the same error messages: "Error checking data for key [null], no data
>>> returned". I looked into the code and here are my conclusions.
>>> There are multiple threads writing data in parallel, and that data is read
>>> by multiple reader threads simultaneously. Each writer gets a portion of
>>> the keys to write (e.g. 0-2000) and these keys are added to a
>>> ConstantDelayQueue.
>>> The reader threads get the elements (e.g. key=1000) from the queue and
>>> these reader threads assume that all the keys up to this one are already
>>> in the database. Since we're using multiple writers it can happen that
>>> another thread has not yet written key=500, and verifying these keys will
>>> cause the test failure.
>>> Do you think my assumption is correct?
>> Hi Peter,
>> No, as my memory serves, this is not correct. Readers are not made aware
>> of keys to verify until the write occurs plus some delay. The delay is
>> used to provide enough time for the internal region replication to take
>> effect.
>> So: primary-write, pause, [region replication happens in background],
>> add updated key to read queue, reader gets key from queue and verifies
>> the value on a replica.
>> The primary should always have seen the new value for a key. If the test
>> is showing that a replica does not see the result, it's either a timing
>> issue (you need to give a larger delay for HBase to perform the region
>> replication) or a bug in the region replication framework itself. That
>> said, if you can show that you are seeing what you describe, that sounds
>> like the test framework itself is broken :)
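
The sequence Josh describes (primary write, pause, add key to read queue, reader verifies) can be sketched in plain Java with the JDK's java.util.concurrent.DelayQueue. This is only an illustration under stated assumptions: it is not the HBase ConstantDelayQueue or the test's actual code, an in-memory map stands in for both the primary and the replica, and all names (DelayedReadSketch, DelayedKey, writeThenEnqueue, verifyNext) are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; not the HBase implementation.
class DelayedReadSketch {

  /** A key that readers can only take from the queue after a fixed delay. */
  static final class DelayedKey implements Delayed {
    final long key;
    final long readyAtNanos;

    DelayedKey(long key, long delayMs) {
      this.key = key;
      this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMs);
    }

    @Override
    public long getDelay(TimeUnit unit) {
      return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
      return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
    }
  }

  // Stand-in for the table. A real test writes to the primary and reads from a replica.
  static final Map<Long, String> store = new ConcurrentHashMap<>();
  static final DelayQueue<DelayedKey> readQueue = new DelayQueue<>();

  /** Writer side: the write completes first, then the key is queued with a delay. */
  static void writeThenEnqueue(long key, long readDelayMs) {
    store.put(key, "value-" + key);
    readQueue.add(new DelayedKey(key, readDelayMs));
  }

  /** Reader side: blocks until some key's delay has elapsed, then checks its value. */
  static boolean verifyNext() {
    try {
      DelayedKey next = readQueue.take();
      return ("value-" + next.key).equals(store.get(next.key));
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for a key", e);
    }
  }

  public static void main(String[] args) {
    long start = System.nanoTime();
    writeThenEnqueue(1000L, 200L);
    boolean verified = verifyNext();
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    System.out.println("verified=" + verified + " elapsedMs=" + elapsedMs);
  }
}
```

In this sketch a reader never sees a key before its own write has finished and the delay has passed, which is why a missing value on a replica would point to too short a delay or to a replication problem rather than to ordering between writers.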
