accumulo-dev mailing list archives

From Bill Havanki <bhava...@clouderagovt.com>
Subject Re: Randomwalk unbalanced servers - still an issue?
Date Mon, 10 Feb 2014 15:33:39 GMT
I used the standard agitation intervals. I don't understand enough about
the system yet to ascertain why tablets stayed unbalanced. One possibility
is the timing of the checks and how that interacted with the 15-minute time
allowance and minimum count:

1. The first failure condition occurred at 11:36, starting the 15-minute
clock.
2. The second failure condition was at the next check 30 minutes later.
3. A rapid succession of checks in the next two minutes pushed the failure
count up high enough.

It's possible that the tablets became balanced, and then unbalanced again,
between steps 1 and 2, so the time allowance was defeated.
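To make that timing concrete, here is a minimal sketch of how a time allowance and a minimum failure count could interact. It is a hypothetical model, not the actual randomwalk code: the class `BalanceCheck`, the helper `replay`, and the exact thresholds are assumptions for illustration, and the calendar date is arbitrary (only the 11:36 start time comes from the run).

```python
from datetime import datetime, timedelta

ALLOWANCE = timedelta(minutes=15)  # time allowance discussed in this thread
MIN_FAILURES = 3                   # assumed minimum number of consecutive failed checks


class BalanceCheck:
    """Hypothetical model of the unbalanced-servers check."""

    def __init__(self):
        self.first_unbalanced = None  # when the current unbalanced streak began
        self.failures = 0             # consecutive unbalanced observations

    def observe(self, now, balanced):
        """Record one check; return True if the test would fail at this point."""
        if balanced:
            # A balanced observation resets both the clock and the count.
            self.first_unbalanced = None
            self.failures = 0
            return False
        if self.first_unbalanced is None:
            self.first_unbalanced = now
        self.failures += 1
        return (now - self.first_unbalanced > ALLOWANCE
                and self.failures >= MIN_FAILURES)


def replay(observations):
    """Feed (time, balanced) pairs through a fresh check and collect the verdicts."""
    check = BalanceCheck()
    return [check.observe(t, ok) for t, ok in observations]


# Arbitrary date; the check at 11:36 starts the clock.
start = datetime(2014, 1, 1, 11, 36)

# Steps 1-3: one check, a 30-minute gap, then checks in quick succession.
sparse = [
    (start, False),
    (start + timedelta(minutes=30), False),
    (start + timedelta(minutes=31), False),
]
results_sparse = replay(sparse)

# Same run, but a check happens to land while the tablets are balanced.
observed = [
    (start, False),
    (start + timedelta(minutes=10), True),
    (start + timedelta(minutes=30), False),
    (start + timedelta(minutes=31), False),
    (start + timedelta(minutes=32), False),
]
results_observed = replay(observed)
```

In this model the sparse timeline fails on its third check, because the clock has been running since 11:36 and nothing observed the tablets in between. The second timeline never fails, because the balanced observation restarts the clock. A balanced period that falls entirely inside the 30-minute gap would look like the first timeline to the check.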

Anyway, I restarted the randomwalk and it ran successfully for over 24
hours with agitation.


On Sun, Feb 9, 2014 at 7:25 PM, Josh Elser <josh.elser@gmail.com> wrote:

> Interesting - I think I might have run into that once in a whole bunch of RW
> runs.
>
> I assume you didn't change the agitation intervals from what's in the
> example? The parameters as they stand are, I think, acceptable. Being
> unbalanced for that long doesn't seem right. Did you identify why you were
> unbalanced?
>
> I'm not sure making that configurable is good either as you're now skewing
> one randomwalk test to another (in addition to the variance you already
> have from resources available).
>
> Personally, if you run into this, and you can identify that there was a
> legitimate reason to be unbalanced across that time and those checks, I'd
> be more in favor of just restarting that RW client.
>
>
> On 2/8/14, 11:50 AM, Bill Havanki wrote:
>
>> While running 1.5.1 rc1 through randomwalk I hit a failure in the
>> Concurrent test due to the tablet servers being "unbalanced". See
>> ACCUMULO-2198 for some background on the last time I ran into this.
>>
>> What is the general feeling on dealing with this failure? Is a 15-minute
>> period too short to wait for balancing, or three consecutive failures too
>> few to allow? I'm using only a 7-node cluster with 5 tservers; maybe an
>> unbalanced condition is more tolerable then?
>>
>> The parameters defining "unbalanced" aren't configurable at the moment, and
>> I'm inclined to file a JIRA to make them so, to shepherd the test through,
>> but I'd love to hear what you think about the importance and proper
>> parameters for this check.
>>
>> Thanks,
>> Bill
>>
>>


-- 
| - - -
| Bill Havanki
| Solutions Architect, Cloudera Government Solutions
| - - -
