accumulo-dev mailing list archives

From Bill Havanki <>
Subject Re: Randomwalk unbalanced servers - still an issue?
Date Thu, 13 Feb 2014 18:54:03 GMT
Part of what I did in ACCUMULO-2198 had the potential to increase failures.
From the commit message:

In addition, the test logic would reset the timestamp every time servers
were found unbalanced, provided the 15-minute allowance hadn't expired.
This commit fixes that issue as well. This could lead to more, correct,
reports of unbalanced
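
A minimal sketch of the timestamp handling the commit message describes, assuming the check fails only when both the 15-minute allowance has expired and at least three consecutive unbalanced checks have been seen. The class `UnbalancedTracker`, its method `record`, and the exact comparison operators are hypothetical and are not taken from the Accumulo randomwalk code.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical tracker for illustration; not the actual randomwalk test code. */
public class UnbalancedTracker {
    // Values mentioned in the thread: 15-minute allowance, three consecutive failures.
    static final long ALLOWANCE_MS = TimeUnit.MINUTES.toMillis(15);
    static final int MAX_FAILURES = 3;

    private long firstUnbalancedMs = -1; // -1 means the last check was balanced
    private int failures = 0;

    /** Records one balance check; returns true if the test should report a failure. */
    public boolean record(boolean balanced, long nowMs) {
        if (balanced) {
            // A balanced check clears both the clock and the failure count.
            firstUnbalancedMs = -1;
            failures = 0;
            return false;
        }
        if (firstUnbalancedMs < 0) {
            // Start the clock only on the first unbalanced check of a run.
            // Reassigning it on every unbalanced check would be the reset
            // behavior that the commit message says was fixed.
            firstUnbalancedMs = nowMs;
        }
        failures++;
        return failures >= MAX_FAILURES && nowMs - firstUnbalancedMs > ALLOWANCE_MS;
    }
}
```

Because the clock is no longer pushed forward by later unbalanced checks, a run of unbalanced checks that spans more than the allowance is reported, which is consistent with seeing more reports after the fix.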

I've been wondering whether we need a more robust way of determining
that servers have been unbalanced for too long, even going so far as to
draw historical information from the servers themselves instead of relying
on random points in time in the test.
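
One way such a history-based check might look, as a sketch only: given time-ordered balance observations, compute the longest continuous unbalanced stretch and compare it against the allowance. Where the observations would come from (for example, the master) is left open; `BalanceHistory`, `Sample`, and `longestUnbalancedMs` are hypothetical names, not existing Accumulo APIs.

```java
import java.util.List;

/** Hypothetical helper for illustration; the source of the samples is an assumption. */
public class BalanceHistory {

    /** One observation of cluster balance at a point in time. */
    public static final class Sample {
        final long timeMs;
        final boolean balanced;

        public Sample(long timeMs, boolean balanced) {
            this.timeMs = timeMs;
            this.balanced = balanced;
        }
    }

    /**
     * Returns the longest span, in milliseconds, between the first and last
     * sample of any run of consecutive unbalanced samples. The samples must
     * be ordered by time.
     */
    public static long longestUnbalancedMs(List<Sample> samples) {
        long longest = 0;
        long runStart = -1; // -1 means no unbalanced run is in progress
        for (Sample s : samples) {
            if (s.balanced) {
                runStart = -1; // a balanced sample ends the current run
            } else {
                if (runStart < 0) {
                    runStart = s.timeMs;
                }
                longest = Math.max(longest, s.timeMs - runStart);
            }
        }
        return longest;
    }
}
```

With samples taken often enough, a brief return to balance between two widely spaced test checks would end the run, so the test would not count the whole interval as unbalanced.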

On Thu, Feb 13, 2014 at 1:46 PM, Josh Elser <> wrote:

> Nifty - I was just running a single Concurrent client and got hit by the
> unbalanced exception.
> Makes me wonder if something changed from December/early January when I
> was running the 1.6 tests much more heavily.
> On 2/10/14, 11:33 AM, Josh Elser wrote:
>> On 2/10/14, 10:33 AM, Bill Havanki wrote:
>>> I used the standard agitation intervals. I don't understand enough about
>>> the system yet to ascertain why tablets stayed unbalanced. One possibility
>>> is the timing of the checks and how that interacted with the 15-minute
>>> time allowance and minimum count:
>>> 1. The first failure condition occurred at 11:36, starting the 15-minute
>>> clock.
>>> 2. The second failure condition was at the next check 30 minutes later.
>>> 3. A rapid succession of checks in the next two minutes pushed the failure
>>> count up high enough.
>>> It's possible that the tablets became balanced, and then unbalanced again,
>>> between steps 1 and 2, so the time allowance was defeated.
>> Precisely. You could easily have gotten "bad luck" and had some splits
>> right before one of these balance checks which pushed you out of
>> balance. Diagnosing the "why" here is definitely an annoyance but good
>> to do to make sure you didn't stumble on a bug. Typically cross-ref'ing
>> the RW logs to the master log is sufficient to figure out what was
>> happening.
>>> Anyway, I restarted the randomwalk and it ran successfully for over 24
>>> hours with agitation.
>>> On Sun, Feb 9, 2014 at 7:25 PM, Josh Elser <> wrote:
>>>> Interesting - I think I might have run into that once in a whole bunch
>>>> of RW runs.
>>>> I assume you didn't change the agitation intervals from what's in the
>>>> example? The parameters as they stand are, I think, acceptable. Being
>>>> unbalanced for that long doesn't seem right. Did you identify why you
>>>> were unbalanced?
>>>> I'm not sure making that configurable is good either as you're now
>>>> skewing one randomwalk test to another (in addition to the variance you
>>>> already have from resources available).
>>>> Personally, if you run into this, and you can identify that there was a
>>>> legitimate reason to be unbalanced across that time and those checks,
>>>> I'd be more in favor of just restarting that RW client.
>>>> On 2/8/14, 11:50 AM, Bill Havanki wrote:
>>>>> While running 1.5.1 rc1 through randomwalk I hit a failure in the
>>>>> Concurrent test due to the tablet servers being "unbalanced". See
>>>>> ACCUMULO-2198 for some background on the last time I ran into this.
>>>>> What is the general feeling on dealing with this failure? Is a
>>>>> 15-minute period too short to wait for balancing, or three consecutive
>>>>> failures too few to allow? I'm using only a 7-node cluster with 5
>>>>> tservers, maybe an unbalanced condition is more tolerable then?
>>>>> The parameters defining "unbalanced" aren't configurable at the moment,
>>>>> and I'm inclined to file a JIRA to make them so, to shepherd the test
>>>>> through, but I'd love to hear what you think about the importance and
>>>>> proper parameters for this check.
>>>>> Thanks,
>>>>> Bill

| - - -
| Bill Havanki
| Solutions Architect, Cloudera Government Solutions
| - - -
