accumulo-dev mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Randomwalk unbalanced servers - still an issue?
Date Thu, 13 Feb 2014 18:46:59 GMT
Nifty - I was just running a single Concurrent client and got hit by the 
unbalanced exception.

Makes me wonder if something changed since December/early January, when I 
was running the 1.6 tests much more heavily.

On 2/10/14, 11:33 AM, Josh Elser wrote:
> On 2/10/14, 10:33 AM, Bill Havanki wrote:
>> I used the standard agitation intervals. I don't understand enough about
>> the system yet to ascertain why tablets stayed unbalanced. One possibility
>> is the timing of the checks and how that interacted with the 15-minute
>> time allowance and minimum count:
>>
>> 1. The first failure condition occurred at 11:36, starting the 15-minute
>>    clock.
>> 2. The second failure condition was at the next check 30 minutes later.
>> 3. A rapid succession of checks in the next two minutes pushed the
>>    failure count up high enough.
>>
>> It's possible that the tablets became balanced, and then unbalanced
>> again, between steps 1 and 2, so the time allowance was defeated.
>
> Precisely. You could easily have gotten "bad luck" and had some splits
> right before one of these balance checks, which pushed you out of
> balance. Diagnosing the "why" here is definitely an annoyance but good
> to do to make sure you didn't stumble on a bug. Typically cross-ref'ing
> the RW logs to the master log is sufficient to figure out what was
> happening.
>
>> Anyway, I restarted the randomwalk and it ran successfully for over 24
>> hours with agitation.
>>
>>
>> On Sun, Feb 9, 2014 at 7:25 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>
>>> Interesting - I think I might have run into that once in a whole bunch
>>> of RW runs.
>>>
>>> I assume you didn't change the agitation intervals from what's in the
>>> example? The parameters as they stand are, I think, acceptable. Being
>>> unbalanced for that long doesn't seem right. Did you identify why you
>>> were unbalanced?
>>>
>>> I'm not sure making that configurable is good either, as you're now
>>> skewing one randomwalk test relative to another (in addition to the
>>> variance you already have from resources available).
>>>
>>> Personally, if you run into this, and you can identify that there was a
>>> legitimate reason to be unbalanced across that time and those checks,
>>> I'd be more in favor of just restarting that RW client.
>>>
>>>
>>> On 2/8/14, 11:50 AM, Bill Havanki wrote:
>>>
>>>> While running 1.5.1 rc1 through randomwalk I hit a failure in the
>>>> Concurrent test due to the tablet servers being "unbalanced". See
>>>> ACCUMULO-2198 for some background on the last time I ran into this.
>>>>
>>>> What is the general feeling on dealing with this failure? Is a
>>>> 15-minute period too short to wait for balancing, or three consecutive
>>>> failures too few to allow? I'm using only a 7-node cluster with
>>>> 5 tservers; maybe an unbalanced condition is more tolerable then?
>>>>
>>>> The parameters defining "unbalanced" aren't configurable at the moment,
>>>> and I'm inclined to file a JIRA to make them so, to shepherd the test
>>>> through, but I'd love to hear what you think about the importance and
>>>> proper parameters for this check.
>>>>
>>>> Thanks,
>>>> Bill
>>>>
>>>>
>>
>>
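
The interaction Bill describes between the 15-minute allowance and the minimum failure count can be sketched roughly as below. This is only an illustration of the reasoning in the thread, not the actual randomwalk code: the class name, the method names, and the reset-on-balanced behavior are assumptions, and the two constants are simply the figures quoted above (15 minutes, three failures).

```python
from dataclasses import dataclass
from typing import Optional

# Figures quoted in the thread; the real test's constants may differ.
ALLOWANCE_SECONDS = 15 * 60
MIN_FAILURES = 3


@dataclass
class UnbalancedTracker:
    """Hypothetical tracker, not the actual randomwalk implementation."""
    first_failure: Optional[float] = None  # start of the current unbalanced streak
    failures: int = 0                      # unbalanced checks seen in this streak

    def check(self, now: float, unbalanced: bool) -> bool:
        """Record one balance check; return True if the test should fail."""
        if not unbalanced:
            # Assumed behavior: an observed balanced check resets the streak.
            self.first_failure = None
            self.failures = 0
            return False
        if self.first_failure is None:
            self.first_failure = now  # starts the 15-minute clock
        self.failures += 1
        return (self.failures >= MIN_FAILURES
                and now - self.first_failure > ALLOWANCE_SECONDS)


def replay(checks):
    """Feed (time_in_seconds, unbalanced) pairs through a fresh tracker."""
    tracker = UnbalancedTracker()
    return [tracker.check(t, u) for t, u in checks]


# Bill's timeline: unbalanced at t=0, again 30 minutes later, then once more
# a minute after that. No balanced check is ever observed in between.
sparse = replay([(0, True), (1800, True), (1860, True)])

# Same cluster history, but with an extra check at 15 minutes that happens
# to see a balanced state, so the clock restarts at t=1800.
observed = replay([(0, True), (900, False), (1800, True),
                   (1860, True), (1920, True)])
```

Under these assumptions the sparse schedule trips the failure on its third check even if the tablets were balanced for most of the 30-minute gap, while the schedule that happens to observe the balanced state does not fail. That is the "time allowance was defeated" case: the check only knows what it sampled.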
