accumulo-dev mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: Randomwalk unbalanced servers - still an issue?
Date Thu, 13 Feb 2014 19:09:16 GMT
Ah, ok. I get it now.

I'm all for having more metrics/stats/info stored inside Accumulo
(I mean, c'mon, we already have the distributed, reliable db part). I'm
not sure what all we would want to store, but it sounds like a good idea
to me to think about storage and query requirements and build an API for it.
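
To make the "keep history and query it" idea concrete, here is a rough
sketch of what such a check could look like. Everything in it is
hypothetical: BalanceHistory, record(), unbalancedSince() and
unbalancedTooLong() are names made up for illustration, not an existing
Accumulo API, and it keeps samples in memory rather than in a table.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

/** Hypothetical sketch only; none of these names exist in Accumulo. */
final class BalanceHistory {
  /** One observation: when it was taken and whether servers looked balanced. */
  private static final class Sample {
    final long timeMillis;
    final boolean balanced;

    Sample(long timeMillis, boolean balanced) {
      this.timeMillis = timeMillis;
      this.balanced = balanced;
    }
  }

  private final Deque<Sample> samples = new ArrayDeque<>();
  private final long retentionMillis;

  BalanceHistory(long retentionMillis) {
    this.retentionMillis = retentionMillis;
  }

  /** Appends an observation and drops samples older than the retention window. */
  void record(long timeMillis, boolean balanced) {
    samples.addLast(new Sample(timeMillis, balanced));
    while (samples.peekFirst().timeMillis < timeMillis - retentionMillis) {
      samples.removeFirst();
    }
  }

  /**
   * Returns the timestamp of the oldest sample in the current uninterrupted
   * unbalanced run, or -1 if the newest sample is balanced or none exist.
   */
  long unbalancedSince() {
    long since = -1;
    Iterator<Sample> newestFirst = samples.descendingIterator();
    while (newestFirst.hasNext()) {
      Sample s = newestFirst.next();
      if (s.balanced) {
        break; // a balanced sample ends the run
      }
      since = s.timeMillis;
    }
    return since;
  }

  /** True if the current unbalanced run has lasted longer than the allowance. */
  boolean unbalancedTooLong(long nowMillis, long allowanceMillis) {
    long since = unbalancedSince();
    return since >= 0 && nowMillis - since > allowanceMillis;
  }
}
```

The point of the sketch is that a balanced sample recorded between two
test checks ends the run, so a test asking unbalancedTooLong(now,
15 minutes) would not be fooled by two unrelated unbalanced moments. The
answer is only as good as the sampling interval and the retention window.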

On 2/13/14, 1:54 PM, Bill Havanki wrote:
> Part of what I did in ACCUMULO-2198 had the potential to increase failures.
> From the commit message:
>
> In addition, the test logic would reset the timestamp every time servers
> were found unbalanced, provided the 15-minute allowance hadn't expired.
> This commit fixes that issue as well. This could lead to more, correct,
> reports of unbalanced servers.
>
>
> I've been wondering if we don't just need a more robust way of determining
> that servers have been unbalanced for too long, even so far as drawing
> historical information from the servers themselves instead of relying on
> random points in time in the test.
>
>
> On Thu, Feb 13, 2014 at 1:46 PM, Josh Elser <josh.elser@gmail.com> wrote:
>
>> Nifty - I was just running a single Concurrent client and got hit by the
>> unbalanced exception.
>>
>> Makes me wonder if something changed from December/early January when I
>> was running the 1.6 tests much more heavily.
>>
>>
>> On 2/10/14, 11:33 AM, Josh Elser wrote:
>>
>>> On 2/10/14, 10:33 AM, Bill Havanki wrote:
>>>
>>>> I used the standard agitation intervals. I don't understand enough about
>>>> the system yet to ascertain why tablets stayed unbalanced. One possibility
>>>> is the timing of the checks and how that interacted with the 15-minute
>>>> time allowance and minimum count:
>>>>
>>>> 1. The first failure condition occurred at 11:36, starting the 15-minute
>>>> clock.
>>>> 2. The second failure condition was at the next check 30 minutes later.
>>>> 3. A rapid succession of checks in the next two minutes pushed the
>>>> failure count up high enough.
>>>>
>>>> It's possible that the tablets became balanced, and then unbalanced again,
>>>> between steps 1 and 2, so the time allowance was defeated.
>>>>
>>>
>>> Precisely. You could easily have gotten "bad luck" and had some splits
>>> right before one of these balance checks which pushed you out of
>>> balance. Diagnosing the "why" here is definitely an annoyance but good
>>> to do to make sure you didn't stumble on a bug. Typically cross-ref'ing
>>> the RW logs to the master log is sufficient to figure out what was
>>> happening.
>>>
>>>> Anyway, I restarted the randomwalk and it ran successfully for over 24
>>>> hours with agitation.
>>>>
>>>>
>>>> On Sun, Feb 9, 2014 at 7:25 PM, Josh Elser <josh.elser@gmail.com> wrote:
>>>>
>>>>> Interesting - I think I might have run into that once in a whole bunch
>>>>> of RW runs.
>>>>>
>>>>> I assume you didn't change the agitation intervals from what's in the
>>>>> example? The parameters as they stand are, I think, acceptable. Being
>>>>> unbalanced for that long doesn't seem right. Did you identify why you
>>>>> were unbalanced?
>>>>>
>>>>> I'm not sure making that configurable is good either as you're now
>>>>> skewing one randomwalk test to another (in addition to the variance you
>>>>> already have from resources available).
>>>>>
>>>>> Personally, if you run into this, and you can identify that there was a
>>>>> legitimate reason to be unbalanced across that time and those checks,
>>>>> I'd be more in favor of just restarting that RW client.
>>>>>
>>>>>
>>>>> On 2/8/14, 11:50 AM, Bill Havanki wrote:
>>>>>
>>>>>> While running 1.5.1 rc1 through randomwalk I hit a failure in the
>>>>>> Concurrent test due to the tablet servers being "unbalanced". See
>>>>>> ACCUMULO-2198 for some background on the last time I ran into this.
>>>>>>
>>>>>> What is the general feeling on dealing with this failure? Is a
>>>>>> 15-minute period too short to wait for balancing, or three consecutive
>>>>>> failures too few to allow? I'm using only a 7-node cluster with 5
>>>>>> tservers; maybe an unbalanced condition is more tolerable then?
>>>>>>
>>>>>> The parameters defining "unbalanced" aren't configurable at the moment,
>>>>>> and I'm inclined to file a JIRA to make them so, to shepherd the test
>>>>>> through, but I'd love to hear what you think about the importance and
>>>>>> proper parameters for this check.
>>>>>>
>>>>>> Thanks,
>>>>>> Bill
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>
>
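
For anyone following the quoted thread, the check as it is described there
(a clock started by the first unbalanced result, a 15-minute allowance,
and a minimum of three consecutive failures) can be sketched roughly as
below. This is an illustration built only from the description in the
thread, not the actual randomwalk code; UnbalancedCheck and observe() are
made-up names, and the exact comparison rules are assumptions.

```java
/** Hypothetical sketch of the check described in the thread; not Accumulo code. */
final class UnbalancedCheck {
  private final long allowanceMillis;
  private final int minFailures;

  private long firstUnbalancedMillis = -1; // -1 means the clock is not running
  private int consecutiveFailures = 0;

  UnbalancedCheck(long allowanceMillis, int minFailures) {
    this.allowanceMillis = allowanceMillis;
    this.minFailures = minFailures;
  }

  /**
   * Feeds in one check result. Returns true when the test should fail:
   * enough consecutive unbalanced results AND the allowance has expired.
   */
  boolean observe(long nowMillis, boolean balanced) {
    if (balanced) {
      // A balanced result clears both the clock and the count.
      firstUnbalancedMillis = -1;
      consecutiveFailures = 0;
      return false;
    }
    if (firstUnbalancedMillis < 0) {
      // Start the clock once; later unbalanced results do not reset it.
      firstUnbalancedMillis = nowMillis;
    }
    consecutiveFailures++;
    return consecutiveFailures >= minFailures
        && nowMillis - firstUnbalancedMillis > allowanceMillis;
  }
}
```

Feeding it the timeline from the thread (an unbalanced result at 11:36,
another 30 minutes later, then a third shortly after) makes the third call
return true even if the servers were balanced for most of the gap, because
the check only sees the moments it samples.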
