Mailing-List: contact dev-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@accumulo.apache.org
Received-SPF: pass (athena.apache.org: local policy includes SPF record at
 spf.trusted-forwarder.org)
MIME-Version: 1.0
In-Reply-To: <52F81C7E.5040207@gmail.com>
References: 
 <CAD-fFUK8sBC-CaiWxWi0R-+QJE76xinHk=K+h69wcMBrp_YVFQ@mail.gmail.com>
 <52F81C7E.5040207@gmail.com>
From: Bill Havanki <bhavanki@clouderagovt.com>
Date: Mon, 10 Feb 2014 10:33:39 -0500
Message-ID: 
 <CAD-fFU+70+DZ1hg4MjgsUmdUsg6DdXhRbfxS-gv3SU55wE2mcw@mail.gmail.com>
Subject: Re: Randomwalk unbalanced servers - still an issue?
To: dev@accumulo.apache.org
Content-Type: multipart/alternative; boundary=001a11c1c5608c162e04f20f11c7

--001a11c1c5608c162e04f20f11c7
Content-Type: text/plain; charset=ISO-8859-1

I used the standard agitation intervals. I don't understand enough about
the system yet to ascertain why tablets stayed unbalanced. One possibility
is the timing of the checks and how that interacted with the 15-minute time
allowance and minimum count:

1. The first failure condition occurred at 11:36, starting the 15-minute
clock.
2. The second failure condition was at the next check 30 minutes later.
3. A rapid succession of checks in the next two minutes pushed the failure
count high enough to fail the test.

It's possible that the tablets became balanced, and then unbalanced again,
between steps 1 and 2, so the time allowance was defeated.
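
To make the scenario concrete, here is a minimal sketch of the kind of
check I have in mind. It is not the actual randomwalk code: the class,
the method names, and the count threshold of three are my assumptions,
and only the 15-minute allowance comes from the discussion above. Times
are in minutes after the first unbalanced check.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ALLOWANCE_MINUTES = 15    # time allowance discussed in this thread
MIN_UNBALANCED_COUNT = 3  # assumed threshold, not taken from the test source


@dataclass
class BalanceCheck:
    """Hypothetical model of the check; state only changes when a check runs."""
    first_unbalanced: Optional[float] = None
    count: int = 0

    def observe(self, minute: float, unbalanced: bool) -> bool:
        """Record one check and return True if the test would fail."""
        if not unbalanced:
            # A balanced observation resets both the clock and the count.
            self.first_unbalanced = None
            self.count = 0
            return False
        if self.first_unbalanced is None:
            self.first_unbalanced = minute  # start the clock
        self.count += 1
        elapsed = minute - self.first_unbalanced
        return elapsed > ALLOWANCE_MINUTES and self.count >= MIN_UNBALANCED_COUNT


def run(checks: List[Tuple[float, bool]]) -> List[bool]:
    """Feed (minute, unbalanced) observations through a fresh checker."""
    checker = BalanceCheck()
    return [checker.observe(minute, unbalanced) for minute, unbalanced in checks]


# Checks at minutes 0, 30 and 31, all unbalanced: the third one fails.
print(run([(0, True), (30, True), (31, True)]))
# Same checks, plus one that happens to see a balanced cluster at minute 10.
print(run([(0, True), (10, False), (30, True), (31, True)]))
```

In the first run nothing observes the cluster between minutes 0 and 30,
so any rebalancing in that gap never resets the clock. In the second
run the single balanced observation resets it and the later checks do
not fail.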

Anyway, I restarted the randomwalk and it ran successfully for over 24
hours with agitation.


On Sun, Feb 9, 2014 at 7:25 PM, Josh Elser <josh.elser@gmail.com> wrote:

> Interesting - I think I might have run into that once in a whole bunch of
> RW runs.
>
> I assume you didn't change the agitation intervals from what's in the
> example? The parameters as they stand are, I think, acceptable. Being
> unbalanced for that long doesn't seem right. Did you identify why you were
> unbalanced?
>
> I'm not sure making that configurable is good either as you're now skewing
> one randomwalk test to another (in addition to the variance you already
> have from resources available).
>
> Personally, if you run into this, and you can identify that there was a
> legitimate reason to be unbalanced across that time and those checks, I'd
> be more in favor of just restarting that RW client.
>
>
> On 2/8/14, 11:50 AM, Bill Havanki wrote:
>
>> While running 1.5.1 rc1 through randomwalk I hit a failure in the
>> Concurrent test due to the tablet servers being "unbalanced". See
>> ACCUMULO-2198 for some background on the last time I ran into this.
>>
>> What is the general feeling on dealing with this failure? Is a 15-minute
>> period too short to wait for balancing, or three consecutive failures too
>> few to allow? I'm using only a 7-node cluster with 5 tservers; maybe an
>> unbalanced condition is more tolerable then?
>>
>> The parameters defining "unbalanced" aren't configurable at the moment, and
>> I'm inclined to file a JIRA to make them so, to shepherd the test through,
>> but I'd love to hear what you think about the importance and proper
>> parameters for this check.
>>
>> Thanks,
>> Bill
>>
>>


-- 
| - - -
| Bill Havanki
| Solutions Architect, Cloudera Government Solutions
| - - -

--001a11c1c5608c162e04f20f11c7--