From: Bill Havanki
Date: Mon, 10 Feb 2014 10:33:39 -0500
Subject: Re: Randomwalk unbalanced servers - still an issue?
To: dev@accumulo.apache.org

I used the standard agitation intervals. I don't understand enough about the
system yet to ascertain why tablets stayed unbalanced. One possibility is the
timing of the checks and how that interacted with the 15-minute time
allowance and minimum count:

1. The first failure condition occurred at 11:36, starting the 15-minute
   clock.
2. The second failure condition was at the next check 30 minutes later.
3. A rapid succession of checks in the next two minutes pushed the failure
   count up high enough.

It's possible that the tablets became balanced, and then unbalanced again,
between steps 1 and 2, so the time allowance was defeated.

Anyway, I restarted the randomwalk and it ran successfully for over 24 hours
with agitation.

On Sun, Feb 9, 2014 at 7:25 PM, Josh Elser wrote:

> Interesting - I think I might have run into that once a whole bunch of RW
> runs.
>
> I assume you didn't change the agitation intervals from what's in the
> example? The parameters as they stand are, I think, acceptable. Being
> unbalanced for that long doesn't seem right. Did you identify why you were
> unbalanced?
>
> I'm not sure making that configurable is good either as you're now skewing
> one randomwalk test to another (in addition to the variance you already
> have from resources available).
>
> Personally, if you run into this, and you can identify that there was a
> legitimate reason to be unbalanced across that time and those checks, I'd
> be more in favor of just restarting that RW client.
>
> On 2/8/14, 11:50 AM, Bill Havanki wrote:
>
>> While running 1.5.1 rc1 through randomwalk I hit a failure in the
>> Concurrent test due to the tablet servers being "unbalanced". See
>> ACCUMULO-2198 for some background on the last time I ran into this.
>>
>> What is the general feeling on dealing with this failure? Is a 15-minute
>> period too short to wait for balancing, or three consecutive failures too
>> few to allow? I'm using only a 7-node cluster with 5 tservers, maybe an
>> unbalanced condition is more tolerable then?
>>
>> The parameters defining "unbalanced" aren't configurable at the moment,
>> and I'm inclined to file a JIRA to make them so, to shepherd the test
>> through, but I'd love to hear what you think about the importance and
>> proper parameters for this check.
>>
>> Thanks,
>> Bill

-- 
| - - -
| Bill Havanki
| Solutions Architect, Cloudera Government Solutions
| - - -
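
For illustration only, the timing scenario described in the reply (a 15-minute allowance combined with a minimum failure count) can be sketched as below. This is a hypothetical model and not the actual randomwalk check: the class name UnbalancedTracker, the method recordCheck, and the exact rule (fail once the allowance has elapsed since the first unbalanced check and at least three consecutive unbalanced checks have been seen) are all assumptions made for the sketch.

```java
/**
 * Hypothetical sketch of an "unbalanced" failure rule. Not Accumulo code:
 * the names and the exact rule are assumptions for illustration only.
 */
final class UnbalancedTracker {
    private final long allowanceMillis;       // e.g. 15 minutes
    private final int minFailures;            // e.g. 3 consecutive failures
    private long firstUnbalancedMillis = -1L; // -1 means "clock not running"
    private int consecutiveFailures = 0;

    UnbalancedTracker(long allowanceMillis, int minFailures) {
        this.allowanceMillis = allowanceMillis;
        this.minFailures = minFailures;
    }

    /** Records one check and returns true if the test should now fail. */
    boolean recordCheck(boolean unbalanced, long nowMillis) {
        if (!unbalanced) {
            // A balanced observation stops the clock and clears the count.
            firstUnbalancedMillis = -1L;
            consecutiveFailures = 0;
            return false;
        }
        if (firstUnbalancedMillis < 0) {
            firstUnbalancedMillis = nowMillis; // first failure starts the clock
        }
        consecutiveFailures++;
        boolean pastAllowance = nowMillis - firstUnbalancedMillis > allowanceMillis;
        return pastAllowance && consecutiveFailures >= minFailures;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        UnbalancedTracker tracker = new UnbalancedTracker(15 * minute, 3);
        // Timeline from the reply: first failure, next check 30 minutes
        // later, then another check in quick succession.
        System.out.println(tracker.recordCheck(true, 0L));
        System.out.println(tracker.recordCheck(true, 30 * minute));
        System.out.println(tracker.recordCheck(true, 31 * minute));
    }
}
```

Under these assumptions, a tracker like this only sees the state at check time. If the tablets balanced and then became unbalanced again during the 30-minute gap, no balanced check would have been recorded, so the clock started by the first failure would keep running.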