From: Tim I
Date: Fri, 5 Aug 2016 19:54:01 -0400
Subject: Re: Persistent outstanding migrations message
To: user@accumulo.apache.org
In-Reply-To: <57A4053C.2020605@gmail.com>

That was a good idea, Josh. IIRC, it was your post from 2015 where I found
out about bouncing the Master because of possible old bugs.

I checked the logs and found this:

18 16:53:55,535 [zookeeper.DistributedReadWriteLock] INFO : Added lock entry 1 userData 667becf32c0fe544 lockType READ
18 16:53:55,536 [tableOps.Utils] INFO : namespace +default (667becf32c0fe544) locked for read operation: COMPACT_CANCEL
18 16:53:55,542 [zookeeper.DistributedReadWriteLock] INFO : Added lock entry 0 userData 667becf32c0fe544 lockType READ
18 16:53:55,543 [tableOps.Utils] INFO : table 19 (667becf32c0fe544) locked for read operation: COMPACT_CANCEL

I can't find a record of the lock in ZooKeeper either.

Will try to experiment more on Monday.

I want to see if I can clear the logs, then wait for the migration warning,
and finally repeat after deleting that outstanding FATE operation (which
does not seem to be tied to anything).

Thanks!

Tim

On Thu, Aug 4, 2016 at 11:17 PM, Josh Elser wrote:

> FWIW, migrations that never go away have been a symptom of bugs in the
> Master before. The master gets into a state where it either stops
> processing migrations or it doesn't realize that there is a migration to
> process. You might be able to grep over the Master log and find information
> about migrations. Sorry I don't have anything more specific.
>
> The lock without a FATE op also seems problematic, but might be unrelated
> to the migration? You might be able to find more information in the master
> log about that FATE transaction ID.
>
> Michael Wall wrote:
>
>> Are you currently experiencing 1 outstanding migration? Does it go away
>> on its own? Unless servers are going down, tablets will migrate when
>> their split threshold is reached. Is it possible you are constantly
>> splitting a table?
>>
>> If all the tservers appear to be in good shape, maybe it is an issue
>> with the master. What does the jstack look like for that?
>>
>> On Thu, Aug 4, 2016 at 12:06 PM, Tim I wrote:
>>
>>     Hi Mike,
>>
>>     Thanks for the direction.
>>
>>     Empty result set from the scan you suggested.
>>
>>     There was a lock without an associated FATE operation.
>>
>>         The following locks did not have an associated FATE operation
>>         txid: 667becf32c0fe544  locked: [R:+default]
>>
>>     No recoveries stuck currently, and no long running scans.
>>
>>     Otherwise, the system seems fine.
>>
>>     Is it possible this is just benign? Should we monitor for locks
>>     that don't have FATE operations and delete them from time to time?
>>
>>     Thanks,
>>
>>     Tim
>>
>>     On Thu, Aug 4, 2016 at 11:44 AM, Michael Wall wrote:
>>
>>         Hi Tim,
>>
>>         You can try scanning the metadata table for a future colfam.
>>         Something like
>>
>>         scan -t accumulo.metadata -c fut
>>
>>         If you find one, look at the tabletserver that is slated to host
>>         that tablet. There could be an issue with that server
>>         preventing assignment from completing. Get a jstack and save
>>         the logs so you can further troubleshoot. Killing that tserver
>>         will cause the assignment to go elsewhere, but make sure you get
>>         as much info as you can before killing it.
>>
>>         What else is going on with the system? Do you have any
>>         recoveries that are stuck? Are there any fate transactions that
>>         have been running for a while? Any long running scans?
>>
>>         HTH
>>
>>         Mike
>>
>>         On Thu, Aug 4, 2016 at 11:04 AM, Tim I wrote:
>>
>>             Hi all,
>>
>>             We're running accumulo 1.6.5
>>
>>             One of the issues we're seeing on a consistent basis is this
>>             message:
>>
>>                 "Not balancing due to 1 outstanding migrations".
>>
>>             Is there a simple way to see the number of outstanding
>>             migrations? Based on what we've read and experienced, it
>>             eventually means we have to bounce the master to get things
>>             to a better state, however the message comes back within
>>             about 1 hour.
>>
>>             Any thoughts and suggestions would be greatly appreciated.
>>
>>             Thanks,
>>
>>             Tim
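Josh's suggestion in the thread, to grep the Master log for migrations and for the FATE transaction ID, could be sketched in Python roughly as follows. This is only a minimal sketch: it assumes the log layout matches the four lines Tim quoted, and the names `LOG_LINE`, `find_entries`, and `SAMPLE` are hypothetical helpers for illustration, not part of Accumulo.

```python
import re
from typing import Dict, Iterable, List

# Assumed layout, taken from the lines quoted in the thread:
#   "18 16:53:55,535 [tableOps.Utils] INFO : message text"
LOG_LINE = re.compile(
    r"^(?P<day>\d+) (?P<time>\d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<logger>[^\]]+)\] (?P<level>\w+)\s*: (?P<message>.*)$"
)


def find_entries(lines: Iterable[str], needles: Iterable[str]) -> List[Dict[str, str]]:
    """Return parsed log entries whose message contains any needle (case-insensitive)."""
    lowered = [needle.lower() for needle in needles]
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        message = match.group("message").lower()
        if any(needle in message for needle in lowered):
            hits.append(match.groupdict())
    return hits


# The four log lines quoted in the thread, used here as sample input.
SAMPLE = [
    "18 16:53:55,535 [zookeeper.DistributedReadWriteLock] INFO : Added lock entry 1 userData 667becf32c0fe544 lockType READ",
    "18 16:53:55,536 [tableOps.Utils] INFO : namespace +default (667becf32c0fe544) locked for read operation: COMPACT_CANCEL",
    "18 16:53:55,542 [zookeeper.DistributedReadWriteLock] INFO : Added lock entry 0 userData 667becf32c0fe544 lockType READ",
    "18 16:53:55,543 [tableOps.Utils] INFO : table 19 (667becf32c0fe544) locked for read operation: COMPACT_CANCEL",
]

if __name__ == "__main__":
    # Search for the orphaned lock's transaction ID and for migration messages.
    for entry in find_entries(SAMPLE, ["667becf32c0fe544", "migration"]):
        print(entry["time"], entry["logger"], entry["message"])
```

To run it against a real log, the `SAMPLE` list could be replaced with an open file handle, since `find_entries` accepts any iterable of lines. Lines that do not match the assumed layout, such as stack traces, are skipped rather than reported.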