From: Michael Moss
Date: Fri, 9 Sep 2016 10:44:44 -0400
Subject: Re: 1 of 20 TServers unresponsive/slow, all writes fail?
To: user@accumulo.apache.org
Cc: Michael Moss

1.7.2 (client still 1.6.2).

I think it's an overall design issue, no? Serving metadata is a SPOF?

On Fri, Sep 9, 2016 at 10:41 AM, Christopher <ctubbsii@apache.org> wrote:

> What version of Accumulo? Could narrow down the search for known issue
> potentials.
>
> On Fri, Sep 9, 2016 at 10:36 AM Michael Moss <michael.moss@gmail.com>
> wrote:
>
>> Upon further internal discussion, it looks like the metadata/root tables
>> are served from the tservers (not an HA master for example) and the one in
>> question was serving it. It was unable to run MajC (compaction) for many
>> hours leading up to the time where it couldn't service requests any longer,
>> but it was still up, hosting tablets, just very slow or unable to respond.
>> So all writes ended up timing out.
>>
>> If this condition is possible and there is a SPOF here, it'd be good to
>> see what's on the roadmap to address it.
>>
>> On Fri, Sep 9, 2016 at 10:24 AM, <dlmarion@comcast.net> wrote:
>>
>>> What was happening on that 1 tserver? Was it in garbage collection? Was
>>> it having network or O/S issues?
>>>
>>> ------------------------------
>>> *From: *"Michael Moss (BLOOMBERG/ 731 LEX)" <mmoss19@bloomberg.net>
>>> *To: *user@accumulo.apache.org
>>> *Sent: *Friday, September 9, 2016 9:40:42 AM
>>> *Subject: *1 of 20 TServers unresponsive/slow, all writes fail?
>>>
>>>
>>> Hi,
>>>
>>> We are starting to investigate an issue where 1 tserver was up, but
>>> became slow/unresponsive for several hours, yet all writes to our 20+
>>> servers began to fail. We could see leading up to the failure that the
>>> writes were distributed among all of the tablet servers, so it wasn't a
>>> hotspot. Whenever we receive a MutationsRejectedException, we recreate the
>>> BatchWriter (ACCUMULO-2990). I'm digging into the TabletServerBatchWriter
>>> code, but any ideas what could cause this issue? Is there some sort of
>>> initialization or health checking that the client does where 1 server could
>>> impact all?
>>>
>>> Thanks.
>>>
>>> -Mike
>>>
>>> Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [pnj-bvlt-r4n03.abc.com:31113]
>>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
>>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
>>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
>>>     at
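The original message describes recreating the BatchWriter whenever a MutationsRejectedException is received. A minimal sketch of that pattern with a bounded number of attempts follows. It is self-contained and deliberately does not use the Accumulo client API: `Writer`, `RejectedException`, `RetryingWriter`, and `flakyFactory` are hypothetical stand-ins invented for illustration (roughly playing the roles of a BatchWriter, MutationsRejectedException, and a factory that creates a new writer), and the simulated failure is an assumption, not a reproduction of the reported timeout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a rejected-write exception; NOT the Accumulo class.
class RejectedException extends Exception {
    RejectedException(String message) { super(message); }
}

// Hypothetical stand-in for a batch writer; NOT the Accumulo interface.
interface Writer {
    void write(String mutation) throws RejectedException;
    void close();
}

// Discards and recreates the underlying writer after each rejection,
// giving up once maxAttempts writes have failed for the same mutation.
class RetryingWriter {
    private final Supplier<Writer> factory;
    private final int maxAttempts;
    private Writer current;
    private int recreations = 0;

    RetryingWriter(Supplier<Writer> factory, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.factory = factory;
        this.maxAttempts = maxAttempts;
        this.current = factory.get();
    }

    void write(String mutation) throws RejectedException {
        RejectedException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                current.write(mutation);
                return;
            } catch (RejectedException e) {
                last = e;
                // A writer that has rejected a write is closed and replaced, not reused.
                current.close();
                current = factory.get();
                recreations++;
            }
        }
        // Every attempt failed: surface the last error instead of looping forever.
        throw last;
    }

    int recreations() { return recreations; }
}

class BatchWriterRetrySketch {
    // Builds writers that reject the first `failures` writes overall, then succeed.
    static Supplier<Writer> flakyFactory(int failures, List<String> sink) {
        int[] remaining = {failures};
        return () -> new Writer() {
            @Override
            public void write(String mutation) throws RejectedException {
                if (remaining[0] > 0) {
                    remaining[0]--;
                    throw new RejectedException("Servers timed out (simulated)");
                }
                sink.add(mutation);
            }

            @Override
            public void close() { }
        };
    }

    public static void main(String[] args) throws RejectedException {
        List<String> sink = new ArrayList<>();
        RetryingWriter writer = new RetryingWriter(flakyFactory(2, sink), 3);
        writer.write("row1");
        System.out.println(sink + " written after " + writer.recreations() + " recreations");
    }
}
```

The attempt limit matters for the situation in this thread: if the cause of the rejection persists, as with a server that stays slow for hours, a fresh writer fails in the same way, so the sketch gives up and rethrows the last error rather than recreating writers indefinitely.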