Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
Delivered-To: mailing list user@accumulo.apache.org
MIME-Version: 1.0
References: <57D2BBDA005E0484003902FA_0_27570@msclnjpmsgsv02> <805160347.15893084.1473431074977.JavaMail.zimbra@comcast.net>
From: Christopher
Date: Fri, 09 Sep 2016 14:41:56 +0000
Subject: Re: 1 of 20 TServers unresponsive/slow, all writes fail?
To: user@accumulo.apache.org
Cc: Michael Moss
Content-Type: multipart/alternative; boundary=001a1140fd9e8c02da053c142603
archived-at: Fri, 09 Sep 2016 14:42:10 -0000

--001a1140fd9e8c02da053c142603
Content-Type: text/plain; charset=UTF-8

What version of Accumulo? Could narrow down the search for known issue
potentials.

On Fri, Sep 9, 2016 at 10:36 AM Michael Moss <michael.moss@gmail.com> wrote:

> Upon further internal discussion, it looks like the metadata/root tables
> are served from the tservers (not an HA master for example) and the one in
> question was serving it. It was unable to run MajC (compaction) for many
> hours leading up to the time where it couldn't service requests any longer,
> but it was still up, hosting tablets, just very slow or unable to respond.
> So all writes ended up timing out.
>
> If this condition is possible and there is a SPOF here, it'd be good to
> see what's on the roadmap to address it.
>
> On Fri, Sep 9, 2016 at 10:24 AM, <dlmarion@comcast.net> wrote:
>
>> What was happening on that 1 tserver? Was it in garbage collection? Was
>> it having network or O/S issues?
>>
>> ------------------------------
>> *From: *"Michael Moss (BLOOMBERG/ 731 LEX)" <mmoss19@bloomberg.net>
>> *To: *user@accumulo.apache.org
>> *Sent: *Friday, September 9, 2016 9:40:42 AM
>> *Subject: *1 of 20 TServers unresponsive/slow, all writes fail?
>>
>>
>> Hi,
>>
>> We are starting to investigate an issue where 1 tserver was up, but
>> became slow/unresponsive for several hours, yet all writes to our 20+
>> servers began to fail. We could see leading up to the failure that the
>> writes were distributed among all of the tablet servers, so it wasn't a
>> hotspot. Whenever we receive a MutationsRejectedException, we recreate the
>> BatchWriter (ACCUMULO-2990). I'm digging into the TabletServerBatchWriter
>> code, but any ideas what could cause this issue? Is there some sort of
>> initialization or healthchecking that the client does where 1 server could
>> impact all?
>>
>> Thanks.
>>
>> -Mike
>>
>> Caused by: org.apache.accumulo.core.client.TimedOutException: Servers
>> timed out [pnj-bvlt-r4n03.abc.com:31113]
>>   at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
>>   at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
>>   at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
>>   at
>>
>

--001a1140fd9e8c02da053c142603
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

--001a1140fd9e8c02da053c142603--