Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
MIME-Version: 1.0
References: <57D2BBDA005E0484003902FA_0_27570@msclnjpmsgsv02>
 <805160347.15893084.1473431074977.JavaMail.zimbra@comcast.net> <CAOiqz+Rv8co6LmjdCQCZUCXbt+36m2+TwZXp5h=Z0NZdJ8a7vg@mail.gmail.com>
In-Reply-To: <CAOiqz+Rv8co6LmjdCQCZUCXbt+36m2+TwZXp5h=Z0NZdJ8a7vg@mail.gmail.com>
From: Christopher <ctubbsii@apache.org>
Date: Fri, 09 Sep 2016 14:41:56 +0000
Message-ID: <CAL5zq9ak9jaPQcvoVz9HDjUE7MN6wiZQRw22t2NaN9WzQ0ZUZg@mail.gmail.com>
Subject: Re: 1 of 20 TServers unresponsive/slow, all writes fail?
To: user@accumulo.apache.org
Cc: Michael Moss <mmoss19@bloomberg.net>
Content-Type: multipart/alternative; boundary=001a1140fd9e8c02da053c142603
archived-at: Fri, 09 Sep 2016 14:42:10 -0000

--001a1140fd9e8c02da053c142603
Content-Type: text/plain; charset=UTF-8

What version of Accumulo are you running? Knowing that could narrow down
the search for known issues.

On Fri, Sep 9, 2016 at 10:36 AM Michael Moss <michael.moss@gmail.com> wrote:

> Upon further internal discussion, it looks like the metadata/root tables
> are served from the tservers (not from an HA master, for example) and the
> one in question was hosting them. It was unable to run MajC (major
> compaction) for many hours leading up to the point when it could no longer
> service requests, but it was still up and hosting tablets, just very slow
> or unable to respond. So all writes ended up timing out.
>
> If this condition is possible and there is a SPOF here, it'd be good to
> see what's on the roadmap to address it.
>
> On Fri, Sep 9, 2016 at 10:24 AM, <dlmarion@comcast.net> wrote:
>
>> What was happening on that 1 tserver? Was it in garbage collection? Was
>> it having network or O/S issues?
>>
>> ------------------------------
>> *From: *"Michael Moss (BLOOMBERG/ 731 LEX)" <mmoss19@bloomberg.net>
>> *To: *user@accumulo.apache.org
>> *Sent: *Friday, September 9, 2016 9:40:42 AM
>> *Subject: *1 of 20 TServers unresponsive/slow, all writes fail?
>>
>>
>> Hi,
>>
>> We are starting to investigate an issue where 1 tserver was up, but
>> became slow/unresponsive for several hours, yet all writes to our 20+
>> servers began to fail. We could see leading up to the failure that the
>> writes were distributed among all of the tablet servers, so it wasn't a
>> hotspot. Whenever we receive a MutationsRejectedException, we recreate the
>> BatchWriter (ACCUMULO-2990). I'm digging into the TabletServerBatchWriter
>> code, but any ideas what could cause this issue? Is there some sort of
>> initialization or healthchecking that the client does where 1 server could
>> impact all?
>>
>> Thanks.
>>
>> -Mike
>>
>> Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [pnj-bvlt-r4n03.abc.com:31113]
>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
>>     at
>>
>>
>
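
For anyone following the thread, here is a minimal sketch of the
"recreate the writer after a rejection" pattern described in the original
post. It is not Accumulo code: Writer and RejectedException are
hypothetical stand-ins for BatchWriter and MutationsRejectedException, and
RecreatingWriter is an illustrative wrapper, so the example runs with the
JDK alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class RecreateOnReject {

    /** Hypothetical stand-in for MutationsRejectedException. */
    static class RejectedException extends Exception {
        RejectedException(String msg) { super(msg); }
    }

    /** Hypothetical stand-in for the parts of BatchWriter the pattern needs. */
    interface Writer {
        void add(String mutation) throws RejectedException;
        void close() throws RejectedException;
    }

    /** Wraps a writer and replaces it with a fresh one after any rejection. */
    static class RecreatingWriter {
        private final Supplier<Writer> factory;
        private Writer current;
        private int recreations = 0;

        RecreatingWriter(Supplier<Writer> factory) {
            this.factory = factory;
            this.current = factory.get();
        }

        /** Returns true if the mutation was accepted, false if it was rejected. */
        boolean write(String mutation) {
            try {
                current.add(mutation);
                return true;
            } catch (RejectedException e) {
                // Treat the failed writer as unusable: close it quietly, build a new one.
                try { current.close(); } catch (RejectedException ignored) { }
                current = factory.get();
                recreations++;
                return false; // the caller decides whether to retry the mutation
            }
        }

        int recreations() { return recreations; }
    }

    /** Scenario: the first writer rejects everything, later writers accept. */
    static int[] demo() {
        List<String> accepted = new ArrayList<>();
        int[] created = {0};
        Supplier<Writer> factory = () -> {
            final boolean broken = (created[0]++ == 0); // only the first writer fails
            return new Writer() {
                public void add(String m) throws RejectedException {
                    if (broken) throw new RejectedException("simulated timeout");
                    accepted.add(m);
                }
                public void close() { }
            };
        };
        RecreatingWriter w = new RecreatingWriter(factory);
        if (!w.write("row1")) {
            w.write("row1"); // retry once on the fresh writer
        }
        w.write("row2");
        return new int[] { created[0], w.recreations(), accepted.size() };
    }

    public static void main(String[] args) {
        int[] r = demo();
        // prints "created=2 recreations=1 accepted=2"
        System.out.println("created=" + r[0] + " recreations=" + r[1] + " accepted=" + r[2]);
    }
}
```

Note that recreating the writer only resets client-side state. If the
server behind the timeout stays slow, a fresh writer sending to the same
server would be expected to time out as well.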

--001a1140fd9e8c02da053c142603--