accumulo-user mailing list archives

From Michael Moss <michael.m...@gmail.com>
Subject Re: 1 of 20 TServers unresponsive/slow, all writes fail?
Date Fri, 09 Sep 2016 14:36:13 GMT
Upon further internal discussion, it looks like the metadata/root tables
are served by the tservers themselves (not by an HA master, for example), and
the tserver in question was the one serving them. It had been unable to run
major compactions (MajC) for many hours before it stopped servicing requests.
It was still up and hosting tablets, just very slow or unable to respond, so
all writes ended up timing out.

If this condition is possible and there is a SPOF here, it'd be good to see
what's on the roadmap to address it.
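
For reference, here is a minimal sketch of the "recreate the writer on
failure" pattern described in the quoted message below. It deliberately does
not use the Accumulo client classes: the Writer interface, the factory, and
the demo helper are hypothetical stand-ins, and the failure is simulated. It
only illustrates the control flow, with a bound on the number of attempts.

```java
import java.util.function.Supplier;

public class RecreateOnFailure {

    /** Hypothetical stand-in for a batch writer; NOT the Accumulo interface. */
    interface Writer extends AutoCloseable {
        void write(String mutation) throws Exception;
        @Override void close();
    }

    /**
     * Writes one mutation, discarding and recreating the writer after each
     * failure. Returns the writer that succeeded, or rethrows the last error
     * once maxAttempts writes have failed.
     */
    static Writer writeWithRecreate(Writer current, Supplier<Writer> factory,
                                    String mutation, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                current.write(mutation);
                return current;              // success: keep using this writer
            } catch (Exception e) {
                current.close();             // the failed writer is discarded
                if (attempt == maxAttempts) {
                    throw e;                 // out of attempts: surface the error
                }
                current = factory.get();     // otherwise start over with a fresh one
            }
        }
        throw new IllegalStateException("unreachable");
    }

    /**
     * Simulation: the first failuresBeforeSuccess writes fail regardless of
     * which writer issues them. Returns how many writers were created, or -1
     * if the write never succeeded within maxAttempts.
     */
    static int demo(int failuresBeforeSuccess, int maxAttempts) {
        int[] created = {0};
        int[] calls = {0};
        Supplier<Writer> factory = () -> {
            created[0]++;
            return new Writer() {
                @Override public void write(String m) throws Exception {
                    if (calls[0]++ < failuresBeforeSuccess) {
                        throw new Exception("timed out (simulated)");
                    }
                }
                @Override public void close() { }
            };
        };
        try {
            writeWithRecreate(factory.get(), factory, "row1", maxAttempts);
            return created[0];
        } catch (Exception e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("writers created: " + demo(2, 5));
        System.out.println("gave up: " + (demo(10, 3) == -1));
    }
}
```

Note that in the simulation the failure does not depend on which writer is
used. If every new writer ends up waiting on the same slow server, as seems to
have been the case here, recreating it just leads to another timeout.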

On Fri, Sep 9, 2016 at 10:24 AM, <dlmarion@comcast.net> wrote:

> What was happening on that 1 tserver? Was it in garbage collection? Was it
> having network or O/S issues?
>
> ------------------------------
> *From: *"Michael Moss (BLOOMBERG/ 731 LEX)" <mmoss19@bloomberg.net>
> *To: *user@accumulo.apache.org
> *Sent: *Friday, September 9, 2016 9:40:42 AM
> *Subject: *1 of 20 TServers unresponsive/slow, all writes fail?
>
>
> Hi,
>
> We are starting to investigate an issue where 1 tserver was up, but became
> slow/unresponsive for several hours, yet all writes to our 20+ servers
> began to fail. We could see leading up to the failure that the writes were
> distributed among all of the tablet servers, so it wasn't a hotspot.
> Whenever we receive a MutationsRejectedException, we recreate the
> BatchWriter (ACCUMULO-2990). I'm digging into the TabletServerBatchWriter
> code, but any ideas what could cause this issue? Is there some sort of
> initialization or healthchecking that the client does where 1 server could
> impact all?
>
> Thanks.
>
> -Mike
>
> Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [pnj-bvlt-r4n03.abc.com:31113]
>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
>     at
>
>
