accumulo-user mailing list archives

From Christopher <ctubb...@apache.org>
Subject Re: 1 of 20 TServers unresponsive/slow, all writes fail?
Date Fri, 09 Sep 2016 14:41:56 GMT
What version of Accumulo are you running? That could narrow down the search
for potential known issues.

On Fri, Sep 9, 2016 at 10:36 AM Michael Moss <michael.moss@gmail.com> wrote:

> Upon further internal discussion, it looks like the metadata/root tables
> are served from the tservers (not an HA master for example) and the one in
> question was serving it. It was unable to run MajC (compaction) for many
> hours leading up to the time where it couldn't service requests any longer,
> but it was still up, hosting tablets, just very slow or unable to respond.
> So all writes ended up timing out.
>
> If this condition is possible and there is a SPOF here, it'd be good to
> see what's on the roadmap to address it.
>
> On Fri, Sep 9, 2016 at 10:24 AM, <dlmarion@comcast.net> wrote:
>
>> What was happening on that 1 tserver? Was it in garbage collection? Was
>> it having network or O/S issues?
>>
>> ------------------------------
>> *From: *"Michael Moss (BLOOMBERG/ 731 LEX)" <mmoss19@bloomberg.net>
>> *To: *user@accumulo.apache.org
>> *Sent: *Friday, September 9, 2016 9:40:42 AM
>> *Subject: *1 of 20 TServers unresponsive/slow, all writes fail?
>>
>>
>> Hi,
>>
>> We are starting to investigate an issue where 1 tserver was up, but
>> became slow/unresponsive for several hours, yet all writes to our 20+
>> servers began to fail. We could see leading up to the failure that the
>> writes were distributed among all of the tablet servers, so it wasn't a
>> hotspot. Whenever we receive a MutationsRejectedException, we recreate the
>> BatchWriter (ACCUMULO-2990). I'm digging into the TabletServerBatchWriter
>> code, but any ideas what could cause this issue? Is there some sort of
>> initialization or healthchecking that the client does where 1 server could
>> impact all?
>>
>> Thanks.
>>
>> -Mike
>>
>> Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [pnj-bvlt-r4n03.abc.com:31113]
>>   at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
>>   at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
>>   at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
>>   at
>>
>>
>
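
For readers following the thread, the "recreate the BatchWriter on
MutationsRejectedException" approach mentioned in the original message might
look roughly like the sketch below. This is only an illustration, not the
poster's actual code: SimpleWriter and RejectedException are hypothetical
stand-ins for Accumulo's BatchWriter and MutationsRejectedException so that the
example compiles without the Accumulo jars, and RecreatingWriter, the factory
and maxAttempts are likewise invented names. Note that recreating the writer
does not help if the same slow tserver still hosts the target tablets; the
retries will simply time out again, which matches the behaviour described
above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for Accumulo's MutationsRejectedException.
class RejectedException extends Exception {
    RejectedException(String msg) { super(msg); }
}

// Hypothetical stand-in for the small part of BatchWriter used in this sketch.
interface SimpleWriter {
    void write(String mutation) throws RejectedException;
    void close();
}

// Wraps a writer and replaces it with a fresh one whenever a write is rejected.
class RecreatingWriter {
    private final Supplier<SimpleWriter> factory; // assumed to build a new writer on each call
    private final int maxAttempts;                // upper bound on tries per mutation
    private SimpleWriter current;
    private int recreations = 0;

    RecreatingWriter(Supplier<SimpleWriter> factory, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.factory = factory;
        this.maxAttempts = maxAttempts;
        this.current = factory.get();
    }

    // Tries the write up to maxAttempts times, recreating the writer after each rejection.
    void write(String mutation) throws RejectedException {
        RejectedException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                current.write(mutation);
                return;
            } catch (RejectedException e) {
                last = e;
                // A writer that has rejected mutations is discarded rather than reused.
                current.close();
                current = factory.get();
                recreations++;
            }
        }
        // Every attempt was rejected: surface the last failure to the caller.
        throw last;
    }

    int recreations() {
        return recreations;
    }
}
```

Bounding the attempts keeps a single unresponsive server from stalling the
caller forever; what to do once the bound is hit (drop, queue, alert) is left
to the application.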
