accumulo-user mailing list archives

From Michael Moss <michael.m...@gmail.com>
Subject Re: 1 of 20 TServers unresponsive/slow, all writes fail?
Date Fri, 09 Sep 2016 14:44:44 GMT
1.7.2 (client still 1.6.2).

I think it's an overall design issue, no? Serving metadata is a SPOF?

On Fri, Sep 9, 2016 at 10:41 AM, Christopher <ctubbsii@apache.org> wrote:

> What version of Accumulo? That could narrow down the search for potential
> known issues.
>
> On Fri, Sep 9, 2016 at 10:36 AM Michael Moss <michael.moss@gmail.com>
> wrote:
>
>> Upon further internal discussion, it looks like the metadata/root tables
>> are served from the tservers (not an HA master for example) and the one in
>> question was serving it. It was unable to run MajC (compaction) for many
>> hours leading up to the time where it couldn't service requests any longer,
>> but it was still up, hosting tablets, just very slow or unable to respond.
>> So all writes ended up timing out.
>>
>> If this condition is possible and there is a SPOF here, it'd be good to
>> see what's on the roadmap to address it.
>>
>> On Fri, Sep 9, 2016 at 10:24 AM, <dlmarion@comcast.net> wrote:
>>
>>> What was happening on that 1 tserver? Was it in garbage collection? Was
>>> it having network or O/S issues?
>>>
>>> ------------------------------
>>> *From: *"Michael Moss (BLOOMBERG/ 731 LEX)" <mmoss19@bloomberg.net>
>>> *To: *user@accumulo.apache.org
>>> *Sent: *Friday, September 9, 2016 9:40:42 AM
>>> *Subject: *1 of 20 TServers unresponsive/slow, all writes fail?
>>>
>>>
>>> Hi,
>>>
>>> We are starting to investigate an issue where 1 tserver was up, but
>>> became slow/unresponsive for several hours, yet all writes to our 20+
>>> servers began to fail. We could see leading up to the failure that the
>>> writes were distributed among all of the tablet servers, so it wasn't a
>>> hotspot. Whenever we receive a MutationsRejectedException, we recreate the
>>> BatchWriter (ACCUMULO-2990). I'm digging into the TabletServerBatchWriter
>>> code, but any ideas what could cause this issue? Is there some sort of
>>> initialization or healthchecking that the client does where 1 server could
>>> impact all?
>>>
>>> Thanks.
>>>
>>> -Mike
>>>
>>> Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [pnj-bvlt-r4n03.abc.com:31113]
>>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
>>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
>>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
>>>     at
>>>
>>>
>>
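
For reference, the "recreate the BatchWriter on MutationsRejectedException" handling described in the original message could look roughly like the sketch below. It does not use the Accumulo API: Writer, WriteRejectedException, and RecreatingWriter are hypothetical stand-ins for illustration, and the bounded attempt count is an assumption, not something stated in the thread.

```java
import java.util.function.Supplier;

/**
 * Sketch only. Writer and WriteRejectedException are hypothetical stand-ins
 * for a client-side batch writer and its rejection error; they are NOT
 * Accumulo classes.
 */
final class RecreatingWriter {

    /** Hypothetical stand-in for a batch writer. */
    public interface Writer {
        void write(String mutation) throws WriteRejectedException;
        void close();
    }

    /** Hypothetical stand-in for a rejected-mutations failure. */
    public static final class WriteRejectedException extends Exception {
        public WriteRejectedException(String message) {
            super(message);
        }
    }

    private final Supplier<Writer> factory;
    private final int maxAttempts;
    private Writer current;
    private int recreations = 0;

    public RecreatingWriter(Supplier<Writer> factory, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.factory = factory;
        this.maxAttempts = maxAttempts;
        this.current = factory.get();
    }

    /**
     * Tries the write up to maxAttempts times. After every rejection the
     * failed writer is closed and replaced, so this wrapper stays usable
     * even when the last attempt fails and the error is rethrown.
     */
    public void write(String mutation) throws WriteRejectedException {
        WriteRejectedException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                current.write(mutation);
                return;
            } catch (WriteRejectedException e) {
                last = e;
                current.close();
                current = factory.get();
                recreations++;
            }
        }
        throw last;
    }

    /** Number of times the underlying writer has been replaced. */
    public int recreations() {
        return recreations;
    }
}
```

A fresh writer only discards the failed client-side state. If the underlying cause is a server that stays slow, every new writer would presumably time out the same way, which is why the sketch bounds the attempts and rethrows instead of looping forever.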
