accumulo-user mailing list archives

From dlmar...@comcast.net
Subject Re: 1 of 20 TServers unresponsive/slow, all writes fail?
Date Fri, 09 Sep 2016 15:05:25 GMT

We have seen this before: a tserver that is hosting metadata tablets has issues and starts
causing problems within the cluster. You could try using the HostRegexTableLoadBalancer[1,2]
to segregate your metadata tablets from the other tables. This doesn't fully eliminate the
SPOF, but it should help to ensure that the tablet servers hosting the metadata tablets are
not busy doing work for other tables. 

To do this, run the following in the Accumulo shell, then restart the master:

1) Set the 'master.tablet.balancer' property to the HostRegexTableLoadBalancer class name
2) Set the property 'table.custom.balancer.host.regex.accumulo.metadata=<regex>'
3) Set other HostRegexTableLoadBalancer properties if desired
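
As a rough sketch, steps 1 and 2 might look like this in the Accumulo shell. It assumes the properties are set system-wide with 'config -s', and takes the fully qualified class name from the package path in [2]. '<regex>' is left as a placeholder for a pattern matching the hosts that should serve the metadata tablets. The '#' lines are annotations only and are not meant to be typed.

```
# Step 1: switch the master to the host-regex balancer (class name per [2])
config -s master.tablet.balancer=org.apache.accumulo.server.master.balancer.HostRegexTableLoadBalancer

# Step 2: restrict accumulo.metadata tablets to hosts matching <regex>
# (assumption: this is set at the system level, like the balancer itself)
config -s table.custom.balancer.host.regex.accumulo.metadata=<regex>
```

Running 'config -f balancer' afterwards should list the matching properties, which is a quick way to confirm the values before restarting the master.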

[1] https://issues.apache.org/jira/browse/ACCUMULO-4173 
[2] https://github.com/apache/accumulo/blob/rel/1.7.2/server/base/src/main/java/org/apache/accumulo/server/master/balancer/HostRegexTableLoadBalancer.java


----- Original Message -----

From: "Michael Moss" <michael.moss@gmail.com> 
To: user@accumulo.apache.org 
Cc: "Michael Moss" <mmoss19@bloomberg.net> 
Sent: Friday, September 9, 2016 10:44:44 AM 
Subject: Re: 1 of 20 TServers unresponsive/slow, all writes fail? 

1.7.2 (client still 1.6.2). 

I think it's an overall design issue, no? Serving metadata is a SPOF? 

On Fri, Sep 9, 2016 at 10:41 AM, Christopher < ctubbsii@apache.org > wrote: 



What version of Accumulo? That could narrow down the search for known issues. 

On Fri, Sep 9, 2016 at 10:36 AM Michael Moss < michael.moss@gmail.com > wrote: 

<blockquote>

Upon further internal discussion, it looks like the metadata/root tables are served from the
tservers (not an HA master for example) and the one in question was serving it. It was unable
to run MajC (compaction) for many hours leading up to the time where it couldn't service requests
any longer, but it was still up, hosting tablets, just very slow or unable to respond. So
all writes ended up timing out. 

If this condition is possible and there is a SPOF here, it'd be good to see what's on the
roadmap to address it. 

On Fri, Sep 9, 2016 at 10:24 AM, < dlmarion@comcast.net > wrote: 

<blockquote>

What was happening on that 1 tserver? Was it in garbage collection? Was it having network
or O/S issues? 


From: "Michael Moss (BLOOMBERG/ 731 LEX)" < mmoss19@bloomberg.net > 
To: user@accumulo.apache.org 
Sent: Friday, September 9, 2016 9:40:42 AM 
Subject: 1 of 20 TServers unresponsive/slow, all writes fail? 


Hi, 

We are starting to investigate an issue where 1 tserver was up, but became slow/unresponsive
for several hours, yet all writes to our 20+ servers began to fail. We could see leading up
to the failure that the writes were distributed among all of the tablet servers, so it wasn't
a hotspot. Whenever we receive a MutationsRejectedException, we recreate the BatchWriter (ACCUMULO-2990).
I'm digging into the TabletServerBatchWriter code, but any ideas what could cause this issue?
Is there some sort of initialization or healthchecking that the client does where 1 server
could impact all? 

Thanks. 

-Mike 

Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [pnj-bvlt-r4n03.abc.com:31113]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
    at






</blockquote>


</blockquote>



