accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: 1 of 20 TServers unresponsive/slow, all writes fail?
Date Fri, 09 Sep 2016 15:35:06 GMT
In short, yes. This is mitigated by the fact that the metadata table can
be split into many tablets. As such, not all tables would be affected by
a single metadata tablet being unreachable (Dave's solution helps here).

One possible solution which could be investigated is what HBase calls 
"Timeline-Consistent High Available Reads"[1]. Essentially, in 
addition to the read-write Tablet (as is currently the case), there are 
one or more read-only copies of a Tablet. This helps mitigate the case 
where some data is unreachable due to TabletServer problems.
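
As a rough illustration of that read path, here is a minimal,
self-contained Java sketch. It is not HBase's or Accumulo's actual API:
TabletCopy, ReadResult and the timeout value are all hypothetical names
made up for this example. The reader asks the read-write copy first and,
if it does not answer in time, falls back to a read-only copy and marks
the result as possibly stale.

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimelineReadSketch {

    /** Hypothetical stand-in for a server hosting one copy of a tablet. */
    interface TabletCopy {
        String read(String row) throws Exception;
    }

    /** A value plus a flag saying whether it came from a read-only copy. */
    static final class ReadResult {
        final String value;
        final boolean stale;

        ReadResult(String value, boolean stale) {
            this.value = value;
            this.stale = stale;
        }
    }

    /**
     * Try the read-write copy first. If it fails or does not answer within
     * primaryTimeoutMs, try each read-only copy in order and flag the
     * answer as possibly stale.
     */
    static ReadResult read(String row, TabletCopy primary, List<TabletCopy> replicas,
                           long primaryTimeoutMs, ExecutorService pool) throws Exception {
        Future<String> pending = pool.submit(() -> primary.read(row));
        try {
            return new ReadResult(pending.get(primaryTimeoutMs, TimeUnit.MILLISECONDS), false);
        } catch (TimeoutException | ExecutionException e) {
            pending.cancel(true); // stop waiting on the slow or failed primary
        }
        for (TabletCopy replica : replicas) {
            try {
                return new ReadResult(replica.read(row), true);
            } catch (Exception ignored) {
                // this copy is unavailable too; try the next one
            }
        }
        throw new TimeoutException("no copy of the tablet answered for row " + row);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            // Simulates a tserver that is up but very slow to respond.
            TabletCopy slowPrimary = row -> { Thread.sleep(5_000); return "fresh"; };
            TabletCopy replica = row -> "possibly-old";
            ReadResult r = read("row1", slowPrimary, List.of(replica), 100, pool);
            System.out.println(r.value + " stale=" + r.stale); // possibly-old stale=true
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that this only covers reads. Writes still have to go to the single
read-write copy, so a sketch like this would not by itself have kept the
writes in this thread from timing out.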

However, this idea does make me a little wary for use with the metadata 
table.

Trying to figure out what happened on that node and get you a solution 
would be my preferred path forward :)

[1] http://hbase.apache.org/book.html#arch.timelineconsistent.reads

Michael Moss wrote:
> 1.7.2 (client still 1.6.2).
>
> I think it's an overall design issue, no? Serving metadata is a SPOF?
>
> On Fri, Sep 9, 2016 at 10:41 AM, Christopher <ctubbsii@apache.org> wrote:
>
>     What version of Accumulo? That could narrow down the search for
>     potential known issues.
>
>     On Fri, Sep 9, 2016 at 10:36 AM Michael Moss <michael.moss@gmail.com> wrote:
>
>         Upon further internal discussion, it looks like the
>         metadata/root tables are served from the tservers (not an HA
>         master for example) and the one in question was serving it. It
>         was unable to run MajC (compaction) for many hours leading up to
>         the time where it couldn't service requests any longer, but it
>         was still up, hosting tablets, just very slow or unable to
>         respond. So all writes ended up timing out.
>
>         If this condition is possible and there is a SPOF here, it'd be
>         good to see what's on the roadmap to address it.
>
>         On Fri, Sep 9, 2016 at 10:24 AM, <dlmarion@comcast.net> wrote:
>
>             What was happening on that 1 tserver? Was it in garbage
>             collection? Was it having network or O/S issues?
>
>             ------------------------------------------------------------------------
>             From: "Michael Moss (BLOOMBERG/ 731 LEX)" <mmoss19@bloomberg.net>
>             To: user@accumulo.apache.org
>             Sent: Friday, September 9, 2016 9:40:42 AM
>             Subject: 1 of 20 TServers unresponsive/slow, all writes fail?
>
>
>             Hi,
>
>             We are starting to investigate an issue where 1 tserver was
>             up, but became slow/unresponsive for several hours, yet all
>             writes to our 20+ servers began to fail. We could see
>             leading up to the failure that the writes were distributed
>             among all of the tablet servers, so it wasn't a hotspot.
>             Whenever we receive a MutationsRejectedException, we
>             recreate the BatchWriter (ACCUMULO-2990). I'm digging into
>             the TabletServerBatchWriter code, but any ideas what could
>             cause this issue? Is there some sort of initialization or
>             healthchecking that the client does where 1 server could
>             impact all?
>
>             Thanks.
>
>             -Mike
>
>             Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [pnj-bvlt-r4n03.abc.com:31113]
>                 at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
>                 at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
>                 at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
>                 at
>
>
>
