accumulo-notifications mailing list archives

From "Shawn Walker (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-4353) Stabilize tablet assignment during transient failure
Date Fri, 24 Jun 2016 15:15:16 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-4353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15348409#comment-15348409
] 

Shawn Walker commented on ACCUMULO-4353:
----------------------------------------

bq. Are you attempting to design a mechanism that could be used to avoid re-balancing and
have the master keep assignments where they were previously, knowing that servers will come
back into operation?
That is the idea, yes.  There is definitely a tradeoff here.

bq. If this is really about trying to make rolling-restarts better, I'd encourage a look at
ACCUMULO-1454.
As I mentioned before, I hadn't seen ACCUMULO-1454 before starting this.  I've now looked
at the discussion of that ticket.  What I implemented was approximately what Christopher Tubbs
and David Medinets were suggesting.  I also read through Keith Turner's design proposal summary.
 I have some reservations about it:
* It requires that each planned restart involves tablet servers changing ports.  While the
recent changes to Accumulo to support a narrow port range during port search would make this
more plausible, it might still prove difficult to establish firewall rules for Accumulo. 
(Sean Busbey raises this issue in the discussion).
* What happens if a tablet is split after migration starts?  It seems to me there might be
a race condition here which would lead to incomplete migration between sibling tablet servers.
 Do we block assignment during the rolling restart, too?  That seems like a cure worse
than the problem.
* Even barring those two concerns, I again raise the spectre of ops complexity.  To transition
a single server, I need to know (a) which port the "old" tserver was running on, and (b) which
port the "new" tserver is running on.  If I'm using some sort of dynamic port assignment (which
I would need to unless I pointed the "new" tserver at an entirely different configuration),
it could be non-trivial to gather these pieces of information.  While the burden on the operator
of a cluster of 5 tservers might not be significant, the burden on the operator of a cluster
of 200 tservers might make this approach infeasible. And the non-triviality of determining
the correct port migration mapping would also make the process difficult to robustly automate.

bq. While seeing a pull request accompanying the issue reported, it seems a bit premature
to me to see code without some discussion on what the problems are and how best to solve them.
Ahh, my mistake then.  As a new contributor to Accumulo, I still don't have a full grasp
of the rules, either written or unwritten.  My feeling from watching the list was that the primary
modus operandi was to present a (fully implemented) solution along with a proposed problem,
and then to discuss the merits of the solution.


> Stabilize tablet assignment during transient failure
> ----------------------------------------------------
>
>                 Key: ACCUMULO-4353
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4353
>             Project: Accumulo
>          Issue Type: Improvement
>            Reporter: Shawn Walker
>            Assignee: Shawn Walker
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When a tablet server dies, Accumulo attempts to reassign the tablets it was hosting as
quickly as possible to maintain availability.  If multiple tablet servers die in quick succession,
such as from a rolling restart of the Accumulo cluster or a network partition, this behavior
can cause a storm of reassignment and rebalancing, placing significant load on the master.
> To avert such load, Accumulo should be capable of maintaining a steady tablet assignment
state in the face of transient tablet server loss.  Instead of reassigning tablets as quickly
as possible, Accumulo should await the return of a temporarily downed tablet server (for
some configurable duration) before assigning its tablets to other tablet servers.
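
To make the proposed behavior concrete, here is a minimal sketch of the "wait before reassigning" rule from the description above. It is not the code from the pull request and it uses no Accumulo APIs: the class name, method names, and the millisecond grace period are all hypothetical, chosen only to illustrate the idea.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: none of these names are Accumulo APIs.
class ReassignmentGrace {
    // Assumed configurable duration to wait before reassigning tablets.
    private final long graceMillis;
    // Server id -> time (ms) at which it was first observed to be down.
    private final Map<String, Long> downSince = new HashMap<>();

    ReassignmentGrace(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    // Record that a tablet server was observed down; the first observation wins.
    void serverDown(String server, long nowMillis) {
        downSince.putIfAbsent(server, nowMillis);
    }

    // Record that the server came back; its tablets can stay where they were.
    void serverReturned(String server) {
        downSince.remove(server);
    }

    // True only once the server has been down for at least the grace period.
    boolean shouldReassign(String server, long nowMillis) {
        Long since = downSince.get(server);
        return since != null && nowMillis - since >= graceMillis;
    }
}
```

Under these assumptions, a server that returns within the grace period never triggers reassignment, while one that stays down past it is handled as it is today. The tradeoff is that the tablets of a downed server are unavailable while the master waits.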



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
