accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
Date Fri, 13 Feb 2015 18:50:14 GMT


Josh Elser commented on ACCUMULO-1454:

I think we chatted about this recently: there's an issue with handling newer versions of RFiles
and WALs in the middle of a rolling restart. For example:

1. Server1 is restarted as the new version
2. Server1 writes some new data
3. Server1 dies
4. Server2 (still the old version) gets the tablets from Server1, and may be unable to read what Server1 wrote

We need a control that keeps the new software from writing out new versions of persistent
files while old versions of the software are still participating in the instance. It's similar
to finalizing an upgrade: once we're sure that all of the servers have been upgraded and are
functioning well, we can flip them over to using new messages/serialization that the old
versions aren't aware of.
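
A minimal sketch of such a control, under stated assumptions: the class name, the format version numbers, and the "finalized" flag are all hypothetical and do not reflect Accumulo's actual RFile/WAL versioning or any existing API. The idea is only that a server writes the newest format that every live server can read, and never the new format before the upgrade has been explicitly finalized.

```java
import java.util.Collection;
import java.util.List;

// Hypothetical illustration only; not part of Accumulo.
class WriteVersionGate {
    // Assumed format versions for persistent files (RFile/WAL stand-ins).
    static final int OLD_FORMAT = 1;
    static final int NEW_FORMAT = 2;

    /**
     * Picks the format version a server is allowed to write.
     *
     * @param maxReadableByServer newest format each live server can read
     * @param upgradeFinalized    true once an operator has finalized the upgrade
     */
    static int writableFormat(Collection<Integer> maxReadableByServer, boolean upgradeFinalized) {
        // Start from the newest format and lower it to what the oldest server can read.
        int newestCommon = NEW_FORMAT;
        for (int version : maxReadableByServer) {
            newestCommon = Math.min(newestCommon, version);
        }
        // Until the upgrade is finalized, stay on the old format even if every
        // server already runs the new software.
        return upgradeFinalized ? newestCommon : Math.min(newestCommon, OLD_FORMAT);
    }

    public static void main(String[] args) {
        // Server1 restarted on the new version, Server2 still on the old one.
        List<Integer> midRestart = List.of(NEW_FORMAT, OLD_FORMAT);
        System.out.println("mid-restart: " + writableFormat(midRestart, false));

        // Every server upgraded, but the upgrade is not finalized yet.
        List<Integer> allUpgraded = List.of(NEW_FORMAT, NEW_FORMAT);
        System.out.println("upgraded, not finalized: " + writableFormat(allUpgraded, false));
        System.out.println("upgraded, finalized: " + writableFormat(allUpgraded, true));
    }
}
```

With a gate like this, the data Server1 writes in step 2 stays in the old format, so Server2 can still read it after taking over the tablets in step 4.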

This problem gets much easier once we are using Thrift/PB for serialization, because both can
naturally read newer versions of the messages they already know about and simply ignore the
new fields.
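
To illustrate the "ignore the new fields" behavior, here is a simplified sketch. It does not use the Thrift or Protocol Buffers wire formats; the tag/length/payload layout, the class name, and the tag constants are assumptions made for the example. What it shares with those formats is the principle: each field carries a tag and enough framing that a reader can skip fields it does not recognize.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the Thrift or Protocol Buffers encoding.
class TaggedRecord {
    static final int TAG_ROW = 1;       // known to old and new readers
    static final int TAG_VALUE = 2;     // known to old and new readers
    static final int TAG_NEW_FIELD = 3; // added by the new version

    /** Encodes each field as: int tag, int payload length, UTF-8 payload. */
    static byte[] write(Map<Integer, String> fields) {
        int size = 0;
        for (String value : fields.values()) {
            size += 8 + value.getBytes(StandardCharsets.UTF_8).length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (Map.Entry<Integer, String> field : fields.entrySet()) {
            byte[] payload = field.getValue().getBytes(StandardCharsets.UTF_8);
            buf.putInt(field.getKey());
            buf.putInt(payload.length);
            buf.put(payload);
        }
        return buf.array();
    }

    /** Decodes a record, keeping only fields whose tags the reader knows. */
    static Map<Integer, String> readKnown(byte[] data, Set<Integer> knownTags) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        Map<Integer, String> result = new HashMap<>();
        while (buf.hasRemaining()) {
            int tag = buf.getInt();
            int length = buf.getInt();
            byte[] payload = new byte[length];
            buf.get(payload); // always consume the payload so unknown fields are skipped cleanly
            if (knownTags.contains(tag)) {
                result.put(tag, new String(payload, StandardCharsets.UTF_8));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // A new-version writer includes a field the old reader has never heard of.
        byte[] data = write(Map.of(TAG_ROW, "row1", TAG_VALUE, "v1", TAG_NEW_FIELD, "extra"));
        // An old-version reader only knows the first two tags.
        System.out.println(readKnown(data, Set.of(TAG_ROW, TAG_VALUE)));
    }
}
```

The length prefix is what makes skipping possible: an old reader does not need to understand a field to step over it.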

> Need good way to perform a rolling restart of all tablet servers
> ----------------------------------------------------------------
>                 Key: ACCUMULO-1454
>                 URL:
>             Project: Accumulo
>          Issue Type: Sub-task
>          Components: tserver
>    Affects Versions: 1.4.3, 1.5.0
>            Reporter: Mike Drob
>         Attachments: ACCUMULO-1454-proposal-01.adoc, ACCUMULO-1454-proposal-01.html
> When needing to change a tserver parameter (e.g. java heap space) across the entire cluster,
> there is not a graceful way to perform a rolling restart.
> The naive approach of just killing tservers one at a time causes a lot of churn on the
> cluster as tablets move around and zookeeper tries to maintain current state.
> Potential solutions might be via a fancy fate operation, with coordination by the master.
> Ideally, the master would know which servers are 'safe' to restart and could minimize overall
> impact during the operation.

This message was sent by Atlassian JIRA
