accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-1454) Need good way to perform a rolling restart of all tablet servers
Date Thu, 23 May 2013 19:41:22 GMT


Keith Turner commented on ACCUMULO-1454:

Seems like you would want an exception for metadata tablets?  Reassign those tablets immediately.

bq.  For most cases, we could even do neat stuff like wait for all scans to cease for a tablet
before we migrate it away.

Currently when a tablet is closed, it interrupts running scans. Waiting for scans to finish
is tricky, because it seems like you would not want to allow new scans to start. So while
scans running before the close would see no delay, scans started after the close would still
see one.
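The drain-on-close behavior being discussed could be sketched roughly as follows. This is an illustrative helper, not Accumulo's actual tablet-close code; the class name `TabletScanGate` and its methods are hypothetical. The idea is that once close begins, new scans are rejected (and thus see the delay), while in-flight scans are allowed to finish within a bounded wait instead of being interrupted:

```java
// Hypothetical sketch of "wait for all scans to cease" on tablet close.
// Not Accumulo code: names and structure are illustrative only.
public class TabletScanGate {
    private int activeScans = 0;
    private boolean closing = false;

    // A scan asks to start; once close has begun, new scans are rejected.
    public synchronized boolean tryBeginScan() {
        if (closing) {
            return false;
        }
        activeScans++;
        return true;
    }

    // A scan reports completion, waking the closer if it is draining.
    public synchronized void endScan() {
        activeScans--;
        notifyAll();
    }

    // Stop admitting new scans, then wait (up to timeoutMillis) for
    // in-flight scans to finish instead of interrupting them.
    // Returns false if the drain timed out and the caller should
    // fall back to interrupting the remaining scans.
    public synchronized boolean closeAndDrain(long timeoutMillis) {
        closing = true;
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (activeScans > 0) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) {
                return false; // gave up waiting
            }
            try {
                wait(remaining);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        TabletScanGate gate = new TabletScanGate();
        System.out.println(gate.tryBeginScan()); // scan admitted before close
        gate.endScan();
        System.out.println(gate.closeAndDrain(10)); // drains immediately
        System.out.println(gate.tryBeginScan()); // rejected after close begins
    }
}
```

This makes the asymmetry in the comment concrete: the gate never delays a scan that was already running, but anything arriving after `closeAndDrain` begins is turned away and must wait for reassignment.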

bq. I think that's the key to an elegant solution here: ensure a delay long enough for the
tserver to come back and continue serving the tablets it had been

Could possibly record this tablet state in the metadata table as opposed to keeping it in
the master's memory. That is, put an entry in the metadata table for a tablet that indicates
the master should delay assigning it until a particular tablet server becomes active again.
If the master does not see that tablet server for a period of time, it could ignore those
entries in the metadata table and assign anyway.
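The master-side check implied by such an entry could look something like the sketch below. The `SuspensionEntry` class, its fields, and the column it would model are all hypothetical; this is not Accumulo's actual metadata schema, just an illustration of the "delay assignment until the tserver returns or a timeout passes" decision:

```java
// Hypothetical model of a per-tablet suspension entry that could be
// stored in the metadata table, as suggested above. Not Accumulo's
// actual schema: names and fields are illustrative only.
public class SuspensionEntry {
    final String waitingForServer;  // tserver the tablet is waiting for
    final long suspendedAtMillis;   // when the suspension was recorded

    SuspensionEntry(String server, long suspendedAtMillis) {
        this.waitingForServer = server;
        this.suspendedAtMillis = suspendedAtMillis;
    }

    // The master delays assignment while the entry is fresh. Once the
    // expected tserver has been gone longer than the timeout, the entry
    // is ignored and the tablet is assigned elsewhere.
    public boolean shouldDelayAssignment(boolean serverIsBack,
                                         long nowMillis,
                                         long timeoutMillis) {
        if (serverIsBack) {
            return false; // tserver returned: assign back to it normally
        }
        return nowMillis < suspendedAtMillis + timeoutMillis;
    }

    public static void main(String[] args) {
        SuspensionEntry e = new SuspensionEntry("tserver1:9997", 0L);
        // One minute into a five-minute grace period: hold the tablet.
        System.out.println(e.shouldDelayAssignment(false, 60_000L, 300_000L));
        // Ten minutes in, tserver never came back: assign elsewhere.
        System.out.println(e.shouldDelayAssignment(false, 600_000L, 300_000L));
    }
}
```

Keeping this entry in the metadata table rather than master memory has the property the comment is after: a master restart during the rolling restart would not lose track of which tablets are intentionally being held back.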
> Need good way to perform a rolling restart of all tablet servers
> ----------------------------------------------------------------
>                 Key: ACCUMULO-1454
>                 URL:
>             Project: Accumulo
>          Issue Type: Improvement
>          Components: tserver
>    Affects Versions: 1.5.0, 1.4.3
>            Reporter: Mike Drob
> When needing to change a tserver parameter (e.g. java heap space) across the entire cluster,
> there is not a graceful way to perform a rolling restart.
> The naive approach of just killing tservers one at a time causes a lot of churn on the
> cluster as tablets move around and ZooKeeper tries to maintain current state.
> Potential solutions might be via a fancy fate operation, with coordination by the master.
> Ideally, the master would know which servers are 'safe' to restart and could minimize overall
> impact during the operation.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see:
