helix-commits mailing list archives

From "Jiajun Wang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HELIX-659) Extend Helix to Support Resource with Multiple States
Date Mon, 10 Jul 2017 19:20:00 GMT

    [ https://issues.apache.org/jira/browse/HELIX-659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030755#comment-16030755 ]

Jiajun Wang edited comment on HELIX-659 at 7/10/17 7:19 PM:

Based on all that is discussed above, let us imagine a resource represented by 3 independent
state models: MasterSlave, ReadWrite, and Versions. The following figure shows three possible
state transitions for a replica of the resource.


Partition 1 has some internal error. So although it is still the master, it transitions to the
"Error" state. Meanwhile, its version needs to be upgraded.
Partition 2 is changed to "R/W", probably because partition 1 is no longer serving as an
"R/W" node.
As for partition 3, all its states are changed.

The difficulties of supporting this request with the current Helix system include, but are not
limited to, the following aspects.

*It is hard to define a state machine or transition constraints for all state models using a
single state model*

For a dynamic state, a pre-defined state model won't work at all.

But even if we only consider regular states, there is still a problem. With our existing framework,
supporting such a scenario requires a very complex state model that combines all 3 models.
The result would be 2 * 3 * 4 = 24 states and around 80 possible transition paths, which
would be very hard to code.
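
The state count above is simply the size of the Cartesian product of the three models. A minimal sketch of that arithmetic follows; the state names are placeholders (the comment only gives the counts 2, 3 and 4), and the class is illustrative rather than part of Helix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: state names are placeholders, not taken from Helix.
final class CombinedStateModelSketch {
  /** Returns every combination of one state from each model (Cartesian product). */
  static List<List<String>> combine(List<List<String>> models) {
    List<List<String>> result = new ArrayList<>();
    result.add(new ArrayList<>());
    for (List<String> model : models) {
      List<List<String>> next = new ArrayList<>();
      for (List<String> prefix : result) {
        for (String state : model) {
          List<String> combined = new ArrayList<>(prefix);
          combined.add(state);
          next.add(combined);
        }
      }
      result = next;
    }
    return result;
  }

  public static void main(String[] args) {
    List<List<String>> models = Arrays.asList(
        Arrays.asList("A1", "A2"),
        Arrays.asList("B1", "B2", "B3"),
        Arrays.asList("C1", "C2", "C3", "C4"));
    // One combined state per element of the product: 2 * 3 * 4 = 24.
    System.out.println(combine(models).size());
  }
}
```

Every additional state model multiplies the number of combined states, which is why a single merged model quickly becomes hard to maintain.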

*State transitions will potentially be inefficient*

Imagine that each state transition message contains the delta of a single state. The messages
would be as follows.

Partitions	State transitions
R1	(Online, R/W, 1.0.1) → (Online, Error, 1.0.1)
	(Online, Error, 1.0.1) → (Online, Error, 1.0.2)
R2	(Online, Init, 1.0.1) → (Online, R/W, 1.0.1)
R3	(Offline, Init, 1.0.1) → (Online, Init, 1.0.1)
	(Online, Init, 1.0.1) → (Online, Ready, 1.0.1)
	(Online, Ready, 1.0.1) → (Online, Ready, 1.0.2)

Obviously, this strategy increases traffic and makes the whole transition process much slower.
So a simpler design is for one message to carry all the necessary information.

Partitions	State transitions
R1	(Online, R/W, 1.0.1) → (Online, Error, 1.0.2)
R2	(Online, Init, 1.0.1) → (Online, R/W, 1.0.1)
R3	(Offline, Init, 1.0.1) → (Online, Ready, 1.0.2)
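
The relationship between the two tables can be sketched in code: a combined transition is split into steps that each change a single state. This is only an illustration under the assumption that states are changed in tuple order; `StateDeltaSketch` and `singleStateSteps` are hypothetical names, not Helix APIs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not a Helix class.
final class StateDeltaSketch {
  /**
   * Splits a combined transition into single-state steps, changing one
   * state at a time in tuple order. Each returned element is the full
   * state tuple reached after one step.
   */
  static List<List<String>> singleStateSteps(List<String> from, List<String> to) {
    if (from.size() != to.size()) {
      throw new IllegalArgumentException("state tuples must have the same length");
    }
    List<List<String>> steps = new ArrayList<>();
    List<String> current = new ArrayList<>(from);
    for (int i = 0; i < from.size(); i++) {
      if (!current.get(i).equals(to.get(i))) {
        current.set(i, to.get(i));
        steps.add(new ArrayList<>(current));
      }
    }
    return steps;
  }
}
```

For R3, going from (Offline, Init, 1.0.1) to (Online, Ready, 1.0.2) yields three single-state steps, so the delta design sends three messages where the combined design sends one.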

But this design brings other issues.

# When a participant gets a message, it may report the new states only after finishing all the
changes. If one of these state transitions takes considerably longer than the others,
the whole process is blocked.
# The controller has less control over how a participant does state transitions. This is a problem
if any policy like Helix State Transition Priority Support needs to be applied.
# On the other hand, the participant needs to check the message and compare states. It's hard
to ensure backward compatibility.

*Helix is not able to calculate the best possible state for every state model*

With dynamic state, we allow the application to manage state transition. So the state model
is not defined with a complete constraint and requirement. Helix cannot calculate the best
possible states.

Moreover, even for a non-dynamic state, the application may want to trigger the transition
based on some external factors. In this case, Helix only coordinates the state transition;
it does not plan the best possible states.

In order to let the user define such states, we need to provide a new state model type. And
Helix should be able to interpret the definition and generate transition messages correctly.

h2. Additional Case Study

h3. Ambry R/W State

In Ambry, a partition has an "R/W" state in addition to the OnlineOffline state.
The "R/W" state indicates whether the partition is read-only or writable.
State transitions may happen as follows.

* The first state transition is conducted by the Ambry application.
* The second one is regular state transition managed by Helix.


Note that the "R/W" state model is still a regular model, which means the states are pre-defined
and the constraints are defined in the same way as for a regular state.

h3. Pinot Version State

In Pinot, when a new version of data is ready, the system replaces old partitions with the
new ones.
If the replacement is done one partition at a time, any read issued during the upgrade
period will get inconsistent data.
Currently, the application needs a workaround for data consistency.

* Option 1: create a new resource with the latest version and replace the old resource after
the new one is ready.
* Option 2: maintain a customized configuration or property store item for managing versions
inside the application.

So the expected state transitions of a Pinot section are as follows.

It would be very helpful to extend Helix state transition system to support multiple state

h2. Proposal

In this document, we propose to extend the existing state transition system in Helix. Basically,
Helix should allow one resource/partition to have more than one state, with the states
managed separately based on different state models.

State transitions shall follow these rules:

* If only one state is changed, the state transition logic stays the same as what we have today.
* States have different priorities. If more than one state is changed, Helix will finish the
transitions one by one based on state model priority. Transition messages are sent one after
another.
* States may have dependencies. If state B depends on state A, a transition on state B will
require state A's information. And if state A is in an error state, the state B transition will be
suspended. Otherwise, independent state transitions will not be blocked by each other.
* If the state is managed by the application, Helix won't calculate the ideal state. The application
needs to specify the desired state in the resource configuration.

h3. State Dependency and Priority

A complete multi-state definition will be a hierarchical system. The states are divided into
different levels. First-level states are the most important ones. There might be additional
second-level or third-level states related to the higher-level states. The states in the same
level are independent of each other.

For example, Admins may set master/slave (MS) state as the first level state. And both R/W
state and Version shall depend on MS state.
That means transitions in R/W state or Version will require MS state as the input. And if
MS state is in error condition, no transition in the other states is allowed.
But R/W state and Version can be changed in parallel.


In addition to dependencies, Admins will be able to specify priorities for all related state
models. Basically, if multiple states are changed concurrently, Helix will process high priority
state transition first. As shown in the following figure, both R/W state and version are the
level 2 states. But if Admins configure version to have higher priority, Helix will schedule
it before R/W state.
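
The dependency and priority rules can be sketched as a small ordering function. This is a minimal sketch under stated assumptions: `ModelDef`, its fields, and `schedule` are hypothetical names for illustration, a larger `priority` value means "scheduled earlier", and a model is suspended whenever the model it depends on is in error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: not a Helix class.
final class TransitionOrderSketch {
  /** Hypothetical description of one state model attached to a resource. */
  static final class ModelDef {
    final String name;
    final String dependsOn; // null for a first-level state model
    final int level;        // 1 = first level, 2 = second level, ...
    final int priority;     // larger value = scheduled earlier within a level

    ModelDef(String name, String dependsOn, int level, int priority) {
      this.name = name;
      this.dependsOn = dependsOn;
      this.level = level;
      this.priority = priority;
    }
  }

  /** Returns the names of the changed models in the order they would be processed. */
  static List<String> schedule(List<ModelDef> changed, Set<String> modelsInError) {
    List<ModelDef> runnable = new ArrayList<>();
    for (ModelDef def : changed) {
      // A dependent state is suspended while the state it depends on is in error.
      boolean suspended = def.dependsOn != null && modelsInError.contains(def.dependsOn);
      if (!suspended) {
        runnable.add(def);
      }
    }
    // Higher levels first, then higher priority first within the same level.
    runnable.sort((a, b) -> a.level != b.level
        ? Integer.compare(a.level, b.level)
        : Integer.compare(b.priority, a.priority));
    List<String> order = new ArrayList<>();
    for (ModelDef def : runnable) {
      order.add(def.name);
    }
    return order;
  }
}
```

With MS at level 1 and both R/W and Version at level 2 depending on MS, giving Version the higher priority produces the order MS, Version, R/W; if MS is in error, neither second-level transition is scheduled.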

h3. Application Managed State and Dynamic State

The nature of a dynamic state makes it an application-managed state by default. However,
not all application-managed states are dynamic states.


If we check the state model definition from different aspects, the differences between regular
state model and new state models are obvious.
Details about dynamic state design, and how to extend current state model interface will be
discussed as a separate topic. In this document, we only consider the simplest design for
supporting the basic features. More information is discussed in the "Design Details" section.

State model	States	Transition Constraint	Next State
Regular state definition	Fixed	State machine	Helix decides new state
Dynamic state definition	Dynamic	Check based on regex, or no check	Application decides new state
Application-managed state definition	Both	Both	Application decides new state

h3. Multiple State Models vs. Single State Model

Shall we use a separate state model for every state, or define one large state model that
is able to handle all state transitions?

* In the first option, all state models are treated equally, so state dependencies
have to be resolved by Helix. But it's easier for application developers to define these
state models.
* In the second option, state relationships can be defined and resolved in the state model
class, so the management logic is simplified. But defining constraints and state transition
rules will be difficult for application developers.

In this design document, we will take the first option for limiting the change and ensuring
backward compatibility. But we may consider the other option in the future.

The whole feature implementation is divided into 2 phases.

# Support secondary states (described in "The First Milestone").
# Fully support multiple states with a hierarchical structure and all features.

h2. The First Milestone

As the first milestone, we plan to add secondary states support as an optional feature.

The reasons we don't implement the whole feature in one step are:

# Limit the change for faster iteration.
# Ensure backward compatibility until a major version upgrade. Legacy participants won't
be able to handle complicated multi-state transition requests.

h3. Secondary States

* The secondary states are configured separately but in the same way as the main state.
* The secondary states shall have different state models to avoid conflict. Also, they should
have different state models from the main state model.
* The secondary states will be level 2 states, while the main state is regarded as the level
1 state. Admins will be able to configure the secondary states as dynamic states. All secondary
states have the same priority.
* Helix doesn't calculate the ideal state for the secondary states. Only an update to the resource
configuration triggers a secondary state transition. The state model can be a regular one
with constraints or a dynamic state model.

The following figure demonstrates the workflow of secondary state registration and transition.
Note that except for transition triggering, the other major steps are the same as in our existing state
transition mechanism.


was (Author: jiajunwang):
h1. Proposal
In this document, we propose to introduce an additional layer of state mechanism into Helix.
Considering the Pinot case, what they need is a transition from "ONLINE:V1" to "ONLINE:V2". Note
that the "V1" to "V2" transition runs in parallel with the existing state transition. It is special
in the following ways:
# The state is not pre-defined. New version numbers may appear after the state transition model
is registered.
# Helix won't understand the internal logic of this additional state. So there is no way for
Helix to compute the ideal state automatically. It will rely on the application's configuration to update
this state.

We will take the above 2 points as assumptions.

As for expected workflow, still take Pinot partition version as an example: 
# Pinot needs to register its own logic for the version upgrade, which means a new state model
(factory name).
# Helix provides an API to configure resources with an additional state ("VERSION").
# When the resource configuration changes, the controller triggers a state transition and sends
a message to the participants.
# Participants handle the message by calling the corresponding state transition methods, then update
the current state.
# The controller listens for current state changes. On any update, it processes and reflects the
update in the external view.
h1. Design
h2. Register Associate States Model / Factory
Note that since associate states may not be pre-defined, defaultTransitionHandler has to
be implemented.
h3. State Model Factory:

public abstract class AssociateStateModelFactory extends StateModelFactory<AssociateStateModel> {
}

public abstract class AssociateStateModel extends StateModel {
  static final String DEFAULT_INITIAL_STATE = "UNKNOWN";
  protected String _currentState = DEFAULT_INITIAL_STATE;

  public String getCurrentState() {
    return _currentState;
  }

  // !!!!!!!!!!! Changed part !!!!!!!!!!!! //
  // "from" and "to" are placeholders: this handler is the fallback for
  // transitions that have no dedicated transition method.
  @Transition(from = "from", to = "to")
  public void defaultTransitionHandler(Message message, NotificationContext context) {
    logger.error("Default transition handler. The idea is to invoke this if no transition method is found. To be implemented");
  }

  // Records the new associate state once a transition has been handled.
  public boolean updateState(String newState) {
    _currentState = newState;
    return true;
  }

  public void rollbackOnError(Message message, NotificationContext context,
      StateTransitionError error) {
    logger.error("Default rollback method invoked on error. Error Code: " + error.getCode());
  }

  public void reset() {
    logger.warn("Default reset method invoked. Either because the process no longer owns this resource or the session timed out");
  }

  @Transition(to = "DROPPED", from = "ERROR")
  public void onBecomeDroppedFromError(Message message, NotificationContext context)
      throws Exception {
    logger.info("Default ERROR->DROPPED transition invoked.");
  }
}

h2. Resource Configuration
h3. Resource config with associate state VERSION:


h2. Additional APIs to configure associate states

/**
 * Set configuration values
 * @param scope
 * @param listProperties
 */
void setConfig(HelixConfigScope scope, Map<String, List<String>> listProperties);

/**
 * Get configuration values
 * @param scope
 * @param keys
 * @return configuration values ordered by the provided keys
 */
Map<String, List<String>> getConfig(HelixConfigScope scope, List<String> keys);

h2. Partition with the Associate States on the Participant State And EV
h3. Current States:

      "ASSOCIATE_STATES":"1.0.1" // Split by ":" if multiple associate states are set

h3. Associate state in External View:

      // Given more than one associate state, they will be split by ":". The main state
will always be the first state.
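
The ":"-separated encoding described above can be sketched as a pair of helpers. This is an illustration only: `CombinedStateStringSketch` and its methods are hypothetical names, and the sketch assumes that no state value itself contains ":".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: not a Helix class. Assumes state values never contain ":".
final class CombinedStateStringSketch {
  static final String SEPARATOR = ":";

  /** Builds the combined value, with the main state always first. */
  static String join(String mainState, List<String> associateStates) {
    List<String> all = new ArrayList<>();
    all.add(mainState);
    all.addAll(associateStates);
    return String.join(SEPARATOR, all);
  }

  /** Returns the main state, which is the first element of the combined value. */
  static String mainState(String combined) {
    return combined.split(SEPARATOR, -1)[0];
  }

  /** Returns the associate states, i.e. everything after the main state. */
  static List<String> associateStates(String combined) {
    String[] parts = combined.split(SEPARATOR, -1);
    return new ArrayList<>(Arrays.asList(parts).subList(1, parts.length));
  }
}
```

A plain value such as "ONLINE" parses to a main state with no associate states, so readers that only look at the first element keep working for resources that never configure associate states.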

h2. Helix Controller Updates
On resource configuration changes:
* Fill ClusterDataCache with associate states and related state models / factories from resource
* Merge associate states to BestPossibleStateOutput.
* Fill associate states and related state models / factories into the message before sending
to participants.

Note that batching all concurrent state changes in one message helps to avoid parallel
state transitions. And if any error happens, processing stops immediately
to avoid further issues. This also means participants should handle multiple state transitions

An alternative design is to send a separate message for each state change. This design
implies that the states have no dependencies, and there is no guarantee that the main state will
be handled before the associate states. It might be helpful in some conditions. But overall,
this alternative design brings more risk than benefit.

On participant state changes:
* Besides existing read, also read and fill associate states. Then fill EV with complete states

h2. Helix Participant Updates
On receiving state transition message:
* Read main state and associate states, trigger state transitions in order.
* Do main state transition first, then do associate states transitions one by one.
** If any state transition fails, set an error state to cover all states and stop processing.
The user should fix the problem and reset to the initial states.
** If state transition succeeds, update current state.
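
The participant-side steps above can be sketched as follows. This is a minimal sketch, not Helix code: `ParticipantTransitionSketch`, `apply`, and the handler callback are hypothetical, and the caller is assumed to list the main state first in the insertion-ordered `targets` map.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiPredicate;

// Illustrative only: not a Helix class.
final class ParticipantTransitionSketch {
  static final String ERROR = "ERROR";

  /**
   * Applies the target states one by one in iteration order (main state
   * first). The handler returns true when the transition of the named state
   * to the given target succeeds. On the first failure every state is set
   * to ERROR and processing stops. The input map is not modified.
   */
  static Map<String, String> apply(Map<String, String> current,
      LinkedHashMap<String, String> targets, BiPredicate<String, String> handler) {
    Map<String, String> updated = new LinkedHashMap<>(current);
    for (Map.Entry<String, String> target : targets.entrySet()) {
      if (!handler.test(target.getKey(), target.getValue())) {
        updated.put(target.getKey(), ERROR);
        updated.replaceAll((name, state) -> ERROR);
        return updated;
      }
      // The transition succeeded, so record the new current state.
      updated.put(target.getKey(), target.getValue());
    }
    return updated;
  }
}
```

Marking every state as ERROR on the first failure mirrors the rule that one error state covers all states until the user resets the partition.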

h1. Alternative options
h2. Introducing UPGRADING State for additional state transitions
Add a new internal state, UPGRADING, for partition upgrades.
The upgrade then happens when the partition transitions "to" or "from" the UPGRADING state.
Note that application has the freedom to define whether UPGRADING is a special online status
or not.
For Pinot case, upgrading partition (even before they are back to ONLINE) might be active
The problem with this new state is that it only works for a single additional state.
Once we have more than one additional state to take care of, the UPGRADING state is not enough.
h2. Rely on resetting partition to load new states
Whenever a new version is available, the application updates the versions for the resource, then resets
all partitions.
Then, during the state transition from offline to online, participants read the new version and
apply it to the related partitions.
The problem with this method is that a change in the additional state affects the main state.
A partition will be offline for a while. During this period, even old version will be not
h2. Application registers message handler to handle upgrading message
In this method, the controller is only responsible for sending upgrade requests to participants.
Participants are responsible for reporting their local versions.
Since the controller has no clue about how to control the additional state, the application
will need to handle all the logic.
h1. Validation
Add unit tests / integration tests to validate associate states.
Verify the Pinot Version use case.

> Extend Helix to Support Resource with Multiple States
> -----------------------------------------------------
>                 Key: HELIX-659
>                 URL: https://issues.apache.org/jira/browse/HELIX-659
>             Project: Apache Helix
>          Issue Type: New Feature
>          Components: helix-core
>    Affects Versions: 0.6.x
>            Reporter: Jiajun Wang
> h1. Problem Statement
> h2. Single State Model vs. Multiple State Models
> Currently, each Helix resource is associated with a single state model, and each replica
of a partition can only be in one of the states defined in the state model at any time.
Helix manages state transitions based on this single state model.
> !https://documents.lucidchart.com/documents/e19ab04e-aa06-4ab3-9e57-cfe273554fa1/pages/0_0?a=2416&x=-11&y=71&w=517&h=198&store=1&accept=image%2F*&auth=LCA%20313ced8fb855e8fc1a7043f7fe91cdfa15fffb6b-ts%3D1498857664!
> However, in many scenarios, resources are too complicated to be modeled by a single
state model.
> As an example, partitions of a resource could be described in different dimensions:
SlaveMaster state, Read or Write state, and version. They represent different dimensions
of the overall resource status. States from each dimension are based on different state models.
Note that the state machines are simplified in this document.
> !https://documents.lucidchart.com/documents/e19ab04e-aa06-4ab3-9e57-cfe273554fa1/pages/0_0?a=2416&x=-71&y=66&w=1822&h=308&store=1&accept=image%2F*&auth=LCA%2041fa743ba130f41786dee3527de6206cebdd4534-ts%3D1498857664!
> The basic idea is that states in these 3 dimensions are in parallel and can be changed
independently. For instance, R/W state may be changed without updating slave/master state.
> h2. Finite State Machine vs. Dynamic State Model
> In addition, Helix employs a finite state machine to define a state model. However, some
state models cannot be easily modeled by a finite state machine with fixed states, for example,
the versions. We call such a state model a dynamic state model. It is read, set, and understood
by the application. We will need to extend Helix to support such dynamic state models. Note
that Helix should not and will not be able to calculate the best possible dynamic states.
> The version of a piece of software is one of the best examples for understanding dynamic state.
> Let's consider one application that is deployed on multiple nodes, which work together
as a cluster. The green node works as the master, and all dark blue nodes are slaves. When
Admins upgrade the service from 1.0.0 to 1.1.0, they need to ensure all nodes are upgraded to
the new version before claiming the upgrade is done. After the upgrade process, it is important
to ensure that all software versions are consistent.
> If the Helix framework is leveraged to support upgrading the cluster, it will help to simplify
application logic and ensure consistency. For instance, the service (cluster) itself is regarded
as the resource, and each node is mapped to a partition. Then upgrading is simply a state
transition. Admins can check the external view to ensure consistency.
> Note that during this version upgrade, the master node is still master node, and slave
nodes are still slave nodes. So the version state is parallel to the other states.
> !https://documents.lucidchart.com/documents/e19ab04e-aa06-4ab3-9e57-cfe273554fa1/pages/0_0?a=2066&x=1466&y=922&w=560&h=455&store=1&accept=image%2F*&auth=LCA%20fa3d8fc0d113a82f4e94b127161cf91818a2fe64-ts%3D1497894598!

This message was sent by Atlassian JIRA
