helix-user mailing list archives

From Kanak Biscuitwala <kana...@hotmail.com>
Subject RE: helix rebalancing for multiple resources
Date Thu, 02 Jan 2014 05:26:00 GMT
Not sure I follow. Is your problem that Helix creates the cluster as a child of the root node
(e.g. /clusterName) while you would like it to be something else (e.g. /path/to/custom/root/clusterName)?
I'm also unclear about what you mean by discovering ZK servers. How would you be able to
leverage a path in ZK to discover ZK?
Right now Helix requires long-running ZK servers and assumes that you as the application know
how to connect to them (i.e. you know the hosts/ports). If that assumption holds, I believe
it should work independent of deployment (cloud provider, private datacenter, or anything
I'm not really sure what you're trying to adapt with the adapter. Could you clarify?
I'm on #apachehelix on freenode if that's more convenient.
Date: Wed, 1 Jan 2014 21:07:36 -0800
Subject: Re: helix rebalancing for multiple resources
From: vusilly@gmail.com
To: kanak.b@hotmail.com
CC: user@helix.incubator.apache.org

Yes, that is helpful.
Another big requirement that I forgot to mention is running this on a cloud service provider
like AWS. We already have a shared ZooKeeper setup there with our own client. Ideally, I could
inject a custom client for Helix to use for operations. The main differences we would
require are a custom top-level path (/appname), which our client mandates, and having the
client handle discovering and connecting to the ZooKeeper servers.
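
For the custom top-level path alone, one thing that may be worth trying is ZooKeeper's own chroot suffix: the ZooKeeper client accepts a connect string of the form host:port,host:port/path and resolves every znode relative to that path. The sketch below only builds such a string. The class name, method name, and host names are hypothetical, and whether Helix passes the connect string through to ZooKeeper unchanged is an assumption that would need testing. It does nothing for the discovery requirement.

```java
import java.util.List;

/** Builds a ZooKeeper connect string with an optional chroot suffix. */
final class ZkConnectString {
    private ZkConnectString() {}

    /**
     * @param hostPorts entries such as "zk1.example.com:2181" (hypothetical hosts)
     * @param chroot    absolute path such as "/appname"; null, "" or "/" means none
     */
    static String build(List<String> hostPorts, String chroot) {
        if (hostPorts.isEmpty()) {
            throw new IllegalArgumentException("at least one host:port is required");
        }
        String servers = String.join(",", hostPorts);
        if (chroot == null || chroot.isEmpty() || chroot.equals("/")) {
            return servers; // no chroot: znodes live under the real root
        }
        if (!chroot.startsWith("/") || chroot.endsWith("/")) {
            throw new IllegalArgumentException("chroot must look like /a/b: " + chroot);
        }
        // The chroot is appended once, after the last server, not per server.
        return servers + chroot;
    }
}
```

With two servers and "/appname" this yields a single string ending in "/appname", so a cluster that Helix creates as /clusterName would physically live at /appname/clusterName.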

Is support for AWS and other cloud providers on the roadmap?
Also, for the short-term, do you see any complications in us creating an adapter client that
helix would use to bridge that gap?  Or would it be much more complicated than I am hoping


On Wed, Jan 1, 2014 at 8:36 PM, Kanak Biscuitwala <kanak.b@hotmail.com> wrote:

Resending since I realized you might not be registered on the user list yet. By the way, for
your specific use case, I would personally lean towards the CustomCodeRunner along with the
CUSTOMIZED IdealState rebalance mode. Then when nodes enter and exit, you can change the IdealState
yourself and Helix will fire the transitions. This will most easily give you the policy-driven
global view you're looking for.
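
As a rough illustration of the policy such custom code might compute before writing the IdealState, here is a minimal plain-Java sketch that gives every task at most one node and every node at most one task. It does not call any Helix API; the class name, method name, and task/node names are hypothetical, and copying the result into the IdealState is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Global "one task per node" placement across all resources. */
final class OneTaskPerNode {
    private OneTaskPerNode() {}

    /**
     * @param tasks     every unit of work, e.g. "A_0", "A_1", "B", "C"
     * @param liveNodes nodes currently in the cluster
     * @param previous  last known task -> node placement (may be empty)
     * @return task -> node; tasks that do not fit are left unassigned
     */
    static Map<String, String> assign(List<String> tasks,
                                      List<String> liveNodes,
                                      Map<String, String> previous) {
        Map<String, String> result = new LinkedHashMap<>();
        Set<String> freeNodes = new LinkedHashSet<>(liveNodes);

        // Pass 1: keep placements whose node is still alive and unclaimed,
        // so that a node joining or leaving moves as few tasks as possible.
        for (String task : tasks) {
            String node = previous.get(task);
            if (node != null && freeNodes.remove(node)) {
                result.put(task, node);
            }
        }

        // Pass 2: hand each remaining task the next free node, if any.
        Deque<String> spare = new ArrayDeque<>(freeNodes);
        for (String task : tasks) {
            if (!result.containsKey(task) && !spare.isEmpty()) {
                result.put(task, spare.poll());
            }
        }
        return result;
    }
}
```

Because one function sees all tasks at once, the "global view" question goes away: there is a single computation rather than one per resource.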


Hi Vu,
Your understanding is basically correct. The controller will rebalance each resource in sequence,
at most one controller pipeline execution is going on at any one time, and there is no parallelism
within the controller pipeline (other than batch reading and writing the cluster at the beginning
and end).
Here are some things that may be of use to know:
1. You can plug in your own code to help decide how to rebalance your cluster in one of two ways:
   - Using the CustomCodeRunner on the participant side so that you can update the IdealState
whenever the cluster changes: https://github.com/apache/incubator-helix/blob/helix-0.6.2-release/helix-core/src/main/java/org/apache/helix/participant/HelixCustomCodeRunner.java?source=c
   - Implementing a Rebalancer with USER_DEFINED rebalance mode: https://github.com/apache/incubator-helix/blob/helix-0.6.2-release/helix-core/src/main/java/org/apache/helix/controller/rebalancer/Rebalancer.java?source=c

In either case, Helix will still fire transitions according to constraints and react to node

2. Helix supports adding tags to nodes (via InstanceConfig), and specifying tags in each resource
IdealState. Then, a tagged resource will only be assigned to nodes with the corresponding
tag present.

3. You can specify max partitions per resource per node in the IdealState of the resource
(this should be 1 in your case)

4. You can combine any of the above 3 if that makes sense (e.g. change node tags whenever
a cluster change happens, thus constraining how Helix will assign everything)
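
The effect of points 2 and 3 together (tags plus a per-node cap) can be modelled in plain Java as below. This is only a sketch of the constraint, not Helix's actual assignment algorithm; the class name, method name, and the first-fit strategy are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** First-fit placement of one resource onto nodes that carry its tag. */
final class TaggedPlacement {
    private TaggedPlacement() {}

    /**
     * @param partitions partition names of one resource
     * @param tag        tag the resource requires
     * @param nodeTags   node name -> tags present on that node
     * @param maxPerNode cap on partitions of this resource per node
     * @return partition -> node; partitions that do not fit are left out
     */
    static Map<String, String> place(List<String> partitions, String tag,
                                     Map<String, Set<String>> nodeTags,
                                     int maxPerNode) {
        // Only nodes carrying the resource's tag are candidates.
        List<String> eligible = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : nodeTags.entrySet()) {
            if (e.getValue().contains(tag)) {
                eligible.add(e.getKey());
            }
        }

        Map<String, Integer> load = new LinkedHashMap<>();
        Map<String, String> placement = new LinkedHashMap<>();
        for (String partition : partitions) {
            for (String node : eligible) {
                int used = load.getOrDefault(node, 0);
                if (used < maxPerNode) { // respect the per-node cap
                    load.put(node, used + 1);
                    placement.put(partition, node);
                    break;
                }
            }
        }
        return placement;
    }
}
```

With a cap of 1 and two nodes tagged for task A, a third partition of A stays unplaced instead of doubling up on a node, which is the behaviour the original requirements ask for.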

Is that helpful?

Kanak

Date: Wed, 1 Jan 2014 20:31:56 -0800
Subject: helix rebalancing for multiple resources
From: vusilly@gmail.com
To: user@helix.incubator.apache.org

We're looking into creating something like a distributed task processing cluster.  We already
have existing code for the processing task on a single host.  So that results in stronger
restrictions on what we're doing:

- partitioned task A: single partition needs to be assigned to a single node and a node may
have only a single partitioned task

- another set of non-partitioned tasks (e.g. B, C, D) also needs to be assigned nodes, but
it would be most efficient if those tasks are assigned to separate nodes so any single node
has at most 1 task (either partitioned A, B, C, D, etc.)

This seems to require a global view of the tasks.  However, from the examples and the Rebalancer
code, it appears that the resource mappings/assignments are independent of one another.
Is that correct?  If so, is Apache Helix the right framework for us, given the requirements

I saw that it might be possible to find the current resource assignment for other resources
during the rebalancing calculation methods, but I was then concerned about concurrency issues--if
the rebalances for task A and task B were computed at the same time.

Thanks for any and all feedback.

Vu Nguyen
