helix-user mailing list archives

From Lance Co Ting Keh <la...@box.com>
Subject Re: General Architecture built around Helix
Date Sun, 23 Jun 2013 21:18:25 GMT
Hi Kishore,

Hope you are having a restful weekend. I was just wondering when I should
normally expect the bug fix to go through?

Thank you very much,

On Tue, Jun 18, 2013 at 1:36 PM, Lance Co Ting Keh <lance@box.com> wrote:

> Thanks Kishore, here is the link to the bug:
> https://issues.apache.org/jira/browse/HELIX-131
> On Tue, Jun 18, 2013 at 9:13 AM, kishore g <g.kishore@gmail.com> wrote:
>> My bad, I didn't realize that you needed HelixAdmin to actually create the
>> cluster.  Please file a bug; the fix is quite simple.
>> thanks,
>> Kishore G
>> On Tue, Jun 18, 2013 at 9:00 AM, Lance Co Ting Keh <lance@box.com> wrote:
>>> Thanks Kishore. Would you like me to file a bug fix for the first
>>> solution?
>>> Also, with the use of the factory, I get the following error message:
>>> [error] org.apache.helix.HelixException: Initial cluster structure is
>>> not set up for cluster: dev-box-cluster
>>> Seems it did not create the appropriate zNodes for me. Was there
>>> something I was supposed to initialize before calling the factory?
>>> Thank you
>>> Lance
>>> On Mon, Jun 17, 2013 at 8:09 PM, kishore g <g.kishore@gmail.com> wrote:
>>>> Hi Lance,
>>>> Looks like we are not setting the connection timeout while connecting
>>>> to zookeeper in ZKHelixAdmin.
>>>> Fix is to change line 99 in ZKHelixAdmin.java from
>>>>   _zkClient = new ZkClient(zkAddress);
>>>> to
>>>>   _zkClient = new ZkClient(zkAddress, timeout * 1000);
>>>> Another workaround is to use HelixManager to get HelixAdmin:
>>>>   manager = HelixManagerFactory.getZKHelixManager(cluster, "Admin",
>>>>       InstanceType.ADMINISTRATOR, zkAddress);
>>>>   manager.connect();
>>>>   admin = manager.getClusterManagmentTool();
>>>> This will wait for 60 seconds before failing.
>>>> Thanks,
>>>> Kishore G
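
The question quoted just below (how to make a connection attempt fail instead of retrying forever) can also be approached from outside Helix. What follows is only a minimal sketch in plain Java with no Helix or ZooKeeper dependency: BoundedConnect and withRetries are hypothetical names, not Helix APIs, and the Callable stands in for whatever call opens the connection.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Helix: runs a connection attempt a fixed
// number of times, each bounded by a timeout, then gives up.
final class BoundedConnect {
    private BoundedConnect() {}

    static <T> T withRetries(Callable<T> connect, int maxAttempts, long timeoutMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        // Daemon threads, so an attempt that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "bounded-connect");
            t.setDaemon(true);
            return t;
        });
        try {
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> future = executor.submit(connect);
                try {
                    return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    future.cancel(true); // ask the overrunning attempt to stop
                    last = e;
                } catch (ExecutionException e) {
                    last = e; // the attempt itself threw; try again
                }
            }
            throw new IOException("gave up after " + maxAttempts + " attempts", last);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With this in place, a call such as BoundedConnect.withRetries(() -> new ZKHelixAdmin(zkAddress), 3, 10000) would surface an IOException after three bounded attempts. Whether the wrapped constructor reacts to the cancellation is an assumption; the helper only guarantees that the caller stops waiting.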
>>>> On Mon, Jun 17, 2013 at 6:15 PM, Lance Co Ting Keh <lance@box.com> wrote:
>>>>> Thank you Kishore. I'll definitely try the memory consumption of one
>>>>> JVM per node.js server first. If it's too much we'll likely do your proposed
>>>>> design but execute kills via the OS. This is to ensure no rogue servers.
>>>>> I have a small implementation question. When calling new ZKHelixAdmin
>>>>> (val admin = new ZKHelixAdmin("")) and the connection fails, it retries
>>>>> again and again infinitely. Is there a method I can override to limit the
>>>>> number of reconnects and just have it fail?
>>>>> Lance
>>>>> On Sun, Jun 16, 2013 at 11:56 PM, kishore g <g.kishore@gmail.com> wrote:
>>>>>> Hi Lance,
>>>>>> Looks good to me. Having a JVM per node.js server might add
>>>>>> additional overhead; you should definitely run this with production
>>>>>> configuration and ensure that it does not impact performance. If you
>>>>>> find it consuming too many resources, you can probably try this approach.
>>>>>>    1. Have one agent per node
>>>>>>    2. Instead of creating a separate helix agent per node.js, you
>>>>>>    can create multiple participants within the same agent. Each
>>>>>>    will represent a node.js process.
>>>>>>    3. The monitoring of participant LIVEINSTANCES and killing of
>>>>>>    node.js processes can be done by one of the helix agents. You create
>>>>>>    another resource using the leader-standby model. Only one helix agent
>>>>>>    will be the leader and it will monitor the LIVEINSTANCES, and if any
>>>>>>    Helix agent dies it can ask the node.js server to kill itself (you can
>>>>>>    use http or any other mechanism of your choice). The idea here is to
>>>>>>    designate one leader in the system to ensure that helix-agent and
>>>>>>    node.js act like a pair.
>>>>>> You can try this only if you find that the overhead of the JVM is
>>>>>> significant with the approach you have listed.
>>>>>> Thanks,
>>>>>> Kishore G
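
The leader's check in step 3 above comes down to a set difference: any node.js server whose paired agent is missing from the live instances should be told to stop. A minimal sketch of that comparison, assuming the leader already holds the agent-to-server pairing and the current live instance names; OrphanFinder and its parameters are hypothetical, and nothing here calls into Helix or ZooKeeper.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the leader's check; all names are illustrative.
final class OrphanFinder {
    private OrphanFinder() {}

    /**
     * @param agentToServer agent instance name -> node.js server it is paired with
     * @param liveInstances agent instance names currently listed as live
     * @return node.js servers whose paired agent is no longer live, sorted
     */
    static Set<String> orphanedServers(Map<String, String> agentToServer,
                                       Set<String> liveInstances) {
        Set<String> orphans = new TreeSet<>();
        for (Map.Entry<String, String> pair : agentToServer.entrySet()) {
            if (!liveInstances.contains(pair.getKey())) {
                orphans.add(pair.getValue());
            }
        }
        return orphans;
    }
}
```

The leader would then contact each returned server over HTTP, or any other channel, and ask it to shut down.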
>>>>>> On Fri, Jun 14, 2013 at 8:37 PM, Lance Co Ting Keh <lance@box.com> wrote:
>>>>>>> Thank you for your advice, Santiago. That is certainly part of the
>>>>>>> design as well.
>>>>>>> Best,
>>>>>>> Lance
>>>>>>> On Fri, Jun 14, 2013 at 5:32 PM, Santiago Perez <
>>>>>>> santip@santip.com.ar> wrote:
>>>>>>>> Helix user here (not developer) so take my words with a grain of
>>>>>>>> salt.
>>>>>>>> Regarding 6, you might want to consider the behavior of the node.js
>>>>>>>> instance if that instance loses connection to zk. You'll probably want to
>>>>>>>> kill it too, otherwise you could ignore the fact that the JVM lost the
>>>>>>>> connection too.
>>>>>>>> Regards,
>>>>>>>> Santiago
>>>>>>>> On Fri, Jun 14, 2013 at 6:30 PM, Lance Co Ting Keh <lance@box.com> wrote:
>>>>>>>>> We have a working prototype of basically something like #2 you
>>>>>>>>> proposed above. We're using the standard helix participant, and on the
>>>>>>>>> @Transitions of the state model send commands to node.js via Http.
>>>>>>>>> I want to run you through our general architecture to make sure we
>>>>>>>>> are not violating anything on the Helix side. As a reminder, what we need
>>>>>>>>> to guarantee is that at any given time one and only one node.js process is
>>>>>>>>> in charge of a task.
>>>>>>>>> 1. A machine with N cores will have N (pending testing) node.js
>>>>>>>>> processes running
>>>>>>>>> 2. Associated with each of the N node processes are also N Helix
>>>>>>>>> participants (separate JVM instances -- reason for this to come later)
>>>>>>>>> 3. Separate helix controller will be running on the machine and
>>>>>>>>> will just leader elect between machines.
>>>>>>>>> 4. The spectator router will likely be HAProxy and thus a linux
>>>>>>>>> kernel will run JVM to serve as Helix spectator
>>>>>>>>> 5. The state machine for each will simply be ONLINEOFFLINE
>>>>>>>>> (however I do get error messages that say that I haven't defined an OFFLINE
>>>>>>>>> to DROPPED mode; I was going to ask you this but this is a minor detail
>>>>>>>>> compared to the rest of the architecture)
>>>>>>>>> 5. Simple Bash script will serve as a watch dog on each node.js
>>>>>>>>> and helix participant pair. If either of the two is "dead" the other process
>>>>>>>>> must immediately be SIGKILLED, hence the need for one JVM serving as Helix
>>>>>>>>> Participant for every Node.js
>>>>>>>>> 6. Each node.js instance sets a watch on /LIVEINSTANCES
>>>>>>>>> to zookeeper as an extra safety blanket. If it finds that it is NOT in the
>>>>>>>>> liveinstances it likely means that its JVM participant lost its connection
>>>>>>>>> to Zookeeper, but the process is still running so the bash script has not
>>>>>>>>> terminated the node server. In this case the node server must end its own
>>>>>>>>> process.
>>>>>>>>> Thank you for all your help.
>>>>>>>>> Sincerely,
>>>>>>>>> Lance
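
The watchdog rule in the message above (if either half of a node.js and participant pair is dead, the other must be killed) is small enough to write down. The thread uses a Bash script for this; the following is only a hypothetical Java rendering of the same rule, assuming Java 9+ for ProcessHandle and assuming the two process ids are known to the caller. PairWatchdog and its methods are illustrative names.

```java
// Hypothetical Java rendering of the watchdog rule; the thread itself uses Bash.
final class PairWatchdog {
    enum Action { NONE, KILL_NODE, KILL_AGENT }

    private PairWatchdog() {}

    // Pure decision: if exactly one side of the pair is dead, the survivor must go.
    static Action decide(boolean nodeAlive, boolean agentAlive) {
        if (nodeAlive && !agentAlive) return Action.KILL_NODE;
        if (!nodeAlive && agentAlive) return Action.KILL_AGENT;
        return Action.NONE;
    }

    // A pid that no longer maps to a process counts as dead.
    static boolean isAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    // One polling step; a real watchdog would call this in a loop with a short sleep.
    static Action enforce(long nodePid, long agentPid) {
        Action action = decide(isAlive(nodePid), isAlive(agentPid));
        if (action != Action.NONE) {
            long victim = (action == Action.KILL_NODE) ? nodePid : agentPid;
            // destroyForcibly requests an immediate, non-graceful termination.
            ProcessHandle.of(victim).ifPresent(ProcessHandle::destroyForcibly);
        }
        return action;
    }
}
```

Keeping decide separate from the process handling makes the rule itself easy to check without spawning anything.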
>>>>>>>>> On Wed, Jun 12, 2013 at 9:07 PM, kishore g <g.kishore@gmail.com> wrote:
>>>>>>>>>> Hi Lance,
>>>>>>>>>> Thanks for your interest in Helix. There are two approaches.
>>>>>>>>>> 1. Similar to what you suggested: Write a Helix Participant in a
>>>>>>>>>> non-jvm language, which in your case is node.js. There seem to be quite a
>>>>>>>>>> few implementations in node.js that can interact with zookeeper. Helix
>>>>>>>>>> participant does the following (you got it right but I am providing the
>>>>>>>>>> right sequence)
>>>>>>>>>>    1. Create an ephemeral node under LIVEINSTANCES
node for
>>>>>>>>>>    transitions
>>>>>>>>>>    3. After transition is completed it updates
>>>>>>>>>> Controller is doing most of the heavy lifting of ensuring that
>>>>>>>>>> these transitions lead to the desired configuration. It's quite easy to
>>>>>>>>>> re-implement this in any other language; the most difficult thing would be
>>>>>>>>>> the zookeeper binding. We have used the java bindings and they are solid.
>>>>>>>>>> This is at a very high level; there are some more details I have
>>>>>>>>>> left out, like handling connection loss/session expiry etc., that will
>>>>>>>>>> require some thinking.
>>>>>>>>>> 2. The other option is to use the Helix-agent as a proxy: We
>>>>>>>>>> added Helix agent as part of 0.6.1, we haven't documented it yet. Here is
>>>>>>>>>> the gist of what it does. Think of it as a generic state transition
>>>>>>>>>> handler. You can configure Helix to run a specific system command as part
>>>>>>>>>> of each transition. Helix agent is a separate process that runs alongside
>>>>>>>>>> your actual process. Instead of the actual process getting the transition,
>>>>>>>>>> Helix Agent gets the transition. As part of this transition the Helix agent
>>>>>>>>>> can invoke APIs on the actual process via RPC, HTTP etc. Helix agent
>>>>>>>>>> simply acts as a proxy to the actual process.
>>>>>>>>>> I have another approach and will try to write it up tonight, but
>>>>>>>>>> before that I have a few questions
>>>>>>>>>>    1. How many node.js servers run on each node: one or >1
>>>>>>>>>>    2. Spectator/router is java or non java based
>>>>>>>>>>    3. Can you provide more details about your state
>>>>>>>>>> thanks,
>>>>>>>>>> Kishore G
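
The proxy idea in option 2 above (the agent receives the transition and relays it to the real process over HTTP) could look roughly like the sketch below. It assumes Java 11+ for java.net.http and does not use the Helix agent itself: TransitionProxy is a hypothetical class, and the /transition/<from>/<to> path is an invented convention that the node.js server would have to implement.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical proxy: turns a state transition into an HTTP POST against the
// real process. The URL layout is an assumption, not a Helix convention.
final class TransitionProxy {
    private final HttpClient client = HttpClient.newHttpClient();
    private final URI base;

    TransitionProxy(URI base) {
        this.base = base;
    }

    // Builds the request without sending it, so the mapping can be inspected.
    HttpRequest requestFor(String fromState, String toState) {
        URI target = base.resolve("/transition/" + fromState + "/" + toState);
        return HttpRequest.newBuilder(target)
                .timeout(Duration.ofSeconds(5)) // do not let a hung server block the transition
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // Returns true when the process acknowledged the transition with a 2xx status.
    boolean forward(String fromState, String toState)
            throws IOException, InterruptedException {
        HttpResponse<Void> response = client.send(
                requestFor(fromState, toState), HttpResponse.BodyHandlers.discarding());
        return response.statusCode() / 100 == 2;
    }
}
```

A state model callback for OFFLINE to ONLINE would then call forward("OFFLINE", "ONLINE") and treat a false result or an exception as a failed transition.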
>>>>>>>>>> On Wed, Jun 12, 2013 at 11:07 AM, Lance Co Ting Keh <lance@box.com> wrote:
>>>>>>>>>>> Hi my name is Lance Co Ting Keh and I work at Box. You guys did
>>>>>>>>>>> a tremendous job with Helix. We are looking to use it to manage a cluster
>>>>>>>>>>> primarily running Node.js. Our model for using Helix would be
>>>>>>>>>>> to have node.js or some other non-JVM library be *Participants*,
>>>>>>>>>>> a router as a *Spectator* and another set of machines to serve
>>>>>>>>>>> as the *Controllers* (pending testing we may just run
>>>>>>>>>>> master-slave controllers on the same instances as the Participants). The
>>>>>>>>>>> participants will be interacting with Zookeeper in two ways, one is to
>>>>>>>>>>> receive helix state transition messages through the instance of the
>>>>>>>>>>> HelixManager <Participant>, and another is to directly interact with
>>>>>>>>>>> Zookeeper just to maintain ephemeral nodes within /INSTANCES. Maintaining
>>>>>>>>>>> ephemeral nodes directly to Zookeeper would be done instead of using
>>>>>>>>>>> InstanceConfig and calling addInstance on HelixAdmin because of the basic
>>>>>>>>>>> health checking baked into maintaining ephemeral nodes. If not we would
>>>>>>>>>>> then have to write a health checker from Node.js and the JVM running the
>>>>>>>>>>> Participant. Are there better alternatives for non-JVM Helix participants?
>>>>>>>>>>> I corresponded with Kishore briefly and he mentioned
>>>>>>>>>>> specifically ProcessMonitorThread that came out in the last release.
>>>>>>>>>>> Thank you very much!
>>>>>>>>>>>  Lance Co Ting Keh
