incubator-s4-user mailing list archives

From Frank Zheng <bearzheng2011@gmail.com>
Subject Re: S4 Communication Layer using ZooKeeper
Date Tue, 09 Oct 2012 03:14:34 GMT
Hi,

Could anyone explain in detail what is meant by "The topology is written to
ZooKeeper, and each running node watches the ZooKeeper node(s) to keep track
of any changes"?

Without any standby nodes, could ZooKeeper start a new node when an active
node fails?

Thanks.

On Wed, Oct 3, 2012 at 9:27 PM, Matthieu Morel <mmorel@apache.org> wrote:

> We rely on notifications from ZooKeeper to detect node failures. Typically
> you would use extra standby nodes, which will pick up the unassigned
> partitions upon failure.
>
> If you want to restart the same node, you could also do that automatically
> (with something like daemontools, for instance), but as Kishore mentioned,
> there is no guarantee that it will pick the same partition as before.
>
> Anyway, if a node is considered failed, it might also be due to a hardware
> failure or a network partition, so it might not make sense to restart the
> same S4 node on the same machine.
>
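> To make the notification flow concrete, here is a minimal, purely
> illustrative sketch using the plain ZooKeeper Java API. The path
> /clustername/process and the class name are assumptions for illustration,
> not the actual S4 code:
>
> import java.util.List;
> import org.apache.zookeeper.WatchedEvent;
> import org.apache.zookeeper.Watcher;
> import org.apache.zookeeper.ZooKeeper;
>
> // Illustrative sketch: watch the ephemeral znodes of the active nodes,
> // so a standby gets notified when one of them disappears (node failure).
> public class FailureWatcher implements Watcher {
>     private final ZooKeeper zk;
>
>     public FailureWatcher(ZooKeeper zk) { this.zk = zk; }
>
>     public void watchActiveNodes() throws Exception {
>         // getChildren registers this watcher; a watch fires only once,
>         // so we re-register after every event by calling this again.
>         List<String> active = zk.getChildren("/clustername/process", this);
>         System.out.println("Active nodes: " + active);
>     }
>
>     @Override
>     public void process(WatchedEvent event) {
>         if (event.getType() == Event.EventType.NodeChildrenChanged) {
>             try {
>                 watchActiveNodes(); // re-read the children, re-arm watch
>             } catch (Exception e) {
>                 e.printStackTrace();
>             }
>         }
>     }
> }
>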
> More details about fault tolerance can be found here:
> https://cwiki.apache.org/confluence/display/S4/Fault+tolerance+in+S4
>
>
> Hope this helps,
>
> Matthieu
>
>
> On 10/3/12 9:59 AM, Frank Zheng wrote:
>
>> Hi Matthieu,
>>
>> If there is no "retry queue", is it possible to automatically restart a
>> node that previously died?
>>
>> Thanks
>> Frank
>>
>> On Wed, Oct 3, 2012 at 3:45 PM, Matthieu Morel <mmorel@apache.org> wrote:
>>
>>     Hi,
>>
>>     Actually, there is currently no "retry queue" (we had a prototype
>>     mechanism but removed it due to implementation issues). Therefore,
>>     if there is a node failure, you might lose some messages until a
>>     failover node is reassigned to the corresponding partition and the
>>     sender nodes are notified of this assignment.
>>
>>     However, even though you might lose some events during failover,
>>     you can recover the previous state through the checkpointing
>>     mechanism.
>>
>>     Regards,
>>
>>
>>     Matthieu
>>
>>
>>
>>     On 10/3/12 8:37 AM, Frank Zheng wrote:
>>
>>         Hi Kishore,
>>
>>         Could you point out which configuration parameter controls the
>>         retry queue size, e.g. its name?
>>
>>         Thanks.
>>         Frank
>>
>>         On Wed, Oct 3, 2012 at 2:30 PM, kishore g <g.kishore@gmail.com> wrote:
>>
>>              Yes, you can automatically restart nodes that previously
>>              died. But keep in mind that events may be dropped for that
>>              partition until the node is restarted. I think we have some
>>              configuration for the retry queue size, and you can set it
>>              based on how long it would take to automatically restart
>>              the node. Make sure there is enough memory on all nodes to
>>              keep the events in the queue until the node comes back up.
>>
>>              Another thing to keep in mind: if multiple nodes fail and
>>              you restart them, they are not guaranteed to pick up the
>>              partitions they had picked up earlier.
>>
>>              On Tue, Oct 2, 2012 at 11:19 PM, Frank Zheng <bearzheng2011@gmail.com> wrote:
>>
>>                  Hi Kishore,
>>
>>                  This explains it very clearly. Thank you very much!
>>
>>                  Now I have another question.
>>                  When one active node dies, the standby node tries to
>>                  grab the lock. What if no standby nodes are allowed?
>>                  Under this assumption, is it possible to automatically
>>                  restart the node that previously died?
>>
>>                  Thanks.
>>                  Frank
>>
>>                  On Wed, Oct 3, 2012 at 12:51 PM, kishore g <g.kishore@gmail.com> wrote:
>>
>>                      At a very high level, this is how cluster
>>                      management works.
>>
>>                      Each S4 cluster has a reserved namespace,
>>                      /clustername, in ZooKeeper. There is an initial
>>                      setup process where one or more znodes are created
>>                      under /clustername/tasks. When nodes join the
>>                      cluster, they check whether someone has already
>>                      claimed a task by looking at /clustername/process/;
>>                      if not, they grab the lock by creating an ephemeral
>>                      znode under /clustername/process/. If all tasks are
>>                      taken, the node becomes a standby node. When any
>>                      active node dies, the standby node gets notified
>>                      and tries to grab the lock.
>>
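>>                      As a rough illustration of that locking step,
>>                      here is a minimal, hypothetical Java sketch using
>>                      the plain ZooKeeper client API. The paths match
>>                      the ones above, but the class and method names
>>                      are just for illustration, not S4's actual
>>                      implementation:
>>
>>         import java.util.List;
>>         import org.apache.zookeeper.CreateMode;
>>         import org.apache.zookeeper.KeeperException;
>>         import org.apache.zookeeper.ZooDefs;
>>         import org.apache.zookeeper.ZooKeeper;
>>
>>         // Hypothetical sketch: try to claim one of the tasks listed
>>         // under /clustername/tasks by creating an ephemeral znode
>>         // under /clustername/process/. The ephemeral znode disappears
>>         // automatically if this node's session dies, freeing the task
>>         // for a standby node to claim.
>>         public class TaskClaimer {
>>             // Returns the claimed task name, or null if all tasks are
>>             // taken (in which case this node becomes a standby).
>>             static String claimTask(ZooKeeper zk) throws Exception {
>>                 List<String> tasks =
>>                     zk.getChildren("/clustername/tasks", false);
>>                 for (String task : tasks) {
>>                     try {
>>                         zk.create("/clustername/process/" + task,
>>                                   new byte[0],
>>                                   ZooDefs.Ids.OPEN_ACL_UNSAFE,
>>                                   CreateMode.EPHEMERAL);
>>                         return task; // lock grabbed
>>                     } catch (KeeperException.NodeExistsException e) {
>>                         // someone else already claimed this task
>>                     }
>>                 }
>>                 return null; // all taken: wait as a standby
>>             }
>>         }
>>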
>>                      We can provide more details if you let us know
>>                      which aspect of the cluster management mechanism
>>                      you are interested in.
>>
>>                      Thanks,
>>                      Kishore G
>>
>>                      On Tue, Oct 2, 2012 at 9:17 PM, Frank Zheng <bearzheng2011@gmail.com> wrote:
>>
>>                          Hi All,
>>
>>                          I am exploring the cluster management
>>                          mechanism and fault tolerance of S4. I saw
>>                          that S4 uses ZooKeeper in the communication
>>                          layer, but this is not explained very clearly
>>                          in the paper "S4: Distributed Stream Computing
>>                          Platform". I tried to find the reference "[15]
>>                          Communication layer using ZooKeeper, Yahoo!
>>                          Inc. Tech. Rep., 2009", but it is not
>>                          available. Could anyone explain the role of
>>                          ZooKeeper in S4, and the cluster management
>>                          mechanism, in detail?
>>
>>                          Thanks.
>>
>>                          Sincerely,
>>                          Frank
>>
>>
>>                  --
>>                  Sincerely,
>>                  Zheng Yu
>>                  Mobile: (852) 60670059
>>                  Email: bearzheng2011@gmail.com
>>
>>
>>         --
>>         Sincerely,
>>         Zheng Yu
>>         Mobile: (852) 60670059
>>         Email: bearzheng2011@gmail.com
>>
>> --
>> Sincerely,
>> Zheng Yu
>> Mobile: (852) 60670059
>> Email: bearzheng2011@gmail.com
>


-- 
Sincerely,
Zheng Yu
Mobile: (852) 60670059
Email: bearzheng2011@gmail.com
