zookeeper-user mailing list archives

From Alexander Shraer <shra...@gmail.com>
Subject Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)
Date Sun, 03 Jan 2016 08:45:40 GMT
for 1, see the Chubby paper, section 2.4.

for 2, I'm not sure I fully understand the question, but essentially, ZK
guarantees that the consistency of updates is preserved even during
failures. The user doesn't need to do anything in particular to guarantee
this, even during leader failures. In such a case, some suffix of the
operations executed by the leader may be lost if they weren't previously
acked by a majority. However, none of those operations could have been
visible to reads.

On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <powellm79@yahoo.com.invalid
> wrote:

> Hi Janmejay,
> Regarding question 1: if a node takes a lock and the lock has timed out
> from the system's perspective, then other nodes are free to take the lock
> and work on the resource. Hence the history could be well into the future
> by the time the previous holder discovers the time-out. Whether rollback
> is needed in that context depends on the implementation details: is the
> lock holder updating some common area? Then there could be corruption,
> since other nodes are free to write to it in parallel with the first one.
> In the usual sense, a time-out of a held lock means the node which held
> the lock is dead. It is up to the implementation to ensure that this is
> the case; with that guarantee, a time-out means the other nodes can be
> sure no one else is working on the resource and hence can move forward.
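> One common way for an implementation to "ensure this case" is a fencing
> token: the common area rejects any write carrying a token older than the
> newest one it has seen, so an expired holder can no longer corrupt state.
> A minimal self-contained sketch (the FencedResource class and plain
> numeric tokens here are illustrative assumptions, not ZK API; in
> ZooKeeper the lock znode's version or czxid can play the token's role):

```java
// Hypothetical in-memory resource guarded by a fencing token.
public class FencedResource {
    private long highestToken = 0;
    private String data = "";

    // Reject any write whose token is older than one already seen:
    // a previous holder whose lock expired is fenced off.
    public synchronized boolean write(long token, String value) {
        if (token < highestToken) {
            return false;           // stale holder, write refused
        }
        highestToken = token;
        data = value;
        return true;
    }

    public synchronized String read() {
        return data;
    }

    public static void main(String[] args) {
        FencedResource r = new FencedResource();
        System.out.println(r.write(1, "from holder 1")); // true
        System.out.println(r.write(2, "from holder 2")); // true, newer token
        System.out.println(r.write(1, "late write"));    // false: fenced off
        System.out.println(r.read());                    // from holder 2
    }
}
```

> The point is that correctness then lives in the resource, not in the
> timing: a holder that finished its computation after its lock expired
> simply has its commit refused.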
> Question 2 seems to assume that the leader has significant work to do and
> that leader change is quite common, which seems contrary to the common
> implementation pattern. If the work can be broken down into smaller
> chunks, each of which needs to be serialized separately, then each
> chunk/work type can have a different leader.
> For question 3, ZK does support auth and encryption for client
> connections, but not for inter-ZK-node channels. Do you have a
> requirement to secure the inter-ZK channels? Can you let us know what
> your requirements are, so we can implement a solution that fits all
> needs?
> For question 4, the official implementation is the C client; people tend
> to wrap it with C++. There are projects using ZK that do this; you can
> look them up and see if you can separate the wrapper out and reuse it.
> Hope this helps.
> Powell.
>     On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <
> edward.capriolo@huffingtonpost.com> wrote:
> Q: What is the best way of handling distributed-lock expiry? The owner
> of the lock managed to acquire it and may be in the middle of some
> computation when the session or lock expires.
> If you are using Java, one way I can see of doing this is with
> ExecutorCompletionService:
> https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
> It allows you to keep your workers in a group; you can poll the group and
> provide cancel semantics as needed.
> An example of that service is here:
> https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java
> where I am issuing multiple reads and I want to abandon the process if they
> do not timeout in a while. Many async/promices frameworks do this by
> launching two task ComputationTask and a TimeoutTask that returns in 10
> seconds. Then they ask the completions service to poll. If the service is
> given the TimoutTask after the timeout that means the Computation did not
> finish in time.
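> A minimal sketch of that two-task pattern (the task names and the
> runWithDeadline helper are illustrative, not from any particular
> framework):

```java
import java.util.concurrent.*;

public class TwoTaskTimeout {
    // Submit the computation and a timeout task to the same completion
    // service; whichever finishes first is what take() hands back.
    static String runWithDeadline(Callable<String> computation, long millis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            CompletionService<String> cs = new ExecutorCompletionService<>(pool);
            Future<String> work = cs.submit(computation);
            cs.submit(() -> { Thread.sleep(millis); return "TIMEOUT"; });
            String first = cs.take().get();  // first task to complete wins
            work.cancel(true);               // abandon the computation if it lost
            return first;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Fast computation beats the 500 ms timeout task.
        System.out.println(runWithDeadline(() -> "RESULT", 500));
        // Slow computation loses; the timeout task returns first.
        System.out.println(runWithDeadline(
                () -> { Thread.sleep(1000); return "RESULT"; }, 50));
    }
}
```

> Getting "TIMEOUT" back is the signal that the lock holder's work did not
> complete in time and should be abandoned or reversed.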
> Do people generally take action in the middle of the computation (abort
> it and do it in a clever way such that the effect appears atomic, so the
> abort is not really visible; if so, what are some of those clever ways)?
> The base issue is that Java's synchronized/AtomicReference do not have a
> rollback.
> There are a few ways I know to work around this. Clojure has STM
> (Software Transactional Memory), such that if an exception is thrown
> inside a dosync, all of the steps inside the critical block never
> happened. This assumes you are using all-Clojure structures, which you
> probably are not.
> A way coworkers have done this is as follows. Move your entire
> transactional state into a SINGLE big object that you can
> copy/mutate/compare-and-swap. You never need to roll back each piece,
> because you are changing the clone up until the point you commit it.
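> That single-big-object approach can be sketched with an AtomicReference
> over an immutable state class (the State fields here are made up for
> illustration):

```java
import java.util.concurrent.atomic.AtomicReference;

public class BigStateCas {
    // All transactional state lives in one immutable object.
    static final class State {
        final int counter;
        final String owner;
        State(int counter, String owner) {
            this.counter = counter;
            this.owner = owner;
        }
    }

    private final AtomicReference<State> state =
            new AtomicReference<>(new State(0, "nobody"));

    // Classic retry loop: read, build a modified copy, swap. There is
    // nothing to roll back, because the shared reference is untouched
    // until compareAndSet succeeds.
    public State increment(String newOwner) {
        while (true) {
            State cur = state.get();
            State next = new State(cur.counter + 1, newOwner);
            if (state.compareAndSet(cur, next)) {
                return next;
            }
        }
    }

    public State current() {
        return state.get();
    }

    public static void main(String[] args) {
        BigStateCas s = new BigStateCas();
        s.increment("a");
        s.increment("b");
        System.out.println(s.current().counter + " " + s.current().owner); // 2 b
    }
}
```

> An abort is then just discarding the clone and never publishing it, so
> the failed attempt is invisible to readers.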
> Writing reversal code is possible depending on the problem. There are
> questions to ask like "what if the reversal somehow fails?"
> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <singh.janmejay@gmail.com>
> wrote:
> > Hi,
> >
> > Was wondering if there are any reference designs, patterns on handling
> > common operations involving distributed coordination.
> >
> > I have a few questions and I guess they must have been asked before, I
> > am unsure what to search for to surface the right answers. It'll be
> > really valuable if someone can provide links to relevant
> > "best-practices guide" or "suggestions" per question or share some
> > wisdom or ideas on patterns to do this in the best way.
> >
> > 1. What is the best way of handling distributed-lock expiry? The owner
> > of the lock managed to acquire it and may be in middle of some
> > computation when the session expires or lock expires. When it finishes
> > that computation, it can tell that the lock expired, but do people
> > generally take action in the middle of the computation (abort it and
> > do it in a clever way such that the effect appears atomic, so the
> > abort is not really visible; if so, what are some of those clever
> > ways)? Or is the right thing to do to write reversal-code, such that
> > operations can be cleanly undone in case the verification at the end
> > of the computation shows that the lock expired? The latter is
> > obviously a lot harder to achieve.
> >
> > 2. Same as above for leader-election scenarios. The leader generally
> > administers operations on data-systems that take significant time to
> > complete and have significant resource overhead, and RPC to administer
> > such operations synchronously from leader to data-node can't be atomic,
> > nor can it be made latency-resilient to such a degree that issuing an
> > operation across a large set of nodes in a cluster can be guaranteed
> > to finish without a leader-change. What do people generally do in such
> > situations? How are timeouts handled when operations are issued using
> > a sequential-znode as a per-datanode dedicated queue? How well does it
> > scale, and what are some things to watch out for (operation-size,
> > encoding, clustering into one znode for atomicity, etc.)? Or how are
> > atomic operations that need to be issued across multiple data-nodes
> > managed (do they have to be clobbered into one znode)?
> >
> > 3. How do people secure zookeeper based services? Is
> > client-certificate-verification the recommended way? How well does
> > this work with C client? Is inter-zk-node communication done with
> > X509-auth too?
> >
> > 4. What other projects, reference-implementations or libraries should
> > I look at for working with C client?
> >
> > Most of what I have asked revolves around the leader or lock-owner
> > having a false-failure (where it doesn't know that the coordinator
> > thinks it has failed).
> >
> > --
> > Regards,
> > Janmejay
> > http://codehunk.wordpress.com
> >
