zookeeper-user mailing list archives

From "singh.janmejay" <singh.janme...@gmail.com>
Subject Re: Best-practice guides on coordination of operations in distributed systems (and some C client specific questions)
Date Wed, 13 Jan 2016 18:26:16 GMT
Yes, that will depend on the way idempotency is implemented. The way I plan
to implement it is by using a monotonically increasing operation-id. Any
operation with an id lower than the last executed op-id will be identified as
stale and will not be executed. Because only one op is executed at a time, and
ops are executed in absolute order, staleness detection at the op-id level is
sufficient.
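
To make that concrete, here is a minimal sketch of the check on the
external-system side (class and field names are just illustrative; in practice
lastAppliedOpId would have to be persisted together with the op's side effect):

    // Hypothetical sketch of op-id based staleness detection on the external system.
    // Assumes a single master issues ops in op-id order, one at a time.
    public class OpExecutor {
        private long lastAppliedOpId = -1;
        private final java.util.function.Consumer<byte[]> apply; // the real side effect

        public OpExecutor(java.util.function.Consumer<byte[]> apply) { this.apply = apply; }

        public synchronized void execute(long opId, byte[] payload) {
            if (opId <= lastAppliedOpId) {
                return;                    // stale or duplicate op: already acted upon, just ack
            }
            apply.accept(payload);         // perform the actual work
            lastAppliedOpId = opId;        // advance the high-water mark
        }
    }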

--
Regards,
Janmejay

PS: Please blame the typos in this mail on my phone's uncivilized soft
keyboard sporting its not-so-smart-assist technology.

On Jan 13, 2016 11:49 PM, "Alexander Shraer" <shralex@gmail.com> wrote:

> I may be wrong but I don't think that being idempotent gives you what you
> said. Just because f(f(x))=f(x) doesn't mean that f(g(f(x))) = g(f(x)) --
> this was my example. But if your system can detect that X was already
> executed (or if the operations are conditional on state), my scenario indeed
> can't happen.
>
>
> On Wed, Jan 13, 2016 at 2:08 AM, singh.janmejay <singh.janmejay@gmail.com>
> wrote:
>
> > @Alexander: In that scenario, the write of X will be attempted by A, but
> > the external system will not act upon write-X because that operation has
> > already been acted upon in the past. This is guaranteed by the
> > idempotent-operations invariant. But it does point out another problem,
> > which I hadn't handled in my original algorithm. Problem: if X and Y have
> > both not been issued yet, and Y is issued before X towards the external
> > system, then because neither operation has executed yet, the stale X will
> > overwrite Y. I need another constraint: the master should issue only one
> > operation to a given external system at a time, and must issue operations
> > in the order of operation-id (the sequential-znode sequence number). So we
> > need the following invariants (a sketch follows the list):
> > - order of issuing operations is fixed (matching the order of creation of
> > the operations)
> > - concurrency of operations is fixed at 1
> > - idempotent execution on the external-system side
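> >
> > A rough sketch of the master's issue loop under these invariants (plain
> > ZooKeeper Java API; ExternalSystemClient and the /ops layout are made-up
> > names, just for illustration):
> >
> >     import java.util.Collections;
> >     import java.util.List;
> >     import org.apache.zookeeper.ZooKeeper;
> >
> >     public class OpIssuer {
> >         private final ZooKeeper zk;
> >         private final ExternalSystemClient rpc;   // hypothetical RPC client
> >
> >         public OpIssuer(ZooKeeper zk, ExternalSystemClient rpc) {
> >             this.zk = zk;
> >             this.rpc = rpc;
> >         }
> >
> >         public void drain(String opsPath) throws Exception {
> >             List<String> ops = zk.getChildren(opsPath, false);
> >             Collections.sort(ops);       // sequential suffix encodes creation order
> >             for (String op : ops) {      // concurrency 1: next op only after this one retires
> >                 String path = opsPath + "/" + op;
> >                 byte[] payload = zk.getData(path, false, null);
> >                 rpc.issue(op, payload);  // must be idempotent on the external-system side
> >                 zk.delete(path, -1);     // retire the op once it has been acted upon
> >             }
> >         }
> >     }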
> >
> > @Powell: I'm kind of doing the same thing, except the loop doesn't run on
> > the consumer; instead it runs on the master, which is assigning work to
> > consumers. So the triggerWork function basically becomes issueWork, which
> > is RPC + triggerWork. The replay of history is basically just a replay of
> > one operation per operand-node (in this thread we are calling it the
> > external-system), so it's as if triggerWork failed, in which case we need
> > to re-execute triggerWork. Idempotency also follows from that requirement:
> > if triggerWork fails in the last step, and all the desired effects have
> > already happened, we will still need to run triggerWork again, but we need
> > awareness that the actual work has been done, which is why idempotency is
> > necessary.
> >
> > Btw, thanks for continuing to spare time for this, I really appreciate
> > this feedback/validation.
> >
> > Thoughts?
> >
> > On Wed, Jan 13, 2016 at 3:47 AM, powell molleti
> > <powellm79@yahoo.com.invalid> wrote:
> > > Wouldn't a distributed queue recipe work for the consumer? Where one
> > > needs to add extra logic, something like this:
> > >
> > > with lock() {                   # hold the lock so only one consumer runs this at a time
> > >     p = queue.peek()            # look at the head item without removing it
> > >     if triggerWork(p) is Done:  # do the work; must be idempotent, items may be re-run
> > >         queue.pop()             # dequeue only once the work is known to be done
> > > }
> > >
> > > With this, a consumer that worked on an item but crashed before popping
> > > the queue would result in another consumer processing the same work.
> > >
> > > I am not sure of the details of where you are getting the work from and
> > > what its scale is, but the producers (the leader) could write to this
> > > queue. Since there is a read-after-write guarantee, the producer could
> > > delete from its local queue the work that was successfully queued. Hence,
> > > again, a new producer could re-send the last entry of work, so one has to
> > > handle that. Without more details on the origin of the work etc. it's
> > > hard to design end to end.
> > >
> > > I do not see a need to do a total replay of past history etc. when using
> > > a ZK-like system, because ZK is built on the idea of a serialized and
> > > replicated log; hence if you are using ZK your design should be much
> > > simpler, i.e. fail and re-start from the last known transaction.
> > >
> > > Powell.
> > >
> > >
> > >
> > > On Tuesday, January 12, 2016 11:51 AM, Alexander Shraer <shralex@gmail.com> wrote:
> > > Hi,
> > >
> > > With your suggestion, the following scenario seems possible: master A is
> > > about to write value X to an external system, so it logs it to ZK, then
> > > freezes for some time; ZK suspects it has failed, another master B is
> > > elected, writes X (completing what A wanted to do), then starts doing
> > > something else and writes Y. Then leader A "wakes up" and re-executes the
> > > old operation, writing X, which is now stale.
> > >
> > > Perhaps if your external system supports conditional updates this can be
> > > avoided: a write of X only works if the current state is as expected.
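> > >
> > > For comparison, ZooKeeper's own API expresses conditional updates with
> > > versions; an external system would need an analogous expected-state
> > > check. A tiny sketch (the method name is made up; assumes the node exists):
> > >
> > >     // The write applies only if nobody has changed the node since we read it.
> > >     static boolean writeIfUnchanged(org.apache.zookeeper.ZooKeeper zk,
> > >                                     String path, byte[] newValue) throws Exception {
> > >         org.apache.zookeeper.data.Stat stat = zk.exists(path, false);
> > >         try {
> > >             zk.setData(path, newValue, stat.getVersion());  // conditional on the version
> > >             return true;
> > >         } catch (org.apache.zookeeper.KeeperException.BadVersionException e) {
> > >             return false;                                   // a newer writer got there first
> > >         }
> > >     }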
> > >
> > > Alex
> > >
> > >
> > > On Tue, Jan 5, 2016 at 1:00 AM, singh.janmejay <singh.janmejay@gmail.com> wrote:
> > >
> > >> Thanks for the replies everyone, most of it was very useful.
> > >>
> > >> @Alexander: The section of the Chubby paper you pointed me to tries to
> > >> address exactly what I was looking for. That clearly is one good way of
> > >> doing it. I'm also thinking of an alternative way and could use a review
> > >> or some feedback on that.
> > >>
> > >> @Powell: About x509 auth on intra-cluster communication, I don't have a
> > >> blocking need for it, as it can be achieved by setting up firewall rules
> > >> to accept connections only from desired hosts. It may be a good-to-have
> > >> thing though, because in cloud-based scenarios where IP addresses are
> > >> re-used, a recycled IP can still talk to a secure zk-cluster unless the
> > >> config is changed to remove the old peer IP and replace it with the new
> > >> one. It's clearly a corner case though.
> > >>
> > >> Here is the approach I'm thinking of:
> > >> - Implement all operations (at least master-triggered operations) on
> > >> operand machines idempotently.
> > >> - Have the master journal these operations to ZK before issuing the RPC
> > >> (a sketch follows the list).
> > >> - In case the master fails with some of these operations in flight, the
> > >> newly elected master will need to read all issued (but not yet retired)
> > >> operations and issue them again.
> > >> - The existing master (before or after a failure) can retry and retire
> > >> operations according to whatever the retry policy and success criterion
> > >> is.
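> > >>
> > >> A small sketch of the journaling step from the second bullet (plain ZK
> > >> Java API; the /ops path and payload format are placeholders):
> > >>
> > >>     // Journal the operation durably in ZK *before* the RPC goes out, so a new
> > >>     // master can find and re-issue anything in flight when the old master died.
> > >>     static String journal(org.apache.zookeeper.ZooKeeper zk, byte[] opPayload)
> > >>             throws Exception {
> > >>         return zk.create("/ops/op-", opPayload,
> > >>                 org.apache.zookeeper.ZooDefs.Ids.OPEN_ACL_UNSAFE,
> > >>                 org.apache.zookeeper.CreateMode.PERSISTENT_SEQUENTIAL);
> > >>     }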
> > >>
> > >> Why am I thinking of this as opposed to going with Chubby-style
> > >> sequencer passing:
> > >> - I need to implement idempotency regardless, because the recovery path
> > >> involving master death after successful execution of an operation but
> > >> before writing the ack to the coordination service requires it. So the
> > >> idempotent-implementation complexity is here to stay.
> > >> - I would need to increase the surface area of the architecture that is
> > >> exposed to the coordination service for sequencer validation, which may
> > >> put a verification RPC in the data plane in some cases.
> > >> - The sequencer may expire after verification but before the ack, in
> > >> which case the new master may not recognize the operation as consistent
> > >> with its decisions (or previous decision path).
> > >>
> > >> Thoughts? Suggestions?
> > >>
> > >>
> > >>
> > >> On Sun, Jan 3, 2016 at 2:18 PM, Alexander Shraer <shralex@gmail.com>
> > >> wrote:
> > >> > regarding atomic multi-znode updates -- check out "multi" updates
> > >> > <http://tdunning.blogspot.com/2011/06/tour-of-multi-update-for-zookeeper.html>.
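> > >> >
> > >> > e.g., a minimal sketch of a multi (the paths and data here are arbitrary):
> > >> >
> > >> >     // Both updates commit atomically, or neither does.
> > >> >     static void atomicPair(org.apache.zookeeper.ZooKeeper zk,
> > >> >                            byte[] cfg, byte[] epoch) throws Exception {
> > >> >         zk.multi(java.util.Arrays.asList(
> > >> >                 org.apache.zookeeper.Op.setData("/app/config", cfg, -1),  // -1 = any version
> > >> >                 org.apache.zookeeper.Op.create("/app/config-epoch", epoch,
> > >> >                         org.apache.zookeeper.ZooDefs.Ids.OPEN_ACL_UNSAFE,
> > >> >                         org.apache.zookeeper.CreateMode.PERSISTENT)));
> > >> >     }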
> > >> >
> > >> > On Sat, Jan 2, 2016 at 10:45 PM, Alexander Shraer <shralex@gmail.com> wrote:
> > >> >
> > >> >> for 1, see the Chubby paper
> > >> >> <http://static.googleusercontent.com/media/research.google.com/en//archive/chubby-osdi06.pdf>,
> > >> >> section 2.4.
> > >> >> for 2, I'm not sure I fully understand the question, but essentially
> > >> >> ZK guarantees that consistency of updates is preserved even during
> > >> >> failures. The user doesn't need to do anything in particular to
> > >> >> guarantee this, even during leader failures. In such a case, some
> > >> >> suffix of the operations executed by the leader may be lost if they
> > >> >> weren't previously acked by a majority. However, none of these
> > >> >> operations could have been visible to reads.
> > >> >>
> > >> >> On Fri, Jan 1, 2016 at 12:29 AM, powell molleti <powellm79@yahoo.com.invalid> wrote:
> > >> >>
> > >> >>> Hi Janmejay,
> > >> >>> Regarding question 1: if a node takes a lock and the lock has timed
> > >> >>> out from the system's perspective, then it can mean that other nodes
> > >> >>> are free to take the lock and work on the resource. Hence the history
> > >> >>> could be well into the future by the time the previous node discovers
> > >> >>> the time-out. Whether rollback is needed in that specific context
> > >> >>> depends on the implementation details: is the lock holder updating
> > >> >>> some common area? Then there could be corruption, since other nodes
> > >> >>> are free to write in parallel with the first one. In the usual sense,
> > >> >>> a time-out of a held lock means the node which held the lock is dead.
> > >> >>> It is up to the implementation to ensure this, and, using this
> > >> >>> primitive, a time-out means other nodes are sure that no one else is
> > >> >>> working on the resource and hence can move forward.
> > >> >>> Question 2 seems to assume that the leader has significant work to do
> > >> >>> and that leader change is quite common, which seems contrary to the
> > >> >>> common implementation pattern. If the work can be broken down into
> > >> >>> smaller chunks which need to be serialized separately, then each
> > >> >>> chunk/work type can have a different leader.
> > >> >>> For question 3, ZK does support auth and encryption for client
> > >> >>> connections but not for inter-ZK-node channels. Do you have a
> > >> >>> requirement to secure the inter-ZK-node channels? Can you let us know
> > >> >>> what your requirements are so we can implement a solution to fit all
> > >> >>> needs?
> > >> >>> For question 4, the official client implementation is C; people tend
> > >> >>> to wrap that with C++, and there should be projects that use ZK doing
> > >> >>> that. You can look them up and see if you can separate the wrapper
> > >> >>> out and use it.
> > >> >>> Hope this helps.
> > >> >>> Powell.
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>> On Thursday, December 31, 2015 8:07 AM, Edward Capriolo <edward.capriolo@huffingtonpost.com> wrote:
> > >> >>>
> > >> >>>
> > >> >>> Q: What is the best way of handling distributed-lock expiry? The
> > >> >>> owner of the lock managed to acquire it and may be in the middle of
> > >> >>> some computation when the session or lock expires.
> > >> >>>
> > >> >>> If you are using Java, a way I can see of doing this is by using the
> > >> >>> ExecutorCompletionService
> > >> >>> <https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html>.
> > >> >>> It allows you to keep your workers in a group; you can poll the group
> > >> >>> and provide cancel semantics as needed.
> > >> >>> An example of that service is here:
> > >> >>> <https://github.com/edwardcapriolo/nibiru/blob/master/src/main/java/io/teknek/nibiru/coordinator/EventualCoordinator.java>
> > >> >>> where I am issuing multiple reads and I want to abandon the process
> > >> >>> if they do not finish in a while. Many async/promise frameworks do
> > >> >>> this by launching two tasks, a ComputationTask and a TimeoutTask that
> > >> >>> returns in 10 seconds. Then they ask the completion service to poll.
> > >> >>> If the service hands back the TimeoutTask after the timeout, that
> > >> >>> means the computation did not finish in time.
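> > >> >>>
> > >> >>> A sketch of the same idea using poll() with a timeout instead of a
> > >> >>> separate TimeoutTask (class and method names are made up):
> > >> >>>
> > >> >>>     import java.util.concurrent.*;
> > >> >>>
> > >> >>>     class BoundedWork {
> > >> >>>         // Run the computation and give up on it if it doesn't finish within
> > >> >>>         // the time budget (e.g. the remaining lock/session lifetime).
> > >> >>>         static <T> T runWithin(ExecutorService pool, Callable<T> work, long millis)
> > >> >>>                 throws Exception {
> > >> >>>             CompletionService<T> cs = new ExecutorCompletionService<>(pool);
> > >> >>>             Future<T> f = cs.submit(work);
> > >> >>>             Future<T> done = cs.poll(millis, TimeUnit.MILLISECONDS);  // null if not done in time
> > >> >>>             if (done == null) {
> > >> >>>                 f.cancel(true);   // abandon the computation; the lock may have expired
> > >> >>>                 return null;
> > >> >>>             }
> > >> >>>             return done.get();
> > >> >>>         }
> > >> >>>     }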
> > >> >>>
> > >> >>> Do people generally take action in the middle of the computation
> > >> >>> (abort it and do it in a clever way such that the effect appears
> > >> >>> atomic, so the abort is not really visible; if so, what are some of
> > >> >>> those clever ways)?
> > >> >>>
> > >> >>> The base issue is that Java's synchronized / AtomicReference do not
> > >> >>> have a rollback.
> > >> >>>
> > >> >>> There are a few ways I know to work around this. Clojure has STM
> > >> >>> (Software Transactional Memory), such that if an exception is thrown
> > >> >>> inside a dosync, all of the steps inside the critical block never
> > >> >>> happened. This assumes you're using all Clojure structures, which you
> > >> >>> are probably not.
> > >> >>> A way co-workers have done this is as follows. Move your entire
> > >> >>> transactional state into a SINGLE big object that you can
> > >> >>> copy/mutate/compare-and-swap. You never need to roll back each piece
> > >> >>> because you're changing the clone up until the point you commit it.
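> > >> >>>
> > >> >>> Roughly, a toy sketch of that single-big-object pattern:
> > >> >>>
> > >> >>>     import java.util.concurrent.atomic.AtomicReference;
> > >> >>>     import java.util.function.UnaryOperator;
> > >> >>>
> > >> >>>     class StateHolder<S> {
> > >> >>>         // All mutable state lives in one snapshot object; writers copy it, change
> > >> >>>         // the copy, and compare-and-swap it in. A failed attempt leaves no trace.
> > >> >>>         private final AtomicReference<S> current;
> > >> >>>
> > >> >>>         StateHolder(S initial) { current = new AtomicReference<>(initial); }
> > >> >>>
> > >> >>>         S update(UnaryOperator<S> transform) {
> > >> >>>             while (true) {
> > >> >>>                 S before = current.get();
> > >> >>>                 S after = transform.apply(before);   // works on a copy, never mutates 'before'
> > >> >>>                 if (current.compareAndSet(before, after)) {
> > >> >>>                     return after;                    // the CAS is the only commit point
> > >> >>>                 }
> > >> >>>             }
> > >> >>>         }
> > >> >>>     }
> > >> >>>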
> > >> >>> Writing reversal code is possible depending on the problem. There
> > >> >>> are questions to ask like "what if the reversal somehow fails?"
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>> On Thu, Dec 31, 2015 at 3:10 AM, singh.janmejay <singh.janmejay@gmail.com> wrote:
> > >> >>>
> > >> >>> > Hi,
> > >> >>> >
> > >> >>> > Was wondering if there are any reference designs or patterns on
> > >> >>> > handling common operations involving distributed coordination.
> > >> >>> >
> > >> >>> > I have a few questions, and I guess they must have been asked
> > >> >>> > before; I am unsure what to search for to surface the right
> > >> >>> > answers. It'll be really valuable if someone can provide links to
> > >> >>> > relevant "best-practices guides" or "suggestions" per question, or
> > >> >>> > share some wisdom or ideas on patterns to do this in the best way.
> > >> >>> >
> > >> >>> > 1. What is the best way of handling distributed-lock expiry? The
> > >> >>> > owner of the lock managed to acquire it and may be in the middle
> > >> >>> > of some computation when the session or lock expires. When it
> > >> >>> > finishes that computation, it can tell that the lock expired, but
> > >> >>> > do people generally take action in the middle of the computation
> > >> >>> > (abort it and do it in a clever way such that the effect appears
> > >> >>> > atomic, so the abort is not really visible; if so, what are some
> > >> >>> > of those clever ways)? Or is the right thing to do to write
> > >> >>> > reversal code, such that operations can be cleanly undone in case
> > >> >>> > the verification at the end of the computation shows that the lock
> > >> >>> > expired? The latter is obviously a lot harder to achieve.
> > >> >>> >
> > >> >>> > 2. Same as above for leader-election scenarios. The leader
> > >> >>> > generally administers operations on data-systems that take
> > >> >>> > significant time to complete and have significant resource
> > >> >>> > overhead, and RPC to administer such operations synchronously from
> > >> >>> > leader to data-node can't be atomic and can't be made
> > >> >>> > latency-resilient to such a degree that issuing an operation
> > >> >>> > across a large set of nodes in a cluster can be guaranteed to
> > >> >>> > finish without a leader change. What do people generally do in
> > >> >>> > such situations? How are timeouts handled for operations issued
> > >> >>> > using a sequential znode as a per-datanode dedicated queue? How
> > >> >>> > well does it scale, and what are some things to watch out for
> > >> >>> > (operation size, encoding, clustering into one znode for
> > >> >>> > atomicity, etc.)? Or how are atomic operations that need to be
> > >> >>> > issued across multiple data-nodes managed (do they have to be
> > >> >>> > clobbered into one znode)?
> > >> >>> >
> > >> >>> > 3. How do people secure ZooKeeper-based services? Is
> > >> >>> > client-certificate verification the recommended way? How well does
> > >> >>> > this work with the C client? Is inter-zk-node communication done
> > >> >>> > with X509 auth too?
> > >> >>> >
> > >> >>> > 4. What other projects, reference implementations or libraries
> > >> >>> > should I look at for working with the C client?
> > >> >>> >
> > >> >>> > Most of what I have asked revolves around the leader or lock-owner
> > >> >>> > having a false failure (where it doesn't know that the coordinator
> > >> >>> > thinks it has failed).
> > >> >>> >
> > >> >>> > --
> > >> >>> > Regards,
> > >> >>> > Janmejay
> > >> >>> > http://codehunk.wordpress.com
> > >> >>> >
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>>
> > >> >>
> > >> >>
> > >>
> > >>
> > >>
> > >> --
> > >> Regards,
> > >> Janmejay
> > >> http://codehunk.wordpress.com
> > >>
> >
> >
> >
> > --
> > Regards,
> > Janmejay
> > http://codehunk.wordpress.com
> >
>
