hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "ZooKeeper/Observers" by HenryRobinson
Date Wed, 15 Jul 2009 12:17:14 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by HenryRobinson:

New page:
= Observers =

== Proposal ==

Observers are a type of participant in a ZooKeeper ensemble that do not take part in the underlying
atomic broadcast protocol that ZK is built upon. In particular, they do not vote on new proposals
and can not become Leaders. 

Observers are informed of all committed proposals in zxid order. Observers may issue proposals,
and will be told about the outcome. 

== Use Cases ==

The main advantage of having Observers is to remove some of the burden of serving clients
from ensemble nodes which are more concerned with correctly executing instances of ZAB in
a timely fashion. Observers are not a critical component of a ZooKeeper ensemble, and therefore
do not have to meet strict timing guarantees. They can therefore be more heavily loaded without
fear of jeopardising the liveness of the ensemble.

This design provides read-scalability; i.e. the ability to add more read-only clients without
compromising write performance. Because Observers do not vote or participate in leader elections
the only impact of a careful implementation on ensemble performance should be the bookkeeping
required to keep track of them at the Leader, plus the messaging costs associated with informing
them of committed proposals.

In the current design, where clients can only connect to Followers, attaching too many clients
can slow down a Follower or even crash it. The behaviour of clients is such that, like a swarm
of insects, they will move to the next Follower which may well suffer the same fate. The performance
of the cluster, and eventually its liveness, would be compromised. Observers can insulate
the core ensemble against these problems. 

 * As a datacenter bridge: Forming a ZK ensemble between two datacenters is a problematic
endeavour as the high variance in latency between the datacenters could lead to false positive
failure detection and partitioning. However if the ensemble runs entirely in one datacenter,
and the second datacenter runs only Observers, partitions aren't problematic as the ensemble
remains connected. Clients of the Observers may still see and issue proposals.

 * As a link to a message bus: Some companies have expressed an interest in using ZK as a
component of a persistent reliable message bus. Observers would give a natural integration
point for this work: a plug-in mechanism could be used to attach the stream of proposals an
Observer sees to a publish-subscribe system, again without loading the core ensemble. 

 * To support dynamic membership: once dynamic membership is enabled it will be common for
nodes that intend to become Followers to connect to the ensemble in order to be ready to be
informed of the membership change that admits them. Observers naturally capture the state
of being connected but not being part of the voting ensemble.

== Proposed Design ==

An Observer is started much like any other ensemble peer. They find and connect to the Leader
through the same mechanism as Followers. Instead of sending a FOLLOWERINFO packet, they send
an OBSERVERINFO packet which tells the Leader that this is an Observer. The Leader from that
point on can distinguish the Observer and doesn't send it any Follower-specific messages.

The Leader and the Observer sync up through the usual mechanism. The Leader then only sends
INFORM messages to the Observer which are sent when a proposal has received enough votes to
be committed. If the Leader receives an ACK from an Observer, or if the Observer receives
a PROPOSAL from a Leader, it's an error.

Clients may connect to the Observer as if it were a Follower, and issue proposals, set watches
and so on. The Observer forwards REQUEST packets to the Leader, and commits the resulting
proposals once INFORM is received. Observers make the same guarantees about ordering of client
proposals that Followers do. Observers will eventually see the whole proposal log due to syncing
with the Leader upon every connection attempt. 

The fact that INFORM is the only message sent to an Observer for a proposal means that the
message cost of Observers is one message per Observer per proposal, compared to three per
Follower. However, this means that INFORM must contain enough information to commit each proposal
(unlike COMMIT messages which just include a zxid and are matched against the already seen
PROPOSAL); so the saving is essentially the ACK / COMMIT message pair. 

=== Backwards Compatibility ===

It is very important not to require a complete cluster restart for these changes, and to maintain
backwards compatibility with existing data. 

There are two cases to consider:

 1. An Observer tries to connect to a pre-Observer cluster: The Observer will succeed in connecting,
but once it sends an OBSERVERINFO packet the Leader will respond that it does not understand
the packet type and close the connection.

 2. A pre-Observer Follower tries to connect to an Observer-aware cluster: The behaviour of
Followers has not been changed. They still receive the same set of messages, and connect via
the same protocols, as before. Therefore they will be able to successfully connect to an upgraded

The data format has not changed, so logs will be backwards compatible. Downgrading will also
be possible to a pre-Observer version.

=== Security ===

Current security in ZK is achieved in two ways: ACLs on individual znodes that enforce policies
at the client, and a whitelist of ensemble nodes in the configuration file. 

Since Observers currently can connect from any source address, this removes all security from
the cluster. We must, at least, implement a whitelist of IP ranges from which an Observer
can connect. At this point security is the same as the current version - to attack a ZK cluster
we must write a compromised Follower and run it from one of the IP addresses in the configuration

We should look into authenticated connections, and if we implement the re-publish use case
must take sure to only re-publish proposals which are authorised by an ACL provided to the

=== Testing ===

We must test the claims made about backwards compatibility. Ideally we would have a running
cluster and a suite of tests that can be run against it as we do a rolling upgrade of the
ensemble nodes. This would be a helpful general tool for any major release.

Observers should be subject to the same test suite that Followers are (including stress tests

We should also test the basic properties of Observers: never receiving a proposal, not voting
to elect a leader (can maybe do this by constructing a situation where a leader would be elected
iff the Observer voted), seeing proposals in zxid order etc.

=== Performance ===

The ensemble performance should remain reasonably unchanged. There are two areas where extra
load is placed on the system:

 1. The Leader must keep track of all Observers. The cost is the same as for each Follower
- in fact in the current implementation FollowerHandlers are used to handle Observer connections.

 2. Observers must receive every proposal via a single INFORM message, adding N messages to
every proposal for N Observers. 

Regression testing of existing benchmarks will help validate these claims. 

== Proposed Implementation ==

Observers duplicate a great deal of functionality from Followers. Therefore I have introduced
a Peer base class that contains common code. There is an accompanying PeerZooKeeperServer
class from which ObserverZooKeeperServer and FollowerZooKeeperServer inherit. 

Observers currently have the same RequestProcessor pipeline as Followers. This might not be
the case in the future. 

Some of the code in the current patch is preparation for the dynamic membership patch which
is built upon this one. (Some of these changes need erasing for the final version of this
patch). There is PeerType enum which describes whether a Peer is a PARTICIPANT (i.e. will
follow if able to) or if it is an OBSERVER. Based on that the LeaderElection process knows
which state to move the Peer into once it has found a leader; from LOOKING to either FOLLOWING

Setting peerType=observer in a server's configuration file will ensure that its PeerType=OBSERVER.
Otherwise it defaults to PARTICIPANT. 

== What's left to do? ==

 1. Different timeout for Observers - can be much more lax about timeouts.
 2. Evaluate impact on JMX
 3. Update four-letter commands
 4. Final version of patch that cleans up rough edges, provides documentation and test cases.
 5. ...

View raw message