zookeeper-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Priority Queue?
Date Fri, 15 Jul 2011 01:28:50 GMT
On Thu, Jul 14, 2011 at 5:46 PM, Jordan Zimmerman <jzimmerman@netflix.com> wrote:

> But the implication is still the same. After each item is processed,
> getChildren() will get called. This seems incredibly inefficient to me.
> Further, the next item must wait until the getChildren() returns.
>

I don't think that a wait is required at all.  There will inevitably be
some delay between the insertion of a new item and the recognition of that
item by all of the workers.  If some of the workers have already started on
an out-of-order item, or if they don't find out about a new high-priority
item for a short period of time, that isn't the end of the world.  The
notification will happen shortly after the deletion, in a separate thread.
If additional deletions have happened before the watcher fires, you will
still get only one notice covering multiple changes.
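
Concretely, that read-and-rearm pattern looks something like this against
the plain ZooKeeper Java API (the class name and queue path here are just
for illustration):

    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class QueueWatcher implements Watcher {
        private final ZooKeeper zk;
        private final String queuePath;  // e.g. "/queue"; path is illustrative

        public QueueWatcher(ZooKeeper zk, String queuePath) {
            this.zk = zk;
            this.queuePath = queuePath;
        }

        // Read the current queue contents and leave a one-shot watch behind.
        public List<String> readQueue()
                throws KeeperException, InterruptedException {
            // However many children were created or deleted since the last
            // read, this returns all of them in a single call.
            return zk.getChildren(queuePath, this);
        }

        @Override
        public void process(WatchedEvent event) {
            // Fires at most once per registration, no matter how many
            // changes piled up; calling readQueue() again re-arms it.
            // Hand the refreshed child list to the dispatch logic here.
        }
    }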

> Imagine a system that is putting hundreds (1000s?) of items in this queue.
> You're going to get an ugly stampede. For each ZK client processing the
> queue, you will require n (number of messages) trips to getChildren().
>

OK.  Let's take that even further.  Suppose that we have 100,000 todo items
and 1000 workers, that each todo takes 1 second, and that each getChildren
takes 200 ms because of a slow network and such.  Workers will complete
tasks at the rate of 1000 tasks per second and will be sending deletes at
that rate.  The watcher threads in each worker will be continually firing
in order to update the state of the queue, so we will get updates every
200 ms on each client.  The average delay before *some* client finds out
about the most current state that it can see will be less than 1 ms.  The
delay before a new item is reflected in some client's state will be 200 ms
(because that is how long it takes to find out about something).  On
average, 5000 getChildren calls will be firing each second, as well as 1000
inserts and 1000 deletes, for a total of 7000 operations per second.  This
is entirely doable.
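
Spelling that arithmetic out:

    getChildren:  1000 workers x (1 call / 0.2 s)       = 5000 ops/s
    inserts:      new items arriving at completion rate = 1000 ops/s
    deletes:      one delete per completed task         = 1000 ops/s
                                                          -----------
    total:                                                7000 ops/s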


In more reasonable scenarios, the number of pending tasks will be much,
much smaller, and the time to read the pending ops will probably be as
small as a few milliseconds, possibly tens of milliseconds.  On a fully
active queue, this will lead to the getChildren calls saturating ZooKeeper,
as you mentioned.  This can be mitigated by having each client do the
getChildren only when a local worker thread is available.  If no such
worker is available, then a flag should be set.  A worker that finishes and
finds the flag set should pick the next op and start an asynchronous
getChildren call.  This will decrease the total operations to about 3000
per second.
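
A sketch of that flag, with made-up class and method names, the dispatch
details left application-specific, and the idle check simplified (it is not
fully race-free as written):

    import java.util.List;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.AtomicInteger;
    import org.apache.zookeeper.AsyncCallback.ChildrenCallback;
    import org.apache.zookeeper.ZooKeeper;

    public class ThrottledQueueReader {
        private final ZooKeeper zk;
        private final String queuePath;            // illustrative
        private final AtomicInteger idleWorkers;   // free local threads
        private final AtomicBoolean pendingRead = new AtomicBoolean(false);

        public ThrottledQueueReader(ZooKeeper zk, String queuePath,
                                    int workers) {
            this.zk = zk;
            this.queuePath = queuePath;
            this.idleWorkers = new AtomicInteger(workers);
        }

        // Called when a watch reports that the queue changed.
        public void onQueueChanged() {
            if (idleWorkers.get() > 0) {
                read();
            } else {
                pendingRead.set(true);   // remember that a re-read is owed
            }
        }

        // Called by a worker thread as it finishes a task.
        public void onTaskFinished() {
            idleWorkers.incrementAndGet();
            if (pendingRead.compareAndSet(true, false)) {
                read();                  // asynchronous; the worker never blocks
            }
        }

        private void read() {
            zk.getChildren(queuePath, false,
                    (ChildrenCallback) (rc, path, ctx, children) ->
                            dispatch(children),
                    null);
        }

        private void dispatch(List<String> children) {
            // Pick the highest-priority child and hand it to an idle
            // worker (decrementing idleWorkers); application-specific.
        }
    }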

A higher-performance design involves putting several queue items into each
znode.  Each worker should commit to doing several items on the list and
should send back an atomic update to the znode.  Upon completion,
completion notifications should be batched up for a short period of time
before doing another atomic update, together with a scan for changed task
lists.  With good batching, performance in this model can be very high and
latency can be bounded.  With the previous scenario, but with a 10 ms task
time, we can keep the number of ops down to about 4000 per second while
pushing through 100x more operations.  You still have a roughly 1 second
delay before all workers switch to higher-priority tasks, but only a few
milliseconds before *some* worker switches to them.
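
The atomic update maps naturally onto a version-checked setData, which
ZooKeeper rejects if the znode changed since it was read.  A sketch,
assuming a batch znode whose data encodes the task list (the class name,
helper, and encoding are made up for illustration):

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class BatchClaimer {
        // Try to claim some tasks inside a batch znode.  Returns false if
        // another worker got there first, in which case the caller should
        // re-read and retry against the fresher data.
        public static boolean tryClaim(ZooKeeper zk, String batchPath,
                                       String workerId)
                throws KeeperException, InterruptedException {
            Stat stat = new Stat();
            byte[] raw = zk.getData(batchPath, false, stat);
            String taskList = new String(raw, StandardCharsets.UTF_8);
            String updated = claimSome(taskList, workerId);
            try {
                // Passing the expected version turns setData into a
                // compare-and-swap: ZooKeeper rejects the write if the
                // znode changed since we read it.
                zk.setData(batchPath,
                           updated.getBytes(StandardCharsets.UTF_8),
                           stat.getVersion());
                return true;
            } catch (KeeperException.BadVersionException e) {
                return false;            // lost the race
            }
        }

        private static String claimSome(String taskList, String workerId) {
            // Mark a few unclaimed entries with workerId; the encoding of
            // the task list is up to the application.
            return taskList;             // placeholder
        }
    }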

These ideas of batching tasks and soft recognition of high-priority tasks
are common in distributed systems, since they allow you to amortize the
cost of coordination over a larger number of operations.  If you require no
batching and hard-and-fast recognition of high-priority items, then you
inherently get high-frequency broadcast storms, which are the simple
consequence of those hard constraints.
