esme-dev mailing list archives

From David Pollak <feeder.of.the.be...@gmail.com>
Subject Statefulness and algorithms for social networks/graphs
Date Mon, 30 Nov 2009 20:40:21 GMT
Folks,

Over the last 6 or so months, we've had a bunch of discussions on the list
about statefulness, REST, and ESME's overall design.  I want to walk through
the design choices I've made for ESME and why "stateless" and other such
designs fail and are dead wrong for a social networking system.

There is no such thing as stateless.  Every web site has state.  The state
may change frequently or may change infrequently.  A web site made up of
static files has its state based on those static pages.  When those pages
are changed, the state changes.  State is kept somewhere for all web sites.

Some web sites will present a different state depending on who is accessing
the site.  This can be as simple as serving different pages depending on the
IP address or language preference expressed in the HTTP headers.  This is
sessionful.  The content is calculated based on the request.  This may be
more sophisticated in terms of authenticating the HTTP request and
presenting content based on the authentication.

A session for sessionful content may be short-lived (the length of the
request) or it may be longer-lived (typically via an initial
authentication phase resulting in a shared secret [JSESSIONID] that is
presented as an authentication proxy in subsequent requests).

But no matter the authentication mechanism or the session lifespan, there
must exist a mechanism for translating the HTTP request into the content
presented for the session.

Far and away the most common way of persisting and calculating state is a
relational database management system (RDBMS).  RDBMSs are awesome
creatures.  They sit on top of some excellent and well understood
mathematics: set theory.  They have well known and well understood
concurrency mechanisms: transactions.  They have been designed, built,
tested, and optimized over the last generation.  RDBMSs offer a simple set
of commands (SELECT, DELETE, INSERT, UPDATE) as well as a generally
human-understandable set of semantics: people understand that RDBMSs hold
sets of things and that there are simple ways to ask questions about those
sets.  RDBMSs have evolved alongside ERP systems to meet the needs of those
systems.

However, there are well known things that RDBMSs don't do well, and tree
structures are one of them (yeah, Oracle and others have extensions for
tree walks, but nothing is part of the SQL spec and the performance of
these extensions is not always the same as other models: a tree walk in an
RDBMS costs O(log n) per node, whereas a tree walk in an OO system costs
O(1) per node).  Social networks/social graphs are another place where
RDBMSs do not excel.

Let's dive down into this.

A naive implementation of a social messaging site uses something like these
tables:

   - Users(id, name, password)
   - Friends(owner, friend)
   - Messages(id, poster, content, date)

So, if we wanted to calculate the timeline for a given user at a given
instant, the query would look like:

SELECT messages.* FROM messages, friends WHERE friends.owner = current_user
AND messages.poster = friends.friend ORDER BY messages.date DESC LIMIT 20

Assuming we've got indexes on friends.owner, messages.poster, and
messages.date, the query still costs O(n log n), where n is the aggregate
number of messages posted.  If you follow someone who has posted 20,000
messages (yeah Anne, I'm talkin' to you), that n log n cost becomes
decidedly non-trivial.

Basically, each time a client asks for the latest timeline, you've got an
O(n log n) operation to determine state.  This doesn't scale.

The first obvious response to the issue is caching (capturing the state
beyond the duration of a short-lived session).  I'm going to skip caching
for a moment and do a more sophisticated implementation of timelines so we
can get better performance.

Let's create a mailbox table.  Each time someone publishes a message, a
reference to that message will be put in a Mailbox(owner, message, date)
table, and we'll create an index on the table: (owner, date DESC).

This changes the query to:

SELECT messages.* FROM messages, mailbox WHERE mailbox.owner = current_user
AND messages.id = mailbox.message ORDER BY mailbox.date DESC LIMIT 20

Depending on your RDBMS, you will wind up with an O(log n) operation.  You
find the newest mailbox entry for the user (O(log n)) and do an index walk
until you've found 20 entries (I'm putting aside the fact that fetching the
20 messages themselves is another 20 O(log n) lookups, because 20 is a
small number and the messages will likely be in the database's cache...
this operation is going to be fast).
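
To make the write side concrete, here is a minimal sketch of the
fan-out-on-write step in Scala over plain JDBC, using the table and column
names from the discussion above (this is illustrative, not ESME's actual
persistence code):

import java.sql.Connection

// Hypothetical fan-out-on-write: when a user posts, copy a reference to
// the new message into every follower's mailbox in a single statement.
def fanOutMessage(conn: Connection, messageId: Long, posterId: Long): Int = {
  val stmt = conn.prepareStatement(
    """INSERT INTO mailbox (owner, message, date)
      |SELECT f.owner, ?, CURRENT_TIMESTAMP
      |FROM friends f
      |WHERE f.friend = ?""".stripMargin)
  try {
    stmt.setLong(1, messageId)
    stmt.setLong(2, posterId)
    stmt.executeUpdate() // one mailbox row per follower
  } finally stmt.close()
}

We pay one insert per follower at write time so that every read becomes a
cheap index walk.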

I'm going to sidetrack for a moment.  I had the pleasure of talking over a
few beers at a baseball game with one of the senior engineers at Facebook.
We were talking about Facebook's scaling success.  His comment was that it
was successful but very expensive.  If there were more than 3% cache misses
from MySQL queries, the system would back up.  If they got more than 2%
cache misses from the memcached layer in front of their MySQL servers, the
system would back up.  So, basically, Facebook keeps 195% of their data in
RAM: nearly a full copy in memcached and another in the MySQL caches.

The net is that O(log n) is only going to work if you've got your entire
index in the cache of your RDBMS.  Even a dozen disk reads will turn a 10ms
query into a 250ms query, and if you've got 1,000 users asking for a status
update, you'll wind up with disk thrashing and ultimately you will not be
able to satisfy all of those requests.

Let's make our discussion more concrete.  I'm assuming that an ESME instance
will support 25,000 users.  On average, a user will follow 100 people (100x
fan-out of messages).  Users will post one message every 30 minutes (48
messages a day).  The day lasts 10 hours (this is a reasonable approximation
for peakiness... basically, you're compressing 48 message sends into a
10-hour period).  There are 300 days in a year.  These numbers are averages,
and there will be some folks who are above average in terms of fan-out (the
CEO will have a 25,000x fan-out) and some folks who are above average in
number of messages per day (yeah Anne, I'm lookin' at you... you too Dick.)

So, that means that each year, there will be 36,000M (36B) mailbox entries.
If each entry costs us 16 bytes of RAM for index purposes, that means we're
at 576GB of index.  There's no way that amount of index will fit in RAM.
Even if the average drops to one message per day, you're still looking at
on the order of 10GB of index.  Alternatively, you could purge messages
after 3 weeks or limit timelines to a certain number of messages.  That's
not unreasonable, but it's also adding a constraint to the system to deal
with limitations of the RDBMS.  There are other alternatives.
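
For the skeptical, the arithmetic is easy to check.  A throwaway Scala
sketch of the numbers above (every figure is one of the stated averages):

val users          = 25000
val followsPerUser = 100   // average fan-out per message
val messagesPerDay = 48
val daysPerYear    = 300
val bytesPerEntry  = 16L   // assumed index cost per mailbox entry

val mailboxEntriesPerYear =
  users.toLong * messagesPerDay * daysPerYear * followsPerUser
// = 36,000,000,000 entries (36B)

val indexBytes = mailboxEntriesPerYear * bytesPerEntry
// = 576,000,000,000 bytes (576GB) -- far more index than will fit in RAM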

Let me talk memcached for a minute.  In my opinion, memcached means that you
have a failed design.  Memcached means abandoning all the awesome things
that you get with an RDBMS: a mathematical model, a
concurrency/transactional model, durability guarantees, etc.

But, we could move our state from the calculate-on-demand model of the RDBMS
to a calculate-once-and-cache model using memcached.  This means that you
only take the nasty hits when the cache is not valid.  Putting aside the
cost of cache invalidation (I haven't covered the costs of updates in this
discussion because there's no need to go there... the implementation
failures can be demonstrated with reads alone), if you have a simple cache
invalidation scheme, most of the cache entries will not survive for 15
minutes (I could go through the math, but I'll leave this one to the
reader).  You risk cache stampedes (more than one process rebuilding the
same cache entry).  Basically, the naive memcached implementation buys you a
little bit of headroom over the naive (non-mailbox) approach.  In order to
get more than a 5x or so improvement (something that will serve a few
thousand rather than a few hundred users), you need to manipulate the cache
entries directly, inserting and deleting individual messages.
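
To illustrate what manipulating the cache entries means, here is a minimal
sketch in Scala, with a ConcurrentHashMap standing in for memcached (the
names are made up for this example).  Instead of invalidating a user's
timeline and forcing an expensive rebuild, the new message id is prepended
in place:

import java.util.concurrent.ConcurrentHashMap

// owner -> newest-first list of message ids
val timelineCache = new ConcurrentHashMap[Long, List[Long]]()

def deliver(owner: Long, messageId: Long, keep: Int = 20): Unit =
  // O(1) prepend for owners already cached; a miss still falls back
  // to the expensive rebuild-from-the-database path.
  timelineCache.computeIfPresent(owner, (_, timeline) =>
    (messageId :: timeline).take(keep))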

This cache-manipulation idea in fact leads us in the direction of a better
answer.

But first, let me state that I have proven that an RDBMS cannot be the sole
locus of state for a social messaging site that services more than a few
hundred users.  Period.  We must move state somewhere else and manage the
cached state manually rather than with queries and indexes.  Second, I have
not discussed short-lived vs. long-lived sessions yet.  I will get to that,
but first, let's walk through a design that gives us a concurrency model as
well as the performance we want.

Imagine a model where you interact with a User with a limited set of
(asynchronous) messages:

   - add/remove friend
   - add message to timeline
   - post message (the user has created a message and it needs to be
   processed)
   - get current timeline (with offsets and number of entries)

These are the basic messages needed to implement a social messaging site.
If we guarantee that a User will only process one message at a time, we
have a concurrency model.  It's simple, and simple is good.  We have not
defined how/where Users store their state (it could be on a filesystem, in
an RDBMS, in a NoSQL store, who knows).  But we can say that adding a
message is an O(1) operation (prepending to the head of a singly linked
list).  Each User can have a caching policy (and that caching policy could
be dynamic, based on the access characteristics for the User).  The sender
of a message doesn't block on the processing of the message (although the
get-current-timeline message will have an asynchronous response that the
sender will likely block on).
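
Here is a minimal sketch of that protocol in plain Scala, with a
single-threaded executor standing in for an actor's mailbox (ESME itself is
built on Lift and its actors; the names below are illustrative, not ESME's
API):

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future, Promise}

sealed trait UserMsg
case class AddFriend(friend: Long)        extends UserMsg
case class RemoveFriend(friend: Long)     extends UserMsg
case class AddToTimeline(messageId: Long) extends UserMsg
case class PostMessage(text: String)      extends UserMsg
case class GetTimeline(offset: Int, count: Int,
                       reply: Promise[List[Long]]) extends UserMsg

class User(val id: Long) {
  // One thread per User means at most one message is processed at a
  // time: that single guarantee is the whole concurrency model.
  private implicit val mailbox: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())
  private var friends  = Set[Long]()
  private var timeline = List[Long]() // newest first

  def !(msg: UserMsg): Unit = Future {
    msg match {
      case AddFriend(f)     => friends += f
      case RemoveFriend(f)  => friends -= f
      case AddToTimeline(m) => timeline ::= m // O(1) prepend
      case PostMessage(_)   => () // persist, then fan out AddToTimeline to followers
      case GetTimeline(off, n, reply) =>
        reply.success(timeline.slice(off, off + n)) // asynchronous response
    }
  }
}

The sender fires a message and moves on; only get current timeline hands
back something (a Promise) that the caller may choose to block on.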

We have changed our abstraction from one where all data (tables and indexes)
are created equal to one where certain data structures are more prominent
(User and Message) than others (mailbox, friends).

We have lost something: transactions.  In this model, if I add Dick as a
friend, I am not guaranteed that I will receive Dick's next update... it may
take time for the messages to propagate to Dick's User and his Message may
be sent before the "add friend" message gets to him.  In the case of a
financial transaction, this would be fatal.  In the case of social
networking, this is a perfectly reasonable trade-off.

So far, we have not talked about long-lived sessions and how they are
valuable in such a model... and in particular in ESME.

If we add one more message to our User, some of the reasons for long-lived
sessions should become obvious: update me on timeline change.  If you can
register with the User for changes to the timeline, it means we don't have
to keep asking "are we there yet?"  When a state change happens, it's
instantly propagated out to the listeners.  The alternative is for the
listeners to ask "are we there yet?" over and over.  The cost of asking
"are we there yet?" is non-trivial, as anyone who has traveled with 5 year
olds can attest.  Also, when one is having a conversation, it's nice to get
an immediate response rather than waiting out some polling period.
Additionally, with a listener model, the User does not need to store the
date of each message (to answer "give me new messages since xxx"), and that
cuts cache storage costs by 50% (a big number across 25,000 users).
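
A sketch of the listener side, under the same illustrative model (again,
these names are invented for the example, not ESME's API):

class TimelineListeners {
  // Callbacks registered by long-lived sessions; each receives new message ids.
  private var listeners = List[Long => Unit]()

  def register(listener: Long => Unit): Unit =
    synchronized { listeners ::= listener }

  // Called by the User when a message lands in its timeline: the change is
  // pushed to every listener instead of each session polling for it.
  def timelineChanged(messageId: Long): Unit =
    synchronized(listeners).foreach(_(messageId))
}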

So, having a long-lived session has some performance benefits over a
short-lived session and polling, but this is only part of the story.

One of the ways that RDBMSs get performance (and the way products like
Oracle distinguish themselves from the likes of MySQL) is the ability to
cache optimized query plans, cache the right data, and invalidate the right
caches at the right time.  The same requirements are going to come up in
ESME.

When I designed ESME, I changed the model from a Skittr model (1M users on a
single box) to a more enterprise-friendly model.  The key difference is that
I added the "actions" feature, where each User gets to see each message
processed in the system, analyze that message for content/context, and
perform certain actions based on that analysis.  Things like "add all
messages containing 'catfood' to my timeline" or "forward all messages
containing 'ESME' to my followers" or "make an HTTP post of all messages
from my boss to a paging service" or "block 50% of the messages from Joe
Blabbermouth".  Actions are cool, but they are costly: every message must be
compared to every action definition in the system.  This is expensive.  If
each user has an average of 10 actions, each message sent will have to be
compared against 250,000 actions, and if we have a peak of 5 messages per
hour per person, that's 31B comparisons per hour at peak time, or roughly
9M action comparisons per second.  That's load.
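
To see where that cost comes from, here is a minimal sketch of action
matching (the case classes are invented for illustration):

case class Msg(poster: Long, text: String)
case class Action(owner: Long, matches: Msg => Boolean, perform: Msg => Unit)

// The expensive part: every message is tested against every action in the
// system -- 25,000 users x 10 actions = 250,000 predicate tests per message.
def processMessage(msg: Msg, allActions: Seq[Action]): Unit =
  allActions.filter(_.matches(msg)).foreach(_.perform(msg))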

During peak load, we will need to prioritize which Users are processing
messages/actions so that the system retains responsiveness and can drain
the load.  Put another way, knowing which Users have associated long-lived
sessions allows us to prioritize the message processing for those Users.  We
allow more threads to drain the message queues for those Users while
providing fewer threads for session-less Users.  Yeah, we could prioritize
on other heuristics, but long-lived sessions are dead simple to detect and
will cost us about 5KB per logged-in user.  Not a huge cost, and lots of
benefit.

So, the fact that long polling over a long-lived session is more efficient
than repeated polling over short-lived sessions, together with the coming
need for message prioritization, indicates that long-lived sessions are the
right design choice.

Also, I hope that the above discussion makes it clear why I am insistent on
message-oriented APIs rather than document/REST-oriented APIs.  ESME's
design is not traditional, and there are fewer tools helping us get the
implementation right.  On the other hand, implementing ESME on top of a
relational/REST model cannot be done.  Let's keep our design consistent from
the APIs back.

Thanks,

David

-- 
Lift, the simply functional web framework http://liftweb.net
Beginning Scala http://www.apress.com/book/view/1430219890
Follow me: http://twitter.com/dpp
Surf the harmonics
