esme-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ethan Jewett <>
Subject Re: Statefulness and algorithms for social networks/graphs
Date Mon, 30 Nov 2009 20:59:34 GMT
Wow David, that's a fantastic explanation. Thank you for taking the
time to write it. Should help a lot in the future (especially when I
have to remind myself why I'm messing around with Actors and Futures
in the api2 endpoint :-).

Can we put this or an edited version on the wiki?


On Mon, Nov 30, 2009 at 3:40 PM, David Pollak
<> wrote:
> Folks,
> Over the last 6 or so months, we've had a bunch of discussions on the list
> about statefulness, REST, and ESME's overall design.  I want to walk through
> the design choices I've made for ESME and why "stateless" and other such
> designs fail and are dead wrong for a social networking system.
> There is no such thing as stateless.  Every web site has state.  The state
> may change frequently or may change infrequently.  A web site made up of
> static files has its state based on those static pages.  When those pages
> are changed, the state changes.  State is kept somewhere for all web sites.
> Some web sites will present a different state depending on who is accessing
> the site.  This can be as simple as serving different pages depending on the
> IP address or language preference expressed in the HTTP headers.  This is
> sessionful.  The content is calculated based on the request.  This may be
> more sophisticated in terms of authenticating the HTTP request and
> presenting content based on the authentication.
> A session for sessionful content may be short-lived (the length of the
> request) or it may be longer lived (typically this is done with an initial
> authentication phase resulting in a shared secret [JSESSIONID] that is
> presented as an authentication proxy in subsequent requests.)
> But no matter the authentication mechanism or the session lifespan, there
> must exist a mechanism for translating the HTTP request into the content
> presented for the session.
> Far and away the most common way of persisting and calculating state is in a
> relational database (RDBMS).  RDBMSs are awesome creatures.  They sit on top
> of some excellent and well understood mathematics: set theory.  They have
> well known and well understood concurrency mechanisms: transactions.  They
> have been designed, built, tested, and optimized over the last generation.
> RDBMSs offer a simple set of commands (SELECT, DELETE, INSERT, UPDATE) as
> well as a generally human understandable set of semantics: people understand
> that RDBMSs are a sets of things and there are simple ways to ask about
> these sets.  RDBMSs have evolved along with ERP systems and have evolved to
> meet the needs of these systems.
> However, there are well known things that RDBMSs don't do well that include
> tree structures (yeah, Oracle and others have extensions for tree walks, but
> nothing is part of the SQL spec and the performance of these extensions is
> not always the same as other models: a tree-walk in an RDBMS costs O(log n)
> for each node where a tree walk in an OO system costs O(1)).  Social
> networks/social graphs are another place where RDBMSs do not excel.
> Let's dive down into this.
> A naive implementation of a social messaging site runs something like these
> tables:
>   - Users(id, name, password)
>   - Friends(owner, friend)
>   - Messages(id, poster, content, date)
> So, if we wanted to calculate the timeline for a given user at a given
> instant, the query would look like:
> SELECT messages.* FROM messages, friends WHERE friends.owner = current_user
> AND messages.poster = friends.friend ORDER BY DESC LIMIT 20
> Assuming we've got indexes on friends.owner, messages.poster and
>, the query still results in O(n log n) where n is the
> aggregate number of messages posted.  This is non-trivial and if you follow
> someone who has posted 20,000 messages (yeah Anne, I'm talkin' to you), the
> n log n cost becomes non-trivial.
> Basically, each time a client asks for the latest timeline, you've got an
> O(n log n) operation to determine state.  This doesn't scale.
> The first obvious response to the issue is caching (capturing the state
> beyond the duration of a short-lived session).  I'm going to skip caching
> for a moment and do a more sophisticated implementation of timelines so we
> can get better performance.
> Let's create a mailbox table.  Each time someone publishes a message, a
> reference to that message will be put in a Mailbox(owner, message, date)
> table and we'll create an index on the table: (owner, date DESC)
> This changes the query to:
> SELECT messages.* FROM messages, mailbox WHERE mailbox.owner = current_user
> AND = mailbox.message ORDER BY DESC LIMIT 20
> Depending on your RDBMS, you will wind up with an O(log n) operation.  You
> find the newest mailbox entry by user (O(log n)) and do an index walk until
> you've found 20 entries (I'm putting aside the fact that looking up the 20
> messages is an O(n log n) operation because 20 is a small number and the
> messages will likely be in the database's cache... this operation is going
> to be fast.)
> I'm going to sidetrack for a moment.  I had the pleasure of talking over a
> few beers at a baseball game with one of the senior engineers at Facebook.
> We were talking about Facebook's scaling success.  His comment was that it
> was successful but very expensive.  If there were more than 3% cache misses
> from MySQL queries, the system would back up.  If they got more than 2%
> cache misses from the memcached stuff in front of their MySQL servers the
> system would back up.  So, basically Facebook has 195% of their data in RAM.
> The net is that O(log n) is only going to work if you've got your entire
> index in the cache of your RDBMS.  Even a dozen disk reads is going to turn
> a 10ms query into a 250ms query and if you've got 1,000 users asking for a
> status update, you'll wind up with disk thrashing and ultimately you will
> not be able to satisfy all of those requests.
> Let's make our discussion more concrete.  I'm assuming that an ESME instance
> will support 25,000 users.  On average, a user will follow 100 people (100x
> fan-out of messages).  Users will post one message every 30 minutes (48
> messages a day).  The day lasts 10 hours (this is a reasonable approximation
> for peakiness... basically, you're compressing 48 message sends in to a 10
> hour period).  There are 300 days in a year.  These numbers are averages and
> there will be some folks who are above average in terms of fan out (the CEO
> will have a 25,000x fan out) and some folks are above average in number of
> messages per day (yeah Anne, I'm lookin' at you... you too Dick.)
> So, that means that each year, there will be 36,000M (36B) mailbox entries.
> If each entry costs us 16 bytes of RAM for index purposes, that means we're
> at 576B bytes of index.  There's no way that amount of index will fit in
> RAM.  So, what happens if the average messages/day drops to 1, you're still
> looking at 10GB of index.  Alternatively, you could purge messages after 3
> weeks or limit timelines to a certain number of messages.  That's not
> unreasonable, but it's also adding a constraint to the system to deal with
> limitations of the RDBMS.  There are other alternatives.
> Let me talk memcached for a minute.  In my opinion, memcached means that you
> have a failed design.  Memcached means abandoning all the awesome things
> that you get with an RDBMS: a mathematical model, a
> concurrency/transactional model, durability guarantees, etc.
> But, we could move our state from the calculate-on-demand model of the RDBMS
> to the a calculate once and cache model using memcached.  This means that
> you only take the nasty hits if the cache is not valid.  Putting aside the
> cost of cache invalidation (I haven't covered the costs of updates in this
> discussion because there's no need to go there... the implementation
> failures can be demonstrated with just reads), if you have a simple cache
> invalidation scheme, most of the cache entries will not survive for 15
> minutes (I can go through the math, but I'm going to leave this one to the
> reader).  You risk cache stampedes (more than 1 process rebuilding the cache
> entry).  Basically, the naive memcached implementation buys you a little bit
> of head room over the naive (non-mailbox) approach.  In order to get more
> than 5x or so improvement (something that will serve a few thousand rather
> than a few hundred users), you need to manipulate the cache entries
> inserting/deleting individual messages.
> The above paragraph in fact leads us in the direction of a better answer.
> But first, let me state that I have proven that an RDBMS cannot be the sole
> locus of state for a social messaging site that services more than a few
> hundred users.  Period.  We must move state somewhere else and manage the
> cached state manually rather than with queries and indexes.  Second, I have
> not discussed short-lived vs. long-lived sessions yet.  I will get to that,
> but first, let's walk through a design that gives us a concurrency model as
> well as the performance we want.
> Imagine a model where you interact with a User with a limited set of
> (asynchronous) messages:
>   - add/remove friend
>   - add message to timeline
>   - post message (the user has created a message and it needs to be
>   processed)
>   - get current timeline (with offsets and number of entries)
> These are the basic messages needed to implement a social messaging site.
> If we guaranty that a User will only process 1 message at a time, we have a
> concurrency model.  It's simple and simple is good.  We have not defined
> how/where Users store there state (it could be on a filesystem, in an RDBMS,
> in a NoSQL store, who knows).  But we can say that adding a message is an
> O(1) operation (prepending to the head of a singly linked list).  Each User
> can have a caching policy (and that caching policy could be dynamic based on
> the access characteristics for the User).  The sender of the message doesn't
> block on the processing of the message (although the get current timeline
> message will have an asynchronous response that the sender will likely block
> on).
> We have changed our abstraction from one where all data (tables and indexes)
> are created equal to one where certain data structures are more prominent
> (User and Message) than others (mailbox, friends).
> We have lost something: transactions.  In this model, if I add Dick as a
> friend, I am not guaranteed that I will receive Dick's next update... it may
> take time for the messages to propagate to Dick's User and his Message may
> be sent before the "add friend" message gets to him.  In the case of a
> financial transaction, this would be fatal.  In the case of social
> networking, this is a perfectly reasonable trade-off.
> So far, we have not talked about long-lived sessions and how they are
> valuable in such a model... an in particular in ESME.
> If we add one more message to our User, some of the reasons for long-lived
> sessions should become obvious:  updated me on timeline change.  If you can
> register with the User for changes to the timeline it means that we don't
> have to keep asking "are we there yet?"  When state change happens, it's
> instantly propagated out to the listeners.  The alternative is for the
> listeners to ask "are we there yet?" over and over.  The cost of asking "are
> we there yet?" is non-trivial as anyone who has traveled with 5 year olds
> can attest to.  Additionally, sometimes, when one if having a conversation,
> it's nice to get an immediate response rather than waiting some polling
> period.  Additionally, with a listener model, the User does not need to
> store the date of each message (give me new messages since xxx) and that
> cuts down cache storage costs by 50% (a big number across 25,000 users).
> So, having a long-lived session has some performance benefits over a
> short-lived session and polling, but this only part of the story.
> One of the ways that RDBMSs get performance (and the way products like
> Oracle distinguish themselves from the likes of MySQL) is the ability to
> cache optimized query plans, cache the right data, and invalidate the right
> caches at the right time.  The same requirements are going to come up in
> When I designed ESME, I changed the model from a Skittr model (1M users on a
> single box) to a more enterprise-friendly model.  The key difference is that
> I added the "actions" feature where each User got to see each message
> processed in the system and analyze that message for content/context and
> perform certain actions based on that analysis.  Things like "add all
> message containing 'catfood' to my timeline" or forward all messages
> containing "ESME to my followers" or "make an HTTP post of all messages from
> my boss to a paging service" or "block 50% of the messages from Joe
> Blabbermouth".  Actions are cool, but they are costly.  It means that every
> message must be compared to every action definition in the system.  This is
> expensive.  If each user has an average of 10 actions, that means each
> message sent will have to be compared against 250,000 actions and if we have
> a peak of 5 messages per hour per person, that's 31B comparisons per hour at
> peak time or 9M action comparisons per second.  That's load.
> During peak load, we will need to prioritize which Users are processing
> messages/actions such that the system retains responsiveness and can drain
> the load.  Put another way, knowing which Users have associated long-lived
> sessions allows us to prioritize the message processing for those Users.  We
> allow more threads to drain the message queues for those Users while
> providing fewer threads for session-less Users.  Yeah, we could prioritize
> on other heuristics, but long-lived session is dead simple and will cost us
> 5K bytes per logged in user.  Not a huge cost and lots of benefit.
> So, between the existing long-lived session long polling is more efficient
> than shortlived session repeated polling and the upcoming need for message
> prioritization indicate that long-lived sessions are the right design
> choice.
> Also, I hope that the above discussion makes it clear why I am insistent on
> message-oriented APIs rather than document/REST oriented APIs.  ESME's
> design is not traditional and there are fewer tools helping us get the
> implementation right.  On the other hand, implementing ESME on top of a
> relational/REST model cannot be done.  Let's keep our design consistent from
> the APIs back.
> Thanks,
> David
> --
> Lift, the simply functional web framework
> Beginning Scala
> Follow me:
> Surf the harmonics

View raw message