incubator-esme-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Markus Kohler <>
Subject Re: Statefulness and algorithms for social networks/graphs
Date Mon, 30 Nov 2009 22:00:43 GMT
Hi David,
Great explanation!

I made a few comments below.


"The best way to predict the future is to invent it" -- Alan Kay

On Mon, Nov 30, 2009 at 9:40 PM, David Pollak <
> wrote:

> Folks,
> Over the last 6 or so months, we've had a bunch of discussions on the list
> about statefulness, REST, and ESME's overall design.  I want to walk
> through
> the design choices I've made for ESME and why "stateless" and other such
> designs fail and are dead wrong for a social networking system.
> There is no such thing as stateless.  Every web site has state.  The state
> may change frequently or may change infrequently.  A web site made up of
> static files has its state based on those static pages.  When those pages
> are changed, the state changes.  State is kept somewhere for all web sites.
> Some web sites will present a different state depending on who is accessing
> the site.  This can be as simple as serving different pages depending on
> the
> IP address or language preference expressed in the HTTP headers.  This is
> sessionful.  The content is calculated based on the request.  This may be
> more sophisticated in terms of authenticating the HTTP request and
> presenting content based on the authentication.
I think what you call "sessionful" is what most people would describe as

> A session for sessionful content may be short-lived (the length of the
> request) or it may be longer lived (typically this is done with an initial
> authentication phase resulting in a shared secret [JSESSIONID] that is
> presented as an authentication proxy in subsequent requests.)
> But no matter the authentication mechanism or the session lifespan, there
> must exist a mechanism for translating the HTTP request into the content
> presented for the session.
> [snip]

Yes. I fully agree that an RDBMS is not ideal to implement an twitter like
messaging service. The issue is that a graph needs to be stored and RDBMS
are already not very good in storing trees (as you said). If ones look at
the various methods to store trees in RDMS (for example, one
will find that all those methods have their drawbacks and most likely will
cause performance issues with large graphs.

> I'm going to sidetrack for a moment.  I had the pleasure of talking over a
> few beers at a baseball game with one of the senior engineers at Facebook.
> We were talking about Facebook's scaling success.  His comment was that it
> was successful but very expensive.  If there were more than 3% cache misses
> from MySQL queries, the system would back up.  If they got more than 2%
> cache misses from the memcached stuff in front of their MySQL servers the
> system would back up.  So, basically Facebook has 195% of their data in
> RAM.
> The net is that O(log n) is only going to work if you've got your entire
> index in the cache of your RDBMS.  Even a dozen disk reads is going to turn
> a 10ms query into a 250ms query and if you've got 1,000 users asking for a
> status update, you'll wind up with disk thrashing and ultimately you will
> not be able to satisfy all of those requests.
> Let's make our discussion more concrete.  I'm assuming that an ESME
> instance
> will support 25,000 users.  On average, a user will follow 100 people (100x
> fan-out of messages).  Users will post one message every 30 minutes (48
> messages a day).  The day lasts 10 hours (this is a reasonable
> approximation
> for peakiness... basically, you're compressing 48 message sends in to a 10
> hour period).  There are 300 days in a year.  These numbers are averages
> and
> there will be some folks who are above average in terms of fan out (the CEO
> will have a 25,000x fan out) and some folks are above average in number of
> messages per day (yeah Anne, I'm lookin' at you... you too Dick.)
> So, that means that each year, there will be 36,000M (36B) mailbox entries.

I don't understand why we would need to store all entries in a cache,
instead of only keeping the last n entries for each user based on some
heuristics such as the last 3 days or something. I would somehow expect that
the probability that a user wants to see a message is exponentially
decreasing with the messages age. For example that someone wants to see  a
message that is the 1000 newest message in his timeline is probably almost

> If each entry costs us 16 bytes of RAM for index purposes, that means we're
> at 576B bytes of index.  There's no way that amount of index will fit in
> RAM.  So, what happens if the average messages/day drops to 1, you're still
> looking at 10GB of index.  Alternatively, you could purge messages after 3
> weeks or limit timelines to a certain number of messages.  That's not
> unreasonable, but it's also adding a constraint to the system to deal with
> limitations of the RDBMS.  There are other alternatives.
> Let me talk memcached for a minute.  In my opinion, memcached means that
> you
> have a failed design.  Memcached means abandoning all the awesome things
> that you get with an RDBMS: a mathematical model, a
> concurrency/transactional model, durability guarantees, etc.
> But, we could move our state from the calculate-on-demand model of the
> to the a calculate once and cache model using memcached.  This means that
> you only take the nasty hits if the cache is not valid.  Putting aside the
> cost of cache invalidation (I haven't covered the costs of updates in this
> discussion because there's no need to go there... the implementation
> failures can be demonstrated with just reads), if you have a simple cache
> invalidation scheme, most of the cache entries will not survive for 15
> minutes (I can go through the math, but I'm going to leave this one to the
> reader).  You risk cache stampedes (more than 1 process rebuilding the
> cache
> entry).  Basically, the naive memcached implementation buys you a little
> bit
> of head room over the naive (non-mailbox) approach.  In order to get more
> than 5x or so improvement (something that will serve a few thousand rather
> than a few hundred users), you need to manipulate the cache entries
> inserting/deleting individual messages.
> The above paragraph in fact leads us in the direction of a better answer.
> But first, let me state that I have proven that an RDBMS cannot be the sole
> locus of state for a social messaging site that services more than a few
> hundred users.  Period.  We must move state somewhere else and manage the
> cached state manually rather than with queries and indexes.  Second, I have
> not discussed short-lived vs. long-lived sessions yet.  I will get to that,
> but first, let's walk through a design that gives us a concurrency model as
> well as the performance we want.
> Imagine a model where you interact with a User with a limited set of
> (asynchronous) messages:
>   - add/remove friend
>   - add message to timeline
>   - post message (the user has created a message and it needs to be
>   processed)
>   - get current timeline (with offsets and number of entries)
> These are the basic messages needed to implement a social messaging site.
> If we guaranty that a User will only process 1 message at a time, we have a
> concurrency model.  It's simple and simple is good.  We have not defined
> how/where Users store there state (it could be on a filesystem, in an
> in a NoSQL store, who knows).  But we can say that adding a message is an
> O(1) operation (prepending to the head of a singly linked list).  Each User
> can have a caching policy (and that caching policy could be dynamic based
> on
> the access characteristics for the User).  The sender of the message
> doesn't
> block on the processing of the message (although the get current timeline
> message will have an asynchronous response that the sender will likely
> block
> on).
> I guess this is the same idea as the one I was talking about above.

> We have changed our abstraction from one where all data (tables and
> indexes)
> are created equal to one where certain data structures are more prominent
> (User and Message) than others (mailbox, friends).
> We have lost something: transactions.  In this model, if I add Dick as a
> friend, I am not guaranteed that I will receive Dick's next update... it
> may
> take time for the messages to propagate to Dick's User and his Message may
> be sent before the "add friend" message gets to him.  In the case of a
> financial transaction, this would be fatal.  In the case of social
> networking, this is a perfectly reasonable trade-off.
I agree it makes sense here to weaken consistency.

> So far, we have not talked about long-lived sessions and how they are
> valuable in such a model... an in particular in ESME.
> If we add one more message to our User, some of the reasons for long-lived
> sessions should become obvious:  updated me on timeline change.  If you can
> register with the User for changes to the timeline it means that we don't
> have to keep asking "are we there yet?"  When state change happens, it's
> instantly propagated out to the listeners.  The alternative is for the
> listeners to ask "are we there yet?" over and over.  The cost of asking
> "are
> we there yet?" is non-trivial as anyone who has traveled with 5 year olds
> can attest to.  Additionally, sometimes, when one if having a conversation,
> it's nice to get an immediate response rather than waiting some polling
> period.  Additionally, with a listener model, the User does not need to
> store the date of each message (give me new messages since xxx) and that
> cuts down cache storage costs by 50% (a big number across 25,000 users).
> So, having a long-lived session has some performance benefits over a
> short-lived session and polling, but this only part of the story.
> One of the ways that RDBMSs get performance (and the way products like
> Oracle distinguish themselves from the likes of MySQL) is the ability to
> cache optimized query plans, cache the right data, and invalidate the right
> caches at the right time.  The same requirements are going to come up in
> When I designed ESME, I changed the model from a Skittr model (1M users on
> a
> single box) to a more enterprise-friendly model.  The key difference is
> that
> I added the "actions" feature where each User got to see each message
> processed in the system and analyze that message for content/context and
> perform certain actions based on that analysis.  Things like "add all
> message containing 'catfood' to my timeline" or forward all messages
> containing "ESME to my followers" or "make an HTTP post of all messages
> from
> my boss to a paging service" or "block 50% of the messages from Joe
> Blabbermouth".  Actions are cool, but they are costly.  It means that every
> message must be compared to every action definition in the system.  This is
> expensive.  If each user has an average of 10 actions, that means each
> message sent will have to be compared against 250,000 actions and if we
> have
> a peak of 5 messages per hour per person, that's 31B comparisons per hour
> at
> peak time or 9M action comparisons per second.  That's load.
> During peak load, we will need to prioritize which Users are processing
> messages/actions such that the system retains responsiveness and can drain
> the load.  Put another way, knowing which Users have associated long-lived
> sessions allows us to prioritize the message processing for those Users.
>  We
> allow more threads to drain the message queues for those Users while
> providing fewer threads for session-less Users.  Yeah, we could prioritize
> on other heuristics, but long-lived session is dead simple and will cost us
> 5K bytes per logged in user.  Not a huge cost and lots of benefit.
I have no issue with some session state and 5K is really low, and therefore
this is not an issue.  I don't get why it has to be in the session's state
because you could as well use the information that a user is online as a
guidance, even if the state would be stored somewhere out of the session.
Wouldn't make a difference I guess and storing it in the session looks

> So, between the existing long-lived session long polling is more efficient
> than shortlived session repeated polling and the upcoming need for message
> prioritization indicate that long-lived sessions are the right design
> choice.
> Also, I hope that the above discussion makes it clear why I am insistent on
> message-oriented APIs rather than document/REST oriented APIs.  ESME's
> design is not traditional and there are fewer tools helping us get the
> implementation right.  On the other hand, implementing ESME on top of a
> relational/REST model cannot be done.  Let's keep our design consistent
> from
> the APIs back.
I'm really not religious about REST, but I would somehow assume that in an
Enterprise context it could be an requirement to send a link to someone else
pointing to a specific potentially old message in a certain Pool. That
sounds to me like a requirement for some kind of REST API.
Would it be costly in your model to get the message nr. X  (+ n  older
messages) in a users timeline?.


> Thanks,
> David
> --
> Lift, the simply functional web framework
> Beginning Scala
> Follow me:
> Surf the harmonics

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message