hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject some thoughts related to today's Skype outage
Date Wed, 22 Dec 2010 23:20:32 GMT
Today the Skype service suffered a major outage:

"Skype isn’t a network like a conventional phone or IM network – instead, it relies on
millions of individual connections between computers and phones to keep things up and running.
Some of these computers are what we call ‘supernodes’ – they act a bit like phone directories
for Skype. If you want to talk to someone, and your Skype app can’t find them immediately
(for example, because they’re connecting from a different location or from a different device)
your computer or phone will first try to find a supernode to figure out how to reach them.
Under normal circumstances, there are a large number of supernodes available. Unfortunately,
today, many of them were taken offline by a problem affecting some versions of Skype. As Skype
relies on being able to maintain contact with supernodes, it may appear offline for some of you.

What are we doing to help? Our engineers are creating new ‘mega-supernodes’ as fast as
they can, which should gradually return things to normal."
The Skype directory function is a peer-to-peer distributed system. As is common with P2P architectures,
it employs a single global protocol so every peer can communicate. This introduces a global failure
domain: although the Skype directory service is distributed, a weakness in the shared protocol
implementation can trigger a cascading or wide-scale failure. And because P2P architectures are by
design homogeneous, there is pressure to deploy a homogeneous population. This is my understanding,
given the available information, of what happened with Skype today: after a wide-scale failure of
many supernodes in a large homogeneous population, triggered by poison messages of some type, there
was insufficient service remaining, and the surviving supernodes were overwhelmed.
Related perhaps is the Amazon S3 outage from 2008: http://status.aws.amazon.com/s3-20080720.html
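The overload mechanism is easy to sketch as a toy model: surviving nodes absorb the load shed by failed ones, and if the initial loss is large enough, the redistribution itself pushes the rest over capacity. All of the numbers and names below are illustrative assumptions; nothing here reflects Skype's actual supernode counts or capacities.

```python
# Toy model of cascading overload in a homogeneous P2P directory tier.
# Capacities, loads, and node counts are made-up illustrative values.

def surviving_nodes(total, capacity_per_node, total_load, initial_failures):
    """Iterate load redistribution until the population stabilizes or collapses.

    Each surviving node absorbs an equal share of the total load; any node
    pushed past its capacity drops out, shifting its share onto the rest.
    """
    alive = total - initial_failures
    while alive > 0:
        load_per_node = total_load / alive
        if load_per_node <= capacity_per_node:
            return alive          # stable: remaining nodes can carry the load
        alive -= 1                # another node is overwhelmed and drops out
    return 0                      # full cascade: nothing left standing

# 1000 supernodes, each able to serve 120 units, 100,000 units of demand.
# Losing 100 nodes is survivable; losing 300 collapses the whole tier.
print(surviving_nodes(1000, 120, 100_000, 100))  # -> 900 (stable)
print(surviving_nodes(1000, 120, 100_000, 300))  # -> 0 (cascade)
```

Note the cliff: the system tolerates failures right up to the point where per-node load exceeds capacity, and then the cascade takes everything.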

I claim that a cascading failure of a large homogeneous P2P population has a rough equivalence
to the failure of a master in a master-slave architecture.

But meanwhile, in a master-slave architecture, the master is already engineered to handle all
of the coordination traffic for the slaves. We expect the master to fail, because as engineers
we are eternal pessimists, so we prepare a failover plan. The failover targets are likewise
provisioned to handle all of the slaves. On the one hand this is a pain. On the other
hand, recovery is easier to reason about and plan for. I have never witnessed a speedy recovery
from a large-scale failure of a P2P system. I'm not seeing one here either.
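The recovery asymmetry above can be sketched in back-of-envelope terms: promoting a pre-provisioned standby is a fixed cost, while rebuilding a decimated P2P tier scales with how many nodes were lost. The functions and all numbers below are hypothetical illustrations, not measurements of any real system.

```python
# Hedged back-of-envelope comparison of recovery time in the two
# architectures. All parameters are illustrative assumptions.
import math

def master_slave_recovery(slaves, promote_seconds):
    """One pre-provisioned standby is promoted; every slave reconnects to it."""
    return promote_seconds  # recovery cost is independent of slave count

def p2p_recovery(failed_nodes, rebuild_seconds_each, parallelism):
    """Failed supernodes must be rebuilt or replaced, in waves, before
    the directory load normalizes."""
    return math.ceil(failed_nodes / parallelism) * rebuild_seconds_each

print(master_slave_recovery(10_000, 30))        # -> 30, regardless of scale
print(p2p_recovery(5_000, 60, parallelism=50))  # -> 6000, grows with losses
```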
I'm not saying that one of the master-slave or P2P architectures is always better than the other.
Both have their places; one will be more suitable for some use cases than the other. Something
like Skype is only possible with the P2P model. We know some systems where a singleton master
is a scalability limit. :-) (Master-slave != singleton-master, strictly speaking.)
However, sometimes we see proponents of fully decentralized systems look at master-slave architectures
and loudly proclaim "SPOF! SPOF!". It is worth considering that all designs have their own
failure domains.
Best regards,

    - Andy

Problems worthy of attack prove their worth by hitting back.
  - Piet Hein (via Tom White)

