It's been more than ten days since the global Skype outage, so it's time to reconsider what actually happened. The most credible analysis comes not from Skype but from Julian Cain, in a series of comments (here, here and here) he made on a GigaOM article about the outage (or see the single file in "References" below). Julian is lead architect at Pando and was previously head of Mac development for Kazaa at Sharman Networks. So he knows a lot about peer-to-peer networks, and his work at Sharman put him in a position to know quite a bit about the P2P technology that's also used by Skype (and likely by Joost).
Skype's P2P technology evolved from FastTrack, which was originally developed for Kazaa. The Skype network consists of clients and supernodes. Skype distributes client software that includes all the necessary supernode software, so any client with adequate capacity and connectivity can be promoted to supernode. Supernodes dynamically link to other supernodes to support a distributed database and index, the distributed hash table (DHT). For Skype, the DHT layer is responsible for maintaining client presence information, contacts, and icons/avatars, and for handling call routing.
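To make the division of labor concrete, here's a minimal sketch of the two roles just described: ordinary clients attach to a supernode, and the supernodes collectively store presence records in a DHT keyed by user ID. Everything here (the class names, the modulo-hash placement) is my own illustration; Skype's actual wire protocol and DHT design are proprietary and not public.

```python
import hashlib

class Supernode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.store = {}   # this node's slice of the distributed index

class Network:
    """Toy DHT: maps each key to one supernode by hashing."""
    def __init__(self, supernodes):
        self.supernodes = supernodes

    def _owner(self, key):
        # Real DHTs use consistent hashing so that a node joining or
        # leaving only relocates a fraction of the keys; a plain modulo
        # is enough to show the routing idea.
        digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.supernodes[digest % len(self.supernodes)]

    def put_presence(self, user_id, record):
        self._owner(user_id).store[user_id] = record

    def get_presence(self, user_id):
        return self._owner(user_id).store.get(user_id)

# Any client asking "is alice online, and which supernode relays to her?"
# is answered from the DHT, not from a central server.
net = Network([Supernode(i) for i in range(4)])
net.put_presence("alice", {"online": True, "relay": "supernode-2"})
print(net.get_presence("alice"))
```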
But as I pointed out in several posts during the outage, there's also a centralized component to the Skype network: the login servers. Julian refers to them as the "authentication servers" and/or "login/connectivity servers." They are implemented as a single cluster of about 50 machines. As for the root cause of the outage, he asserts:
Skype employees introduced code into the "login/connectivity" server farm that was not compatible with current Skype clients.
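Taking that claim at face value, the failure mode is easy to picture: the centralized login cluster started producing replies that the deployed clients could not handle. The sketch below is purely illustrative (Skype's login protocol is proprietary, and every message shape here is invented); it just shows how one server-side change can break every client at once when login is centralized.

```python
def old_client_parse(reply: dict):
    # Deployed clients expect a particular shape of login reply.
    return reply["session"], reply["supernode_hint"]

def login_server_v2(user):
    # New server code ships a restructured reply...
    return {"auth": {"session": f"tok-{user}", "sn": "supernode-42"}}

try:
    old_client_parse(login_server_v2("alice"))
except KeyError as missing:
    # ...and every client hits the same parse failure simultaneously,
    # because login (unlike calls) has no peer-to-peer fallback.
    print(f"login failed: reply missing field {missing}")
```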
While that was the root cause, it was helped along by other characteristics of the network, notably that each client connects to only one supernode at a time. According to Julian, there are 300+ clients per supernode, and if a supernode goes off-line, the 300 or so clients connected to it must re-enter their "connecting" sequence, i.e., find and connect to another supernode.
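That single-attachment design is what turns one supernode failure into a reconnection stampede. Here's a sketch of the behavior, assuming a client keeps a cached list of supernode addresses and attaches to exactly one at a time; the names and the host-cache mechanism are my assumptions, not documented Skype internals.

```python
import random

class Client:
    def __init__(self, host_cache):
        self.host_cache = list(host_cache)  # known supernode addresses
        self.supernode = None               # attached to exactly one

    def connect(self, is_alive):
        # The "connecting" sequence: probe cached supernodes until one answers.
        candidates = self.host_cache[:]
        random.shuffle(candidates)
        for addr in candidates:
            if is_alive(addr):
                self.supernode = addr
                return addr
        raise ConnectionError("no reachable supernode")

    def on_supernode_lost(self, is_alive):
        # When a supernode dies, all ~300 of its clients run this at once,
        # piling new connection load onto the surviving supernodes.
        self.supernode = None
        return self.connect(is_alive)

alive = {"sn-2", "sn-3"}
client = Client(["sn-1", "sn-2", "sn-3"])
client.connect(lambda addr: addr in alive)
print(client.on_supernode_lost(lambda addr: addr in alive))
```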
A network with 8 million on-line users implies ~27K supernodes, a figure that's consistent with the ~20K supernodes estimated by Desclaux and Kortchinsky in 2005-2006 (see their June 2006 Recon presentation, PDF here). The other point from Desclaux and Kortchinsky's measurements is that each supernode attempts to maintain a list of all other supernodes, which implies a substantial amount of inter-supernode traffic (a quick check of these numbers follows Julian's quote below). This clearly contributed to the slow recovery, during which Julian commented:
Right now there are approximately 10,000 Skype networks instead of one single "in sync" network.
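Here is the arithmetic check promised above, using the ~300 clients-per-supernode ratio Julian gives; the full-mesh assumption comes from Desclaux and Kortchinsky's observation that each supernode tries to track all the others.

```python
users = 8_000_000
clients_per_supernode = 300

supernodes = users // clients_per_supernode
print(f"{supernodes:,} supernodes")          # 26,666 -> the "~27K" above

# If every supernode tracks every other supernode, the network-wide
# state that must be kept in sync is quadratic in the supernode count.
entries = supernodes * (supernodes - 1)
print(f"{entries:,} pairwise list entries")  # ~711 million
```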
So I wonder, apart from the login server cluster as a single point of failure, is there also a scaling issue? FastTrack's breakthrough was the use of supernodes to make the system more scalable. But was that just one layer of scalability? If so, what happens when there are 300 million on-line users and one million supernodes? Perhaps Julian (or another P2P expert) could comment...
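For what it's worth, extrapolating the same arithmetic shows why I ask: growing the user base ~37x grows the "everyone tracks everyone" state roughly 1,400x. (This assumes the ~300 clients-per-supernode ratio holds at that scale, which it may well not.)

```python
for users in (8_000_000, 300_000_000):
    supernodes = users // 300
    full_mesh = supernodes * (supernodes - 1)
    print(f"{users:>11,} users -> {supernodes:>9,} supernodes, "
          f"{full_mesh:.2e} pairwise entries")
```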
References
- Julian Cain's relevant comments, extracted and assembled into a single file.
- Skype traffic during the week of the outage, captured by Phil Wolff of Skype Journal.