Since my last summary, some new information has emerged. Gerry Blackwell interviewed Skype's Director of Operations, Michael Jackson, got some comment from Martin Geddes, and wrote an interesting article.
I've also had an email exchange with Julian Cain, which I reprint below, and a brief exchange with Philippe Biondi of SecDev.org, who referred me to an interesting blog post (in French) by his colleague Cédric Blancher at EADS France, Innovation Works (Suresnes).
New information from Michael Jackson (in Gerry Blackwell's article) includes the idea of five supernodes per cell (of 300 users):
Each supernode handles about 300 nearby users. Skype configures five in each cell for redundancy. So with upwards of nine million users online, it takes something like 150,000 supernodes to make Skype work.
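The arithmetic behind that 150,000 figure checks out. A quick sketch, using only the numbers quoted in the article (the formula is my own reading of it):

```python
# Back-of-the-envelope check of the supernode count quoted above.
users_online = 9_000_000      # "upwards of nine million users online"
users_per_cell = 300          # each cell serves about 300 nearby users
supernodes_per_cell = 5       # five supernodes per cell, for redundancy

cells = users_online // users_per_cell       # 30,000 cells
supernodes = cells * supernodes_per_cell     # 150,000 supernodes
print(cells, supernodes)                     # prints: 30000 150000
```

Note that only one in five of those supernodes is doing primary work for a given cell; the rest are redundancy.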
The triggering event is still attributed to massive computer reboots after Microsoft's Patch Tuesday. Everything else in the article is consistent with the best earlier explanations, including Julian's, which I summarized here; as Blackwell puts it:
... the real culprit, Skype now says—was a resource allocation algorithm in the client software that could not adapt to such a set of circumstances. Instead of clients “backing off” on their attempts to validate on the network when supernodes weren’t immediately available and waiting for the ship to right itself, they kept hammering away, trying to log in.
And the solution, having clients back off when supernodes aren't immediately responsive, is obvious. What's left to understand?
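The standard way to implement that kind of back-off is capped exponential delay with randomized jitter. Skype hasn't published its actual algorithm, so the following is only a generic illustration of the behaviour the article describes, with made-up parameter values:

```python
import random
import time

def login_with_backoff(try_login, max_delay=300.0, base_delay=1.0):
    """Retry a login attempt with capped exponential backoff and jitter.

    `try_login` is a caller-supplied function returning True on success.
    This is a generic sketch, not Skype's actual (unpublished) algorithm.
    """
    delay = base_delay
    while not try_login():
        # Sleep a random fraction of the current delay ("full jitter"),
        # so millions of clients don't retry in lock-step.
        time.sleep(random.uniform(0, delay))
        # Double the delay after each failure, up to a ceiling.
        delay = min(delay * 2, max_delay)
```

The jitter matters as much as the exponential growth: clients that all rebooted at the same moment would otherwise retry in synchronized waves, which is exactly the "hammering away" described above.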
1. I still haven't seen a plausible explanation of why Microsoft's "Patch Tuesday" resulted in problems on Thursday morning, only a lot of questions. If the problem was induced by massive reboots, why didn't it happen on Wednesday morning?
2. I still haven't seen a reasonable discussion of scaling. As I wrote back in August,
I wonder, apart from the login server cluster as a single point of failure, is there also a scaling issue? FastTrack's breakthrough was the use of supernodes to make the system more scalable. But was that just one layer of scalability? If so, what happens when there are 300 million on-line users and one million supernodes? Perhaps Julian (or another P2P expert) could comment...
Indeed I emailed Julian about scaling and also about Joost. This was his reply.
On 09/05/2007 05:06 PM, Julian Cain wrote:
Skype and Joost are utterly different; however, Skype is more like FastTrack (Kazaa). Joost's network architecture is mainly "centralized": they have their own server farm of Supernodes as well as authentication and Jabber servers. The nature of Joost is less dependent on peer-to-peer routing, as its basis is tuned towards QoS. Joost peers route traffic and relay UDP-based payloads as media data streams, and keep a small cache of what they have recently viewed; however, currently every Joost peer is directly connected back to the Joost home servers, unlike Skype and Kazaa where, once authentication occurs, it's "out of our hands."
I agree on the extent of Skype scalability being very limited because of the nature of the Supernodes. At any one time a Supernode holds ~300-500 child nodes and maintains an "Overlay" network which consists of another several hundred Supernode-to-Supernode connections. I.e., the Supernode network is very dense in order to provide least-cost routing; however, the flaw in this architecture is when the "Overlay" network reaches capacity and is unable to reliably route traffic. I do not currently have any statistics on how the "Overlay" layer fails to scale, but as more Supernodes arise, the management of the "Decentralized Data Store" becomes a very hard task, as does keeping this "Overlay" in one single, in-sync network. This was proven with the Skype outage: as the network was "trying to heal," it had to start from many tens of thousands of "Overlay" networks, which very slowly were able to sync again as a "single" network; this is still an issue today with presence.
For the current Skype "Overlay" network to scale indefinitely while maintaining a "single" network infrastructure, it needs in place an organizational hierarchy of Supernodes and a level of service for each of these Supernodes. I.e., if Skype Supernodes worked in the same way as in FastTrack, then when the network reached 100 million users it would begin to crawl. This is due to the dense nature of the upper "Overlay". I can only assume that Skype has thought of this, and that when the Supernode ratio begins to "bottleneck" there would be some ordered hierarchy as to what role each Supernode was playing; otherwise, the more Supernodes, the denser the "Overlay", and the more the data is relayed back and forth, before turning the "Supernode Overlay" into its own denial-of-service attack.
I hope this helps to some degree, let me know if you have any other questions.
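Julian's density argument can be made concrete with a toy calculation. The per-node figure of "several hundred" Supernode-to-Supernode links is from his email; everything else below is illustrative, not Skype's actual parameters:

```python
def flat_overlay_links(supernodes, links_per_node=300):
    """Total directed links in a flat overlay where every supernode
    keeps a fixed budget of ~300 peer links (illustrative figure)."""
    return supernodes * links_per_node

def reachable_fraction(supernodes, links_per_node=300):
    """Fraction of the other supernodes each node can reach directly.
    As the overlay grows, this shrinks: routing and data-store sync
    must rely on ever more multi-hop relaying."""
    return min(1.0, links_per_node / max(supernodes - 1, 1))

for n in (1_000, 150_000, 1_000_000):
    print(n, f"{reachable_fraction(n):.2%}")
```

At 1,000 supernodes each node directly sees about a third of the overlay; at 150,000 it sees roughly 0.2%, and at a million, 0.03%. With a fixed per-node link budget, "one single in-sync network" requires more and more relayed traffic as the overlay grows, which is the bottleneck (and the self-inflicted denial-of-service risk) Julian describes; a hierarchy of Supernode roles is one way to bound that relaying.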
So to the extent I have time to look into P2P technology further, I plan to explore what's been written about hierarchy in P2P networks. Here are some references (which I've found but have not yet read):
RFC 4981 on Survey of Research towards Robust Peer-to-Peer Networks: Search Methods
Hierarchical Peer-to-peer Systems by L. Garces-Erice, E.W. Biersack, P.A. Felber, K.W. Ross, and G. Urvoy-Keller.
An Efficient Peer-to-Peer File Sharing Exploiting Hierarchy and Asymmetry, by G. Kwon and K. D. Ryu, in Proceedings of the 2003 Symposium on Applications and the Internet, 27-31 Jan. 2003, pp. 226-233.
< unfortunately only available on an IEEE pay-for site >