Fundamental design decisions have a big impact on how a network behaves under stress. I just noticed two otherwise unrelated posts both touching on network design issues that result in (or protect against) network collapse under stress.
Yesterday, I wrote about August's Skype outage where the system took over 36 hours to recover from a collapse. The issue was too many clients attempting to reconnect at once. Instead of clients “backing off” when supernodes weren’t immediately available, they kept hammering away, trying to log in.
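To make "backing off" concrete, here is a minimal sketch of capped exponential backoff with random jitter. This is not Skype's actual client logic; the function name `backoff_delays` and the `base`, `cap` and `attempts` values are assumptions chosen for illustration.

```python
import random


def backoff_delays(base=1.0, cap=300.0, attempts=8, rng=random.random):
    """Yield the wait (in seconds) before each reconnect attempt.

    The ceiling doubles after every failed attempt, up to `cap`, and the
    actual wait is a random fraction of that ceiling so that a large
    population of clients does not retry in lockstep.
    All defaults are illustrative, not taken from any real client.
    """
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


if __name__ == "__main__":
    for n, delay in enumerate(backoff_delays(), start=1):
        print(f"attempt {n}: wait {delay:.1f}s before retrying")
```

The jitter matters as much as the doubling: without it, every client that failed at the same moment would come back at the same moment.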
Last week I wrote about the ways the Internet and the telecoms networks respond to congestion in the backbone. In December 2006, an earthquake off the coast of Taiwan broke multiple fiber cables, causing a congestion collapse on the Internet backbone in significant parts of Asia. The issue was too many computers attempting to reestablish TCP sessions, with the result that there was no capacity left for any session to actually send data.
In both cases there was an externally induced problem, but rather than recovering, the network suffered the equivalent of a self-inflicted denial-of-service attack.
Both of these failures have a fairly simple solution, at least architecturally. Under conditions of severe overload, the system must be able to restrict new attempts (new TCP sessions, new Skype logins, etc.) to some small percentage of the available capacity. This allows the rest of the capacity to serve the logins, sessions or calls that do get through, with the result that what capacity remains is put to good use.
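As a rough sketch of that idea, assume capacity is counted in abstract work units per interval and that a fixed percentage of it is set aside for new attempts. The class name `AdmissionController` and the 10% default are hypothetical, not drawn from any real system.

```python
class AdmissionController:
    """Limit new attempts to a fixed share of capacity per interval.

    `capacity` is the number of work units currently available per
    interval; `new_percent` is the share of it that new attempts
    (logins, session setups) may consume. Both are illustrative.
    """

    def __init__(self, capacity, new_percent=10):
        self.capacity = capacity
        self.new_percent = new_percent
        self.admitted = 0

    def start_interval(self, capacity=None):
        """Reset the counter, optionally updating the available capacity."""
        if capacity is not None:
            self.capacity = capacity
        self.admitted = 0

    def try_admit(self):
        """Return True if a new attempt fits within this interval's budget."""
        budget = self.capacity * self.new_percent // 100
        if self.admitted < budget:
            self.admitted += 1
            return True
        return False  # reject early; the caller should back off
```

Rejected attempts cost almost nothing, so the other 90% of capacity stays available for work that has already been admitted.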
As I commented last week, this is one place where the telecoms industry has the correct architecture. When disaster strikes and subjects some part of the network to overload, it's easy to restrict new call attempts on trunks into the congested area, for example by call gapping. This limits new traffic to what the network can handle. Thus, if only 30% of capacity is available, the network at least handles 30% of the calls, not 3% or zero.
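Call gapping can be sketched in a few lines: once a call is admitted toward the congested destination, further attempts are rejected until a gap timer expires. The class name `CallGap` and the 2-second gap are assumptions for illustration; real switches set the gap through network management controls.

```python
class CallGap:
    """Admit at most one call attempt per gap interval.

    Times are plain numbers (seconds) supplied by the caller, which
    keeps the sketch deterministic and easy to test.
    """

    def __init__(self, gap_seconds):
        self.gap_seconds = gap_seconds
        self.next_allowed = 0.0  # earliest time the next call may pass

    def offer(self, now):
        """Return True if a call arriving at time `now` is let through."""
        if now >= self.next_allowed:
            self.next_allowed = now + self.gap_seconds
            return True
        return False  # gapped: rejected at the edge, before the trunk


if __name__ == "__main__":
    gap = CallGap(gap_seconds=2.0)
    for t in [0.0, 0.5, 1.0, 2.0, 2.1, 4.5]:
        print(t, "admitted" if gap.offer(t) else "gapped")
```

Because the rejection happens before the call reaches the congested trunks, the load offered to the damaged part of the network is bounded by the gap interval no matter how many callers are retrying.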
Here's one place where network architects can learn from established telecoms practice.