This is part 4 in a sequence of posts examining how broadband services actually work. Part 1 looked at ISP concentration ratios. Part 2 examined the effect of averaging traffic from many subscribers. Part 3 considered the impact of congestion on user experience. Here we'll take a finer-grained look at traffic in the access network.
How do we achieve "zero" packet loss? Most websites are hosted in data centers and connect to the backbone over Fast Ethernet or Gigabit Ethernet links. An individual website may be mismanaged, but there is no excuse for packet loss at this stage.
Internet backbone links and backbone-to-backbone hand-offs are extremely high capacity, multiplexing the traffic of thousands to millions of users. At that scale traffic statistics are predictable, and operators compete on the basis of service levels, so packet loss is typically zero.
If there are performance problems, they're usually in the access network. Of course, one bottleneck is the service I've signed up for, for example a DSL link at 6 Mbps down / 1 Mbps up. By definition my traffic is shaped to that service level. The shaping typically happens at the first aggregation point (CMTS, DSLAM, or WISP point-to-multipoint radio) or, on an IP or MAC address basis, at a bandwidth management device further upstream. Any problems that do arise will be in the aggregation points and/or the links between me and the backbone.
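To make the shaping step concrete, here's a minimal sketch of the token-bucket algorithm commonly used for this kind of rate limiting. The 6 Mbps rate and 16 KB burst allowance are illustrative values I've picked, not any vendor's defaults:

```python
import time

class TokenBucket:
    """Minimal token-bucket shaper sketch (illustrative parameters only)."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0        # refill rate in bytes/second
        self.capacity = burst_bytes       # maximum token accumulation
        self.tokens = float(burst_bytes)
        self.last = time.monotonic()

    def allow(self, packet_bytes):
        """True if the packet conforms; a real shaper queues or drops it."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False

shaper = TokenBucket(rate_bps=6_000_000, burst_bytes=16_384)
print(shaper.allow(1500))  # True until the burst allowance is spent
```

The burst allowance is what lets short bursts through at line rate while holding the long-run average to the subscribed speed.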
Since packet loss significantly degrades user experience, we'd like to understand how much over-provisioning a broadband service provider needs in order to minimize packet loss in the access network. To do that we need to look at smaller levels of aggregation (dozens to hundreds of users) and shorter intervals (sub-second rather than 5-minute averages, since few routers buffer more than a few tens or hundreds of milliseconds of traffic).
But first, let's look at some five-minute averages from a WISP with ~300 customers, each with either 1 Mbps or 3 Mbps service.
I don't have sub-second averages for the specific WISP above, but Fergal Toomey, Chief Scientist at Corvil Ltd., has some very detailed measurements on a similar Internet access network providing services to users at 512 Kbps, 1 Mbps, and 2 Mbps. These are available in a whitepaper you can request from Corvil's website. Fergal's measurements were made at the Internet POP looking at traffic headed towards individual users. Note that the measurements date from 2004, when most hosted servers had 100 Mbps connections to the backbone and most computers had TCP stacks configured for a maximum TCP receive window of 32 KB. In this graph, green shows 5-minute averages, light blue shows 500 ms averages, and dark blue shows 5 ms averages. The 5 ms averages are striking!
Web browsers typically use persistent TCP connections to minimize connection overhead and to avoid repeating TCP's slow start. As a result, servers frequently respond to a new request with a full window's worth of data. In 2004 that was 32 KB of data, or ~23 packets of up to 1500 bytes each. Today it could be much more. Absorbing one such burst should be no problem, but to what extent are such bursts coincident?
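The arithmetic behind that ~23-packet figure is worth making explicit, since it also tells us how long each burst occupies the wire (values as given above):

```python
# Size of one full-window burst with the 2004-era values from above.
rwnd = 32 * 1024            # 32 KB maximum TCP receive window
mss = 1460                  # payload of a 1500-byte frame (40 B IP+TCP headers)
packets = -(-rwnd // mss)   # ceiling division -> 23 packets
wire_bytes = packets * 1500
print(packets, wire_bytes)  # 23 packets, 34,500 bytes on the wire

# At a 100 Mbps server uplink, one burst occupies the wire for ~2.8 ms:
print(wire_bytes * 8 / 100e6 * 1000)  # ~2.76 ms
```

At 100 Mbps each burst is brief, but it arrives at twenty to two hundred times the destination subscriber's own link rate, which is why the 5 ms averages look so spiky.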
Beyond web browsing, detailed traffic analysis uncovers a variety of odd behavior and some less savory behavior. For example, one burst was a blast of many 41-byte packets sent by a web application that apparently does a series of 1-byte writes to a TCP socket with the TCP_NODELAY option set, disabling Nagle's algorithm, which would otherwise coalesce the writes. Several thousand packets arrive in quick succession, causing packet loss. Worse yet, follow-on losses occur as retransmissions cause a decaying series of echo bursts.
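For concreteness, here's a hypothetical reconstruction of what that misbehaving sender was doing; the host and payload below are placeholders, since the actual application isn't identified:

```python
import socket

# Hypothetical reconstruction of the misbehaving sender described above.
# With Nagle's algorithm disabled (TCP_NODELAY), each 1-byte write can
# leave the host as its own 41-byte packet: 20 B IP + 20 B TCP + 1 B data.
payload = b"x" * 4096  # stand-in for the application's real output

sock = socket.create_connection(("example.com", 80))  # placeholder host
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
for i in range(len(payload)):
    sock.send(payload[i:i + 1])  # one tiny packet per call; no coalescing
sock.close()
```

With Nagle left enabled, the kernel would batch those writes into full-size segments and the burst would collapse to a handful of packets.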
On a more malicious note, the data includes bursts caused by port scans: many, many minimum-size probe packets attempting to scan all of a subscriber's ports in one enormous burst. Luckily, when these packets are dropped they are not retransmitted.
Beyond these, Fergal Toomey's analysis uncovered a variety of application anomalies that I found unexpected, but which crop up in the real world.
What headroom is really required?
Clearly we could just increase the size of the buffers in the router and smooth out any of these bursts, but that would also increase delay for the duration of the burst and thus introduce significant jitter. A better question is: how much headroom is required to guarantee zero packet loss and no more than 20 ms of incremental delay?
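The 20 ms bound translates directly into a maximum queue depth at a given drain rate (queue = delay × rate), which makes for a quick sanity check on buffer sizing:

```python
# Buffering that corresponds to 20 ms of queuing delay at a given rate:
# delay = queue_bytes * 8 / rate_bps, so queue_bytes = delay * rate_bps / 8.
for rate_mbps in (5, 8, 19):  # rates that figure in the discussion below
    max_queue = 0.020 * rate_mbps * 1e6 / 8
    print(f"{rate_mbps} Mbps -> {max_queue / 1024:.1f} KB of queue at 20 ms")
# 5 Mbps -> 12.2 KB, 8 Mbps -> 19.5 KB, 19 Mbps -> 46.4 KB
```

In other words, at these access-network rates a 20 ms delay budget allows only a few dozen kilobytes of queue, less than two of the full-window bursts described earlier.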
Conveniently, Fergal Toomey has run that calculation on the data set above and the result looks like this:
In this graph, the average traffic (the "mean", shown in light green) peaks just shy of 5 Mbps. Guaranteeing zero packet loss and no more than 20 ms of incremental delay 100% of the time requires 19 Mbps of capacity, or nearly 4x the mean. However, guaranteeing zero packet loss and no more than 20 ms of delay 99% of the time requires only 8 Mbps, a little less than 2x.
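Corvil's actual method is proprietary, but the shape of such a calculation can be sketched: simulate a FIFO queue draining at a candidate capacity over the 5 ms samples, then bisect for the smallest capacity that keeps queuing delay under 20 ms for the desired fraction of intervals. A rough sketch, assuming `samples_bytes` holds the bytes arriving in each 5 ms interval:

```python
def required_capacity(samples_bytes, interval_s=0.005,
                      max_delay_s=0.020, coverage=0.99):
    """Smallest capacity (bps) keeping queuing delay <= max_delay_s for
    the given fraction of intervals. Brute-force sketch, not Corvil's
    actual algorithm."""

    def fraction_ok(capacity_bps):
        drain = capacity_bps * interval_s / 8.0  # bytes drained per interval
        queue, ok = 0.0, 0
        for arrived in samples_bytes:
            queue = max(0.0, queue + arrived - drain)
            delay = queue * 8.0 / capacity_bps   # seconds to drain backlog
            if delay <= max_delay_s:
                ok += 1
        return ok / len(samples_bytes)

    lo, hi = 1e6, 1e9                  # search between 1 Mbps and 1 Gbps
    while hi - lo > 1e4:               # bisect to ~10 kbps resolution
        mid = (lo + hi) / 2.0
        if fraction_ok(mid) >= coverage:
            hi = mid
        else:
            lo = mid
    return hi
```

Setting `coverage=1.0` versus `coverage=0.99` captures the 4x-versus-2x gap in the graph: the last percentile of intervals, dominated by the anomalous bursts, is what drives the capacity requirement.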
Interestingly, while ISP practices tend to be trade secrets, the rule of thumb that 2x headroom provides good service crops up repeatedly.
As a practical matter, real Internet traffic is fairly complex and includes a wide variety of anomalous behaviors. Rules of thumb based on averaging the traffic of a few hundred or even a few thousand subscribers are likely to be very expensive (in terms of excess capacity). A better approach is to continuously monitor packet loss and delay variation (jitter), adding capacity as needed to keep those measures below desired levels.
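The monitoring half of that approach is straightforward to approximate. Below is a sketch using the RFC 3550 smoothed estimator for interarrival jitter, the standard way delay variation is tracked; the delay samples and threshold are illustrative:

```python
def rfc3550_jitter(delay_samples):
    """Smoothed interarrival-jitter estimate (same units as the input),
    per the RFC 3550 estimator: J += (|D| - J) / 16."""
    jitter, prev = 0.0, None
    for d in delay_samples:
        if prev is not None:
            jitter += (abs(d - prev) - jitter) / 16.0
        prev = d
    return jitter

# Illustrative check against the 20 ms objective (samples in seconds),
# e.g. from one-way probe packets through the access network:
if rfc3550_jitter([0.012, 0.015, 0.011, 0.035, 0.014]) > 0.020:
    print("jitter objective exceeded; add capacity")
```

Tracking loss and jitter directly sidesteps the need to model the traffic at all: capacity gets added when, and only when, the measured experience degrades.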
Zero packet loss and less than 20 ms of incremental jitter, more than 99 percent of the time, sounds like a plausible objective to me.
Now if only my broadband service provider were willing to share any information about their service level objectives or performance. Oops, I must be dreaming.