We've all heard or read stories about how iPhone usage has overloaded the AT&T Wireless network, but it's likely that at least some of their problems are the result of configuration errors: specifically, congestion collapse induced by misconfigured buffers in their mobile core network.
In early September, David Reed sent this interesting message to the IRTF's "end-to-end" email list. List members include some of the world's experts on Internet protocols. Over the next couple of days, there were more than 40 messages in related threads. While some of these experts were over-thinking the problem, if you are patient enough to read through the many messages, what emerges is clear. At least in the case David measured (from a hotel room in Chicago, with 5 bars of signal strength, using an AT&T Mercury 3G data modem in his laptop), the terrible throughput and extreme delays he experienced appear to result from overly large buffers in the routers and/or switches in AT&T's core network. Note: if you don't want to read all the list messages, the short summary is: ping times over 8 seconds! What's more, the effect was bimodal: ping times were either under 200 ms or over 5 seconds.
Recently I was talking with a friend whose company continuously operates (and monitors) multiple 3G data links on the Verizon Wireless, AT&T Wireless and Sprint PCS networks. His data shows periods when the round trip time for HTTP requests goes over 8 seconds, on the AT&T Wireless network only! I don't have a copy of his data that I can examine in detail, but when combined with David Reed's report, it certainly appears AT&T Wireless has configuration problems. If you read on, you'll see this may not be the result of gross stupidity, but someone at AT&T Wireless should be a little embarrassed.
My (techie outsider's) Analysis of What's Happening
Buffers in Packet Networks
Routers (& switches) in a packet network have to include buffering in order to absorb transient traffic bursts. Unfortunately, despite decades of research and operational experience, there is no simple formula for how much buffering is optimal at any given location in a network. If you're interested in more detail, Ravi Prasad has a good review of the literature on pages 10 & 11 of his (April 2008) PhD thesis. But decades of operational experience have yielded some basic precepts and it appears AT&T Wireless is violating at least one basic precept.
The buffer in front of a congested link must induce some packet loss. TCP (the dominant Internet protocol) continuously increases its transmit rate until it experiences packet loss, then it cuts its rate in half and enters a congestion avoidance mode. If the network becomes full but there is no packet loss, each TCP sender will keep increasing its rate, causing the network to suffer a congestion collapse.
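To make the precept above concrete, here is a toy sketch of one TCP-like sender in front of a bottleneck queue. Everything in it is an assumption for illustration: the function name `simulate_aimd`, the link capacity of 20 packets per 200 ms round trip, and the simplification that the sender adds one packet per round and halves on loss (no slow start, no timeouts). It is not a model of any real network, just the additive-increase/halve-on-loss behavior described above.

```python
def simulate_aimd(buffer_pkts, capacity_pkts_per_rtt=20, base_rtt_s=0.2, rounds=400):
    """Toy model: one AIMD sender feeding a FIFO queue in front of a bottleneck.

    Each round stands for one base round-trip time (an assumed 200 ms).
    Returns (worst queueing delay in seconds, number of loss events).
    """
    cwnd = 1.0          # sender's window, in packets
    max_delay = 0.0
    losses = 0
    for _ in range(rounds):
        # Packets beyond what the link itself can hold sit in the queue.
        queue = max(0.0, cwnd - capacity_pkts_per_rtt)
        if queue > buffer_pkts:
            # Buffer overflow: a packet is dropped and the sender halves its window.
            queue = buffer_pkts
            losses += 1
            cwnd = max(1.0, cwnd / 2)
        else:
            # No loss signal: the sender keeps increasing.
            cwnd += 1
        # Time a newly arriving packet waits behind the queued ones.
        max_delay = max(max_delay, queue / capacity_pkts_per_rtt * base_rtt_s)
    return max_delay, losses


if __name__ == "__main__":
    for label, buf in (("small buffer (12 packets)", 12), ("huge buffer (1000 packets)", 1000)):
        delay, losses = simulate_aimd(buffer_pkts=buf)
        print(f"{label}: worst queueing delay {delay:.2f} s, loss events {losses}")
```

With these assumed numbers, the small buffer produces regular losses and keeps queueing delay near 120 ms, while the huge buffer never drops a packet and the delay climbs to several seconds. A real sender would get there much faster because of slow start.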
In the case of a mobile network, the limited resource is over-the-air capacity. Backhaul may also be expensive, but it's relatively easy to overprovision anything else in the operator's network. So the issue is, how big should the buffer be in the last router between the high capacity core network and the actual over-the-air data path to a subscriber? Ideally we'd like enough buffer to absorb momentary packet bursts that, averaged out, don't exceed the available over-the-air capacity. But as soon as the offered traffic exceeds the available over-the-air capacity, we want some packet loss. The complicating factor is the way 3G wireless networks schedule over-the-air traffic.
Jitter in 3G radio networks
One cellular base station serves multiple users, and the quality of the connection to any specific user depends upon instantaneous wireless propagation characteristics. These can vary second by second, and even millisecond by millisecond when a user is moving. To deal with over-the-air losses, the basestation (the "Node-B" in a 3G network) keeps copies of each packet until a positive acknowledgement (ACK) is received, retransmitting the packet if the ACK is not received in time. Of course, retransmissions introduce delay and jitter. Furthermore, at any given instant, some users' wireless links are better than others. In order to maximize the total traffic in a cell, the 3G MAC layer schedules transmissions to individual users based on who has the best instantaneous throughput. This is an efficient solution, but it also introduces different amounts of jitter into each user's data path. Luckily these effects are well understood and not that severe. With HSDPA, the basic transmission time interval is 2 ms, so total delay variation is relatively small. This graph from Jang et al. (1) is typical of measured values in an HSDPA network.
Most jitter is below 15 ms. Measurements of ping latency between 3G wireless devices and the first IP server at the edge of the mobile core network (typically the GGSN) can extend out to over a second as this graph from Mun Choon Chan and R. Ramjee (2) shows:
but most of the time, total IP latency is a few hundred milliseconds. As mentioned earlier, David Reed reported bimodal operation on the AT&T Wireless network with normal behavior yielding ping times under 200 ms.
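The scheduling idea described earlier in this section (serve whichever user has the best instantaneous throughput in each 2 ms interval) can be sketched in a few lines. This is only an illustration under loud assumptions: channel quality is drawn as independent random numbers, the function name `max_rate_schedule` is made up, and real HSDPA schedulers typically add fairness rules that this sketch ignores.

```python
import random

TTI_MS = 2  # HSDPA transmission time interval, from the text above


def max_rate_schedule(num_users=4, num_ttis=1000, seed=1):
    """In every TTI, serve the user whose (randomly drawn) channel is best.

    Returns, per user, the list of gaps (in TTIs) between successive services.
    """
    rng = random.Random(seed)
    last_served = [0] * num_users
    gaps = [[] for _ in range(num_users)]
    for tti in range(1, num_ttis + 1):
        # Assumed channel model: an independent random quality per user per TTI.
        rates = [rng.random() for _ in range(num_users)]
        winner = max(range(num_users), key=lambda u: rates[u])
        gaps[winner].append(tti - last_served[winner])
        last_served[winner] = tti
    return gaps


if __name__ == "__main__":
    for user, user_gaps in enumerate(max_rate_schedule()):
        if user_gaps:
            mean_ms = TTI_MS * sum(user_gaps) / len(user_gaps)
            worst_ms = TTI_MS * max(user_gaps)
            print(f"user {user}: mean wait {mean_ms:.1f} ms, worst wait {worst_ms} ms")
```

The point of the sketch is that each user waits a variable number of intervals between turns, which is where the per-user jitter comes from, and that with a 2 ms interval those waits stay in the millisecond range.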
Likely cause of AT&T's problems
So what is happening in the AT&T Wireless network when ping times go over 8 seconds? We know how a customer's IP packets are passed through an operator's mobile core network. They are tunneled all the way from the handset to the Gateway GPRS Support Node (GGSN), i.e. to the router where the mobile core network connects to other networks. The protocol stacks for this tunneling look like this:
More recent versions replace the ATM and AAL5 with Ethernet and IP, but the user never sees this as user IP data is tunneled across the top of the diagram (carried by PDCP and GTP-U). As a result, user traceroutes can't reveal the detail of what's happening in the core network. The first thing the user can see is the GGSN (the gateway to the next network). So we can't make conclusive measurements from outside the network, but we do know a few more things.
The bottleneck link is the over-the-air link, i.e. the connection from the radio access network (UTRAN) to the Mobile Station (MS) in the above diagram. The critical buffers are therefore those at the UTRAN. In practice the UTRAN includes both the basestations (called Node-Bs) and the Radio Network Controllers (RNCs), which coordinate handovers between basestations (among other things). Because of handovers, the amount of data buffered at the Node-B is relatively small. It's the buffers at the RNC that must be large enough to deal with the delay variations in the radio network and yet small enough to induce packet loss when the network gets congested.
While I don't personally have experience managing a 3G HSDPA network, my impression is UTRAN buffers are normally less than 200 ms. Recently Yerima and Al-Begain presented an interesting paper (3) on buffer management in 3.5G wireless networks in which they concluded that 120 ms buffers were ideal for downlink traffic in a specific UMTS-HSDPA configuration.
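Buffer sizes quoted in milliseconds only turn into bytes once you pick a link rate. The arithmetic below is a rough sketch: the 1 Mbit/s per-user downlink rate is purely an assumed round number (not a measured AT&T or HSDPA figure), and the helper names `buffer_bytes` and `drain_time_s` are made up for illustration.

```python
def buffer_bytes(link_rate_bps, delay_ms):
    """Bytes of buffer that take delay_ms to drain at link_rate_bps."""
    return link_rate_bps * delay_ms // 8000  # 8 bits per byte, 1000 ms per second


def drain_time_s(buffer_size_bytes, link_rate_bps):
    """Seconds a packet waits behind a full buffer draining at link_rate_bps."""
    return buffer_size_bytes * 8 / link_rate_bps


if __name__ == "__main__":
    ASSUMED_RATE_BPS = 1_000_000  # assumption: 1 Mbit/s to one subscriber

    # The 120 ms figure from Yerima and Al-Begain (3), and a 10 second buffer.
    for delay_ms in (120, 10_000):
        size = buffer_bytes(ASSUMED_RATE_BPS, delay_ms)
        print(f"{delay_ms} ms at 1 Mbit/s is {size} bytes")

    # Going the other way: how long a full 1.25 MB buffer takes to drain.
    print(drain_time_s(1_250_000, ASSUMED_RATE_BPS), "seconds")
```

At the assumed rate, a 120 ms buffer is about 15 KB per subscriber, while a buffer that adds 10 seconds of delay is about 1.25 MB. The gap between the two is nearly two orders of magnitude at any link rate, since both scale linearly.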
Zero Packet Loss
It appears AT&T Wireless has configured their RNC buffers so there is no packet loss, i.e. with buffers capable of holding more than ten seconds of data. Zero packet loss may sound impressive to a telephone guy, but it causes TCP congestion collapse and thus doesn't work for the mobile Internet!
(1) 3G and 3.5G Wireless Network Performance Measured from Moving Cars and High-Speed Trains, by Keon Jang, Mongnam Han, Soohyun Cho, Hyung-Keun Ryu, Jaehwa Lee, Youngseok Lee, Sue Moon
(2) TCP/IP performance over 3G wireless links with rate and delay variation, by Mun Choon Chan and R. Ramjee
(3) Dynamic Buffer Management for Multimedia Services in 3.5G Wireless Networks, by Yerima and Al-Begain