
October 25, 2009


Rob Rand

Sounds right. During a DARPA WAN simulation exercise in the mid-90s I was field testing a networked audio unit I had designed. Audio throughput TO San Diego from Virginia was solid, but the return was choked: no audio coming back. It was eventually traced to a defense WAN router in California that was dropping ALL packets heading east after having been slammed earlier by a spike in simulation traffic. The simulation hardware back East was dead-reckoning everything, so they hadn't yet noticed the lack of data from San Diego. The audio stream stoppage lit up the problem. After some more testing and discussion, the sys managers put together a solution and reconfigured the petulant router.

Jeremy Jones

Very interesting, thanks for this.

Xeus Tsu

I do applaud the level of research that went into this; mobile wireless networks are notoriously closed and hard to get information on. However, I think you are missing too much information to know what, exactly, is happening without looking at the network itself.

Is the packet being delayed on the uplink or the downlink? Ping requires a packet to leave and come back. It is very common for the first ping in any series of pings (my experience is with a CDMA2000 network, like Verizon's) to be delayed by several seconds. This is because a negotiation on the RF layer is required once the device has released its RF resources. Since I unfortunately haven't read the referenced article, I would say that 8-second ping times are very possible simply because sending the ping request may require bringing up an RF channel, which does take time; that means it is your modem buffering the packet, not the UTRAN.

Although I haven't been watching the specific complaints of AT&T users closely, the iPhone has another design flaw that makes it very network-unfriendly: in the browser (I'm not sure about other apps), if the data session is idle for 30 seconds, it will release the PDP context. That basically means it releases its IP address and all session state on the data network. As I understand it, this puts tremendous strain on authentication and signaling whenever someone hits a link in the browser; it's almost as if the user had just turned on their phone for every web link.

I am not a spokesman for any Wireless carrier, these opinions are my own.


Three in Australia may have this same problem.

I have seen, on several occasions, ping times "reliably" in excess of 30 seconds on my Three mobile broadband service.

Roger Wolff

The problem is probably that the throughput of the "over the air" part is variable. If most clients are located close to the 3G tower, throughput will be much higher than when everybody is far away. So configuring 120 ms of buffer is difficult, because the throughput is not constant.

(A solution is to timestamp each packet as it comes into the buffer and drop packets AFTER buffering if they are too old.)
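The timestamp-and-drop idea above (bounding queueing *delay* rather than queue *length*) can be sketched in a few lines. This is a toy illustration with made-up names, not any vendor's implementation; later AQM work such as CoDel formalized the same approach:

```python
from collections import deque
import time

class AgeDropQueue:
    """Toy FIFO that timestamps packets on entry and, on dequeue,
    discards any packet that has waited longer than max_age seconds."""

    def __init__(self, max_age=0.12, now=time.monotonic):
        self.max_age = max_age
        self.now = now          # injectable clock, handy for testing
        self.q = deque()

    def enqueue(self, packet):
        self.q.append((self.now(), packet))

    def dequeue(self):
        # Drop stale packets AFTER buffering, as the comment suggests.
        while self.q:
            stamp, packet = self.q.popleft()
            if self.now() - stamp <= self.max_age:
                return packet
        return None  # queue drained; everything left was too old
```

However deep the physical buffer, no delivered packet ever carries more than `max_age` of queueing delay, so TCP's RTT estimate stays sane even when the radio link slows down.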


This is related to firewall capacity. They have exhausted the throughput of the FW, as mobile sessions are short and eat up a lot of FW tunnels.


Over-the-air capacity isn't the problem, contrary to what you suggest. Backhaul is.

A typical base station: 3 sector antennas × 3 blocks of 5 MHz × HSPA 3.6 = 31.5 Mbit of over-the-air capacity.

Many 3G base stations have as little as a 2Mbit link (E1).

In a dual-band 21 Mbit HSPA base station: 3 sector antennas × 3 × 2 blocks of 5 MHz × HSPA 21 = 378 Mbit.

That's too much for a 155Mbit fibre and needs gigabit backhaul.
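Taking the commenter's figures at face value, the backhaul mismatch is simple arithmetic. (Note the quoted 31.5 Mbit total implies roughly 3.5 Mbit/s effective per carrier, slightly below the nominal 3.6; the numbers below are the commenter's, not official specs.)

```python
sectors = 3
carriers = 3                  # 5 MHz blocks per sector
per_carrier = 3.5             # Mbit/s effective, implied by the quoted 31.5 Mbit total

single_band = sectors * carriers * per_carrier     # 31.5 Mbit over the air
dual_band = sectors * carriers * 2 * 21.0          # 378.0 Mbit for dual-band HSPA 21

e1_backhaul = 2.0             # Mbit/s: a single E1 link
stm1_fibre = 155.0            # Mbit/s: the 155 Mbit fibre mentioned above

air_vs_e1 = single_band / e1_backhaul              # ~16x: air capacity dwarfs a lone E1
fibre_sufficient = dual_band <= stm1_fibre         # False: even 155 Mbit can't keep up
```

When the air interface can deliver an order of magnitude more than the backhaul can carry, the excess has to queue or be dropped somewhere behind the base station, which is exactly where deep buffers do their damage.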

John Bäckstrand

Enormous buffers with no "smarts" do not work for congested wired links, either. I was once using a 2 Mbit connection shared by ~150 student apartments; the buffers were not infinite, but I could easily tell the level of congestion from my latency to the outside world.

Either de-prioritizing bulk traffic (which requires knowing what is bulk) or removing the buffers would have helped. Buffers and zero packet loss are good for bulk transfers, but bad for pretty much everything else. If "everything else" simply bypassed the queue in the buffer, you wouldn't see the latency any more, and everyone would have near-optimal conditions. The problem, of course, is that it's hard to differentiate traffic in a way that always works well. Just downsizing the buffers is a lot easier.
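The bypass idea can be sketched as strict priority between two classes. This is a minimal toy, and it assumes away the genuinely hard part the comment identifies (classification), reducing it to a caller-supplied flag:

```python
from collections import deque

class TwoClassScheduler:
    """Toy strict-priority scheduler: 'interactive' packets bypass the
    (potentially deep) bulk queue. Classifying traffic correctly is the
    hard part in practice and is assumed away here."""

    def __init__(self):
        self.interactive = deque()
        self.bulk = deque()

    def enqueue(self, packet, is_bulk):
        (self.bulk if is_bulk else self.interactive).append(packet)

    def dequeue(self):
        # Interactive traffic always jumps ahead of the bulk backlog.
        if self.interactive:
            return self.interactive.popleft()
        if self.bulk:
            return self.bulk.popleft()
        return None
```

With this discipline, a DNS lookup or TCP handshake never waits behind seconds of bulk download backlog, which is exactly the latency effect described above.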


(Via /.) Thanks for the summary. This certainly brings back some memories (or perhaps nightmares?) from stochastic queuing theory!

PS: It's = "it is"; for possessive it, you want "its".


I am so glad to hear that I'm not the only one experiencing this! I have been using AT&T as my home connection for over a year now. Several times a day it goes from 250 K/s throughput to 10-50 second pings for up to 2 minutes, with no packet loss! Resetting the Cradlepoint always seems to clear this up, at least momentarily.

I went so far as to swap out every part of the setup (modem, router, amplifier, antennas) just in case my equipment was at fault. Nothing made a difference. This connection is really my only option out here in the boonies, and 70% of the time it works great. I hope AT&T can fix this, and I thank you for the detailed explanation!


ICMP packets are usually pushed down in priority on most out-of-the-box routers I know of; the results of these packets are unreliable on such systems, and thus ping should only be used for end-to-end testing.


Hopefully AT&T (and others) will use this information.
It could also be that the engineers actually configured the buffers correctly, but were then told by management to change them so they could impress their bosses with a "zero packet loss" configuration.


AT&T are rather deaf to the best practices of the other 750 GSM/UMTS carriers around the world. As a former GSM/EDGE engineer involved with the standardisation of the radio layer, I am sometimes stunned that it works at all. They don't apply the knowledge gained in other countries about buffer sizes and system parameters. The standards deliberately don't give guidance on default values (it's a European thing), but the vendors certainly do.

You could do with a bit more info on the issues happening at the radio interface. The RIL guarantees error-free, in-sequence delivery of higher-layer packets. It is designed this way because it has to combat the effects of fast fading (where the signal suddenly drops from turning a corner) and multipath (where the same packet arrives late or out of sequence after bouncing off a few more buildings). IP doesn't handle these well at all, and without the RIL operating as it does there would be no mobile internet. Making IP work over long-thin networks is really quite well understood.

The other physical issue that is inconvenient for TCP is that the radio connection is seriously asymmetric: the uplink is much slower than the downlink. Not a problem for UDP at all, but the uplink (mobile-to-network) TCP ACKs/NACKs are delayed and do cause a downlink performance bottleneck regardless of the network settings.

There are buffer settings in the network that affect retransmissions, which are especially important if you are located on a cell boundary. I strongly suspect that the location used in the article is on a boundary between two cells. There is a hysteresis parameter to stop phones constantly switching between cells (while you are switching there is no data transmission on GSM, and minimal speed with much higher latency on 3G), but as AT&T doesn't listen to their vendors, the values that govern this are not optimal. Buffers in the RNC are for 3G voice and data only, and do cause significant packet loss on the device when switching between EDGE and 3G. Unfortunately these buffers are not cleared on the RNC in this case; they have to time out. Thus there is a conflict: you run out of buffers and make them bigger, or you have egregious packet loss when changing cells.

On the device there is very little you can do to remedy this except to stay still and not change cells, oh, and reduce the MTU from the iPhone's default of 1500 to 1400. Not a huge change, I'm afraid, but just about the only thing you can do to improve performance. I doubt AT&T will suddenly start listening to anyone trying to improve their network.

John Engelhart

I can think of a potential alternate explanation.

There are two basic schools of thought on how a network should work: one is smart ends, dumb network, and the other is dumb ends, smart network. Debating which one is 'best' is like debating vi vs. emacs :). IP networking is in the "smart ends, dumb network" camp. Though it is an over-generalization, network designs that come out of telcos and committees tend to be of the "dumb ends, smart network" variety.

The "smart ends, dumb network" design of IP places very few constraints on what happens to packets in the network. Packets can be dropped, delivered out of order, or even duplicated. The end host needs to be smart enough to deal with all these potential problems.

The "dumb ends, smart network" designs typically "guarantee" (for some value of guarantee) the delivery of packets, and that they are delivered in the exact same order in which the source sent them.

It has been my experience that IP tends to interact very poorly when run on overly aggressive "dumb ends, smart network" layers. These networks try very hard to "hide" the underlying network problems, but in doing so they hide the very information that IP, and in particular TCP, needs in order to 'sense' that there is a problem with the network. When you have ping round-trip times in the ~8 second range, that means a packet stayed alive, somewhere, for that long... There are really only two plausible explanations: the first is the theory outlined here of "deep packet buffers", and the other is that "something" kept trying to deliver the packet, above and beyond what IP would normally do (i.e., a 'smart network').

Personally, I doubt very much that it's a problem of "deep packet buffers". Assuming that TCP is the protocol used for the majority of traffic, and that there aren't any large sources of TCP-"unfriendly" traffic, it's fairly unlikely the problem is deep packet buffers, because TCP tends to "self-clock" new packets into the network based on ACK'd data from the far end. Once TCP has filled the available window, it can't send any more data until it gets an ACK for some of the outstanding data. In short, it's pretty hard to keep 8 seconds' worth of packet buffers filled when all the end hosts aren't sending packets because they're waiting (up to 8 seconds) for the ACKs for the already in-flight packets. I haven't done the math, but I doubt very much 'the numbers' support a bandwidth-delay product with 8 seconds of delay and a best-case limit of ~5 Mbit/sec (the mobile tower air interface), not to mention that the network stacks in question probably haven't tuned their buffer sizes for high throughput under such conditions. (It wouldn't surprise me if the amount of buffering needed would exceed the total memory available in an iPhone.)
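The math alluded to above is quick to do. Under the commenter's assumed numbers (~5 Mbit/s air interface, 8 s of observed delay), the implied standing buffer is about 5 MB, far more than one unscaled 64 KB TCP window can keep full:

```python
rate_bps = 5e6               # ~5 Mbit/s best-case air-interface rate (assumed)
delay_s = 8.0                # observed ping RTT
buffer_bytes = rate_bps * delay_s / 8    # bits -> bytes of queued data

default_window = 64 * 1024   # classic TCP receive window without window scaling
flows_needed = buffer_bytes / default_window   # ~76 saturated flows to fill it
```

A single self-clocked TCP flow stalls long before 5 MB is in flight, which supports the commenter's doubt; on the other hand, roughly 76 concurrent window-limited flows could still fill such a buffer in aggregate, so the many-user case isn't ruled out by this arithmetic alone.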

While I don't have any first-hand experience using these networks, I get the impression that the problem is "chronic". If true, this is another indication that the problem isn't necessarily deep packet buffers. It's been my experience that a problem like "packet buffers that are too big" tends to manifest only at some inflection point of network usage: things work fine until peak busy time, when everything just collapses. This has a very tell-tale look in graphs, too; it tends to resemble the "clipping" you see in audio waveforms and has an almost crater-like appearance.

Glenn Brown

The same problem exists for many DSL customers: the DSL link is the bottleneck and ISPs often provide many seconds worth of buffering feeding the DSL link, leading to high latencies until TCP flow control kicks in... or continuously if the link is saturated with UDP traffic, which is not subject to TCP congestion control.

I hack around the high DSL latencies by creating an artificial bottleneck in my in-home router, policing the downlink to 95% of the DSL capacity. This triggers TCP congestion control before the buffers fill, and solves the problem for the typical TCP download scenario. Linux support for this approach is described at http://lartc.org/howto/lartc.cookbook.ultimate-tc.html . Sadly, most residential gateways have no such feature: I had to switch to the open-source "Tomato" firmware to police my downlink, and Tomato has no equivalent feature for the uplink. Still, it has solved my problems, because they were caused by TCP downloads, as most others' problems would be.
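The mechanism behind that 95% policer can be sketched as a token bucket. This is a toy model of the idea behind the `tc` recipe linked above, not the actual Linux implementation:

```python
class TokenBucketPolicer:
    """Toy policer: admit a packet only if enough tokens have accrued at
    `rate` bytes/sec (set to ~95% of link capacity). Excess traffic is
    dropped rather than queued, so TCP backs off before the ISP's deep
    buffer ever fills."""

    def __init__(self, rate, burst):
        self.rate = rate           # token refill rate, bytes per second
        self.burst = burst         # bucket depth, bytes
        self.tokens = burst        # start with a full bucket
        self.last = 0.0

    def admit(self, size, now):
        # Refill tokens for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True            # packet forwarded
        return False               # packet dropped: the early congestion signal
```

The early drop is the whole point: the loss signal reaches the TCP sender immediately, instead of after several seconds of queueing in the upstream buffer.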

While the approach presented here does not apply to UDP-saturated links, or to shared wireless links where you cannot influence others' streams through the buffer, the DSL problem affects many users, and very few understand it, as evidenced by AT&T, a major ISP, presumably getting it wrong.


I wonder if anyone on Rogers in Canada also experiences long ping times? Certainly, in Toronto, my iPhone has been very reliable: zero dropped calls so far, and up to 600 KB/s download speeds, or roughly 4-5 Mbps, during off-peak times (i.e., when I get lucky downloading 25 MB or more of data). Neither ping times nor uplink speed are that great in the Speed Test app; e.g., on a test at 11:19: 5.16 Mbps down, 0.13 Mbps up, 231 ms ping. But as far as I know, I've never had an 8000 ms delay! So I'd assume Canadian networks are "properly" configured. I wonder what they're doing that AT&T isn't?


This just isn't true!!! I have never seen an 8000 ms RTT while running ping tests; if this happens, it is definitely not due to the AT&T network. Even with AT&T's EDGE and 3G R99 data, the maximum ping time I have seen was about 450 ms (pinging www.nokia.com from my DOS prompt). With HSDPA the latency was about 70 ms. Although radio is the bottleneck when it comes to bandwidth, fragmentation causes the effective PPS capacity and throughput to degrade. Also, not all elements are able to handle jumbo frames.

Typical E2E latency from the UE to any server hosted within the AT&T network (with the UE in HS) would be about 60-80 ms at most.

AT&T's problems seem to be primarily due to their RAN capacity; core network elements are probably under-utilized, although depending on the vendor and the control-plane capacity to handle mobility management, they might need more RNCs as well. With smartphones increasing their population on AT&T's network, and applications in the UEs managing connectivity to the network (I'm not sure if AT&T already has features such as Cell_PCH and URA_PCH active in their RAN), the overall capacity the RNC can offer for various transactions decreases, as capacity and processing (CPU) are related.

Avoiding fragmentation in the network, allocating the right buffer size for the NEs (especially the GGSN), and optimizing the TCP window size to get the maximum throughput for a TCP application can all improve throughput. I highly doubt the claim that AT&T's wireless data congestion has been self-inflicted.


I work directly with the AT&T Data Core on their wireless network, and this all comes down to AT&T taking wireless engineers, throwing them in front of a router, and letting them configure the network. Sure, AT&T has some CCIEs, but the group of people actually making the changes are not network engineers.

I work with these people every day and they are terrible. AT&T Data Core needs to hire network engineers, not throw cellular engineers to the lions.


Very interesting. But what's the deal with these popups on all the links and images? Makes it impossible to scroll using a mouse!


This doesn't sound right to me. If you buffer a bunch of stuff at the last hop, then, as long as you really don't run out of room, there's no reason for the sender to go into congestion avoidance. Even if there were a problem, a lot of the core routers use RED, which should cause them to do congestion avoidance even if the GGSN is being a bad citizen. What the big buffer will do is to screw up the sender's round-trip time (RTT) computation, which can do weird things to the congestion window. More importantly, RTT is instrumental in fast-retransmit, and I have no idea what would happen there if RTT were unreliable. Probably something bad.

Another thing that could indeed screw things up with really deep buffers is if they didn't configure RED or something like it on the GGSN. If you have deep buffers and you congest by just dropping packets on the way in, then you can get tail-drop synchronization, which really can cause congestive collapse.
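For reference, the classic RED drop curve the comment refers to looks roughly like this. This is the textbook form of Floyd and Jacobson's scheme, not any particular router's implementation:

```python
def red_drop_probability(avg_queue, min_th, max_th, max_p=0.1):
    """Classic RED marking/drop curve: no drops below min_th, a linear
    ramp up to max_p at max_th, and forced drops beyond max_th.

    avg_queue is an EWMA of queue length, so transient bursts pass
    while sustained queues draw probabilistic, desynchronized drops."""
    if avg_queue < min_th:
        return 0.0
    if avg_queue >= max_th:
        return 1.0
    return max_p * (avg_queue - min_th) / (max_th - min_th)
```

Because drops are randomized across flows instead of hitting every flow at once when the tail of the queue fills, RED avoids exactly the tail-drop synchronization described above.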


So this might explain what's going on on the data side, but this doesn't explain why things are wonky on the voice side too.


