Monday 11 June 2007

Broadband purgatory

Any experienced network administrator will tell you: the worst problems are not when nothing works but when it sort of works, just not quite. So when I started having performance problems with my broadband connection 6 weeks ago, I knew I was in trouble. How was I to explain the problem to my ISP's helpdesk? I could download pages, even though they sometimes timed out, but I couldn't upload any files. So my flickr photo stream started to dry up.

Of course, my ISP's first reaction was to blame my equipment. As I had the same problem on a Power Mac G5 running OS X and an IBM laptop running Windows XP, with Firefox as well as Internet Explorer, and whether I was connected to the network wirelessly or with a cable, it quickly became apparent that the only potential culprit on my side was the broadband router. As I was due for an upgrade anyway, I bought a new one and proved that the problem persisted with two different routers.

Then followed the various requests for test outputs to see what could be wrong. It was all a very slow process: I could only run tests in the evening, while my ISP's helpdesk could only review them during working hours. So a routine set in: I run a test in the evening, they look at it the following day, they suggest something else, I run a new test, and so on. I quickly became frustrated and ended up writing a shell script to automate calls to traceroute followed by ping, in order to get some decent statistics. And, lo and behold! It then became obvious that, anywhere on my own network, I had no packet loss whatsoever, whereas as soon as I reached my ISP's first router, I had between 5% and 50% packet loss depending on packet size. Now, 5% loss is huge, definitely way too high for TCP/IP to function properly, and 50% loss doesn't even bear thinking about. The most likely culprit was now the broadband line itself. So my ISP passed the call to BT and, miracle, my connection now works like a charm! It only took 6 weeks to get there.
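
The real script is the one I'm putting on SourceForge, but the gist of it was something along these lines (a minimal sketch rather than the actual code; the target host, the ping count and the packet sizes are placeholders, and the flags shown are the ones common to the OS X and Linux versions of the tools):

```sh
#!/bin/sh
# Sketch: trace the route to a host, then ping every hop with a few
# packet sizes and pull out the "packet loss" line from ping's summary.
HOST=${1:-www.example.com}   # placeholder target
COUNT=20                     # pings per hop and per size

# traceroute -n prints one line per hop with the hop's IP in column 2;
# the "traceroute to ..." header goes to stderr, hence 2>/dev/null.
# Hops that time out show "*" and are skipped.
HOPS=$(traceroute -n "$HOST" 2>/dev/null | awk '$2 != "*" {print $2}')

for hop in $HOPS; do
    for size in 56 512 1400; do
        # ping's summary line looks like:
        #   20 packets transmitted, 19 packets received, 5% packet loss
        loss=$(ping -c "$COUNT" -s "$size" "$hop" 2>/dev/null \
               | grep 'packet loss')
        echo "$hop  size=$size  ->  $loss"
    done
done
```

Running traceroute first means the script adapts to whatever path the packets actually take, and varying the packet size matters because, as I found out, the loss rate depended on it. With output like this in hand, the boundary where the loss starts is obvious at a glance: 0% on every hop inside the LAN, then the damage from the first hop at the ISP onwards.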

It's scary how quickly we get used to facilities such as broadband, though. Those 6 weeks really felt more like 6 months of frustration. Every time I clicked a link or a button on a page, I had no guarantee that it would load properly. Interestingly enough, the pages most affected by the problem were those that depend on AJAX, GMail in particular. I suspect this is because, when a normal page fails to load completely, the only downside is missing images and suchlike: the page is still usable. But an AJAX page can be completely crippled when it can't load everything, because some essential functionality is missing. In fact, I saw exactly the same problem at work recently on my current project: a web application that depends heavily on JavaScript and is completely broken if some of the JavaScript source files fail to load properly. So, there's a moral for all AJAX developers out there: one of the rules to follow to build bulletproof AJAX is to ensure that your application still works, even if some of the code fails to load.
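
In practice, that means treating every externally loaded script as optional. Here is a minimal sketch of the idea (the names MyApp, ajaxSubmit and upload-form are all hypothetical, not from any particular framework): only wire up the AJAX behaviour if the code it depends on actually arrived, and let the plain HTML form submission serve as the fallback otherwise.

```javascript
// Progressive-enhancement sketch (hypothetical names throughout).
window.onload = function () {
    var form = document.getElementById("upload-form");
    if (!form) return;

    // MyApp.ajaxSubmit lives in a separately loaded script file; if
    // that file failed to load, the symbol is simply undefined and we
    // leave the form alone.
    if (typeof MyApp !== "undefined" &&
            typeof MyApp.ajaxSubmit === "function") {
        form.onsubmit = function () {
            MyApp.ajaxSubmit(form);  // enhanced, asynchronous path
            return false;            // cancel the normal submission
        };
    }
    // Otherwise the browser performs a regular form post: slower,
    // but the page still works.
};
```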

Following this, I've just created a project on SourceForge, where I will maintain the script I developed during this incident, assuming the project gets approved. If it does, I hope the tool proves useful to others, and if anybody can contribute by testing it on other operating systems such as Linux or Solaris and helping me make it generic enough to run on them, it'd be much appreciated.
