Friday, January 02, 2009

Major external routing issue

There is a major external internet routing issue affecting all broadband customers that started just after 10pm. We are still trying to identify the cause.

30 Follow-up Messages (Posted by AAISP Staff):

AAISP said...

We have now had two reports of severe routing problems to and from ISPs that do not involve us at all, suggesting this is a much more widespread issue than just our routers.

AAISP said...

The situation is still somewhat unclear - we are still trying to find the cause and rectify this ASAP.

AAISP said...

It looks like things are more stable now - but stable in the sense of totally broken rather than up and down like a yo-yo. However, we are now getting two core routers rebooted to try to restore services fully.

AAISP said...

Working theory is that something screwy has happened on the internet generally and it has left both of our external routers in a broken state somehow. The fact that we have many reports of other problems elsewhere fits with this. We'll probably know more when service is restored.

AAISP said...

One router is back, and we have connectivity. Restoring second router shortly.

AAISP said...

Both of the external-facing routers suffered the same fatal problem and we have some diagnostics which should help us identify the cause.

AAISP said...

OK, we do not know what triggered the problem, but it appears one of our core routers got into a state where it had kernel problems. We are investigating if this needs changing ASAP. However, the effect was to flap all routes on the internet so much that all routing was damped. This meant that even though we had a working router the service stopped working. We are looking through the effect in some detail to see if this can be avoided in future.
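For readers unfamiliar with route flap damping, here is a minimal sketch of the general idea: each flap adds a penalty to a route, the penalty decays exponentially, and the route is suppressed while the penalty is above a threshold. The numbers below are illustrative values commonly quoted as vendor defaults, not anything taken from our routers or our peers, and the class and function names are made up for this example.

```python
import math

# Illustrative parameters only (assumed, not from any real router config).
PENALTY_PER_FLAP = 1000   # added each time the route is withdrawn/re-announced
SUPPRESS_LIMIT = 2000     # at or above this, the route is suppressed
REUSE_LIMIT = 750         # below this, a suppressed route is used again
HALF_LIFE_S = 15 * 60     # penalty halves every 15 minutes


class DampedRoute:
    """Hypothetical model of one route as seen by a peer that damps flaps."""

    def __init__(self):
        self.penalty = 0.0
        self.last_update = 0.0
        self.suppressed = False

    def _decay(self, now):
        # Exponential decay of the penalty since the last update.
        elapsed = now - self.last_update
        self.penalty *= 0.5 ** (elapsed / HALF_LIFE_S)
        self.last_update = now
        if self.suppressed and self.penalty < REUSE_LIMIT:
            self.suppressed = False

    def flap(self, now):
        self._decay(now)
        self.penalty += PENALTY_PER_FLAP
        if self.penalty >= SUPPRESS_LIMIT:
            self.suppressed = True

    def is_usable(self, now):
        self._decay(now)
        return not self.suppressed


def seconds_until_reuse(penalty):
    """Time for a given penalty to decay down to the reuse limit."""
    if penalty <= REUSE_LIMIT:
        return 0.0
    return HALF_LIFE_S * math.log2(penalty / REUSE_LIMIT)
```

With these example values, three flaps within a few seconds are enough to suppress a route, and a penalty of 3000 takes two half-lives (30 minutes) to fall back to the reuse limit. That is why a route can stay unusable elsewhere for some time after the router announcing it is healthy again.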

AAISP said...

The problem appears to have happened again at 4am and we are working on it.

AAISP said...

The offending router has been restarted. It will be restarted again shortly with an older kernel.

AAISP said...

We are back on both routers now, and will be rebooting one to change to an older kernel shortly. This should be pretty seamless.

AAISP said...

We have reason to believe the issue we have seen is a kernel issue, and so this should solve the problem. However, we are also investigating why one of the routers having this problem was able to take out the whole network when the other router was apparently fine.

AAISP said...

We think we are getting closer to the cause, and are seeing an issue now too which we are looking into.

AAISP said...

We are just awaiting a call from the data centre.

AAISP said...

OK, we think we may now have found the cause of the original problem - only time will tell. This appears to be a problem with a pair of network switches in Telehouse which has led to a string of knock-on effects. Thank you all for your patience.

AAISP said...

Post mortem...

This looks like a switch loop issue which caused storms on the LAN, killing enough traffic to drop BGP sessions. This appears to have been the issue that has caused routing "blips" over the last few days and earlier in the month.

We suspected BGP was the cause and so had upgraded routers to the latest kernel, etc. It seems the latest kernel has a bug meaning that the route cache fails when all BGP routes drop and re-establish like this, and that led to last night's total outage with a key router claiming to be active and working but in fact not forwarding packets. We reverted the kernel at 6am after a second outage at 4am.

The loop seems to be caused by a disabled switch port actually being in use somehow. It has been physically disconnected to restore service now.

The outage today was a result of us disabling spanning tree on switches which should not have needed it (because the only looped port was disabled). This immediately made permanent the problem which had previously been intermittent, and so highlighted the underlying problem.
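As a rough illustration of why an unexpectedly live port matters once spanning tree is off, the sketch below walks a list of switch links and reports any link that closes a loop. The device names and cabling are hypothetical, not our actual layout, and real spanning tree chooses which port to block using bridge IDs and path costs rather than list order.

```python
def find_redundant_links(links):
    """Return the links that would close a loop if every link is forwarding.

    `links` is a list of (device_a, device_b) pairs. Uses union-find: a link
    whose two ends are already connected by earlier links forms a loop.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    redundant = []
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            redundant.append((a, b))  # both ends already connected: loop
        else:
            parent[root_a] = root_b
    return redundant


# Hypothetical topology: the last entry is a second cable between the same
# two switches, on a port that was believed to be disabled.
links = [
    ("switch-a", "switch-b"),
    ("switch-b", "router-1"),
    ("switch-a", "router-2"),
    ("switch-a", "switch-b"),
]
print(find_redundant_links(links))
```

With spanning tree running, a link like the last one would be put into a blocking state. With it disabled, broadcast frames can circulate around the loop indefinitely, which is the kind of storm described above.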

We are also reviewing exactly which kernel should be deployed. We'll schedule updates in due course.

In the longer term we have the FB6000 BGP router project well under way and expect to be replacing all of our routers later in the year. This will put things much more under control and allow a much cleaner router and switch arrangement than we have now, making switch issues less likely to cause problems.

As always every issue like this teaches us more and allows us to avoid problems in the future.

We'd like to thank everyone for their patience while we resolved this issue, and to thank my team for working through the night on this.

AAISP said...

Further investigation suggests it is a known kernel bug that we encountered last night. One of our routers is currently running an older kernel which is known to be stable and not have the bug. The other is now running a slightly newer one which is also known not to have the bug.

AAISP said...

We had two more blips. Both were very short, but they suggest that we have not actually managed to find the underlying cause yet. We are still investigating.

AAISP said...

Maybe there is more than one thing happening here. This was not the same as the issues last night. We still had clean routing to one of our routers but not a good link between the two data centres it seems.

AAISP said...

All working now, but being monitored carefully. We have changed one link which we put in last month in case that could be the cause too. This is very strange.

AAISP said...

We are continuing to monitor the network - no more issues since 18:40. We have alternative routing allowing us access to both data centres and monitoring of the interlink that we suspect at present. If this happens again we will have a clearer idea where it is.
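To give an idea of what this kind of monitoring boils down to, here is a minimal sketch that turns a series of probe results for one link into outage windows with durations. The sample data and the function name are invented for illustration and this is not our actual monitoring code.

```python
def outage_windows(samples):
    """Group failed probes into outage windows.

    `samples` is a time-ordered list of (timestamp_seconds, ok) pairs.
    Returns a list of (start, end, duration_seconds) tuples, where `end` is
    the first successful probe after the failure (or the last sample seen
    if the link has not yet recovered).
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                       # first failed probe of a window
        elif ok and start is not None:
            windows.append((start, ts, ts - start))
            start = None
    if start is not None:                    # still down at end of data
        last_ts = samples[-1][0]
        windows.append((start, last_ts, last_ts - start))
    return windows


# Hypothetical probe results for the interlink, one probe every 10 seconds.
interlink = [(0, True), (10, False), (20, False), (30, False), (40, True), (50, True)]
print(outage_windows(interlink))
```

Running the same calculation against each path separately (the interlink and the alternative route) is what lets a drop be pinned on one link rather than on the routers at either end.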

AAISP said...

OK, that was 90 seconds, and we have confirmed it is definitely the link between the two data centres that is dropping. We are contacting the link provider now.

AAISP said...

We are now seeing hardware errors on the interlink ports.

AAISP said...

We are waiting on a reply from the link providers now. The setup now should mean any issues recover right away, so if there is a further failure overnight it should only last a few seconds.

AAISP said...

Just to add, the issue is only affecting 20CN lines at present. 21CN lines go direct to the other data centre and should be unaffected. However, it may affect 21CN lines' ability to connect in the first place during an outage.

AAISP said...

It is also worth clarifying that we have located, identified, and actually fixed a number of issues with this today. The final step of the interlink problem looks like the root cause and should be something that can be solved quite quickly - though likely to be tomorrow now.

AAISP said...

The link provider has seen flaps on the link and we are awaiting a call now.

AAISP said...

Just to explain, this link between data centres was meant to be a short term link during the trial for 21CN links, to be removed when moving all lines over. The delays with BT IPStream connect mean we will have this for many more months and this is why it is so crucial.

AAISP said...

There will be an outage of a few seconds now while we reconfigure.

AAISP said...

The link suppliers have confirmed they are seeing link flaps and will be contacting the hardware vendors as this has affected more people than just us. We have reconfigured the link and should no longer suffer from this problem. Finally.

AAISP said...

The fact that this issue affected more than just us may explain why some other UK ISPs had routing issues at the same time - this was one of the things that made no sense when we thought it was a loop causing it. Seems the loop was a red herring after all. The issues with which kernel to use were important to resolve anyway as that was a serious issue which showed up really badly when this fault happened.

Thank you (again) for your patience.

Good night.