It is worth explaining the problem, as I am sure many people want to know.
In summary - one tricky problem which has hit us again and for which we have a workaround, and one simpler problem which we have definitely fixed.
We have deployed new L2TP LNS code. The reasons for this are various, and it has been planned for a while. The main driver is to allow us to add a lot of new features, but there were some issues with BT over the last month and we wanted to be able to sort those. The new code allowed various workarounds and additional diagnostics.
The new code has had two main issues.
1. Attack management.
Basically, with almost any system, it is possible to overload it with enough packets, especially small packets. The system is designed to handle this, but the handling was not quite right, and this caused watchdog errors. Front end handling was put in place quickly and that initial problem was solved.
However, this moved the problem on - the front end could handle the packets, but specific types of packets are handled by the control processor. Again, storms of packets were coming in far too fast for it. We have systems in place to limit traffic, but there was a small window where certain packets at just the right rate caused a watchdog error. With faster, slower or bigger packets it would not happen - it was packets at just the wrong rate. The initial step was a very crude limit to stop this. It was not elegant but it worked.
We have since put in a more elegant way of handling this, and today we discovered that it may be more elegant but it does not work. For now the crude limit is being reinstated.
What are these attacks though? Well, we have found out. They are not actually attacks at all. They are some sort of storm on the LAN. Thousands of small ICMPv6 packets bouncing around between switches in a fraction of a second. Sometimes. Not all the time. It is clearly a switch being stupid. We are planning to replace all of the switches involved over the next few weeks as this is clearly unacceptable, even if it has highlighted an issue.
This is why it has been impossible to reproduce the specific problem on the bench.
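To illustrate the idea of a crude limit, here is a minimal sketch in Python. It is not the LNS code: the class name, the limit of 100 packets and the 10 ms window are all assumptions made up for the example. It simply counts packets in a fixed time window and drops anything over the limit rather than passing it on to the control processor.

```python
class CrudeLimiter:
    """Hypothetical fixed-window limiter: at most max_packets per window."""

    def __init__(self, max_packets, window_us):
        self.max_packets = max_packets  # assumed limit, not a real LNS setting
        self.window_us = window_us      # window length in microseconds
        self.window_start = None
        self.count = 0

    def allow(self, now_us):
        # Open a fresh window if none is open or the current one has expired.
        if self.window_start is None or now_us - self.window_start >= self.window_us:
            self.window_start = now_us
            self.count = 0
        if self.count < self.max_packets:
            self.count += 1
            return True
        return False  # over the limit: drop the packet


# A storm of 1000 small packets arriving one per microsecond.
limiter = CrudeLimiter(max_packets=100, window_us=10_000)
passed = sum(limiter.allow(now_us=t) for t in range(1000))
```

In this sketch only the first 100 packets of the storm get through, and the counter resets once the 10 ms window has passed. A limit like this is blunt, as it also drops legitimate packets that arrive during a storm, but it is simple enough to be hard to get wrong.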
2. Monday's issue...
On Monday lunchtime we had both LNSs crash repeatedly. Thankfully this too was something we managed to track down. It was a simple matter of customers accessing the graphs produced by the LNS from a LAN with a very large MTU. Yesterday, having set my desktop machine to a large MTU and not set it back, I inadvertently tested that this was indeed the cause of the problem. The fix was simple and has been deployed.
What now?
Well, the management of these packet storms did not work correctly, and so we are issuing a 1GB peak time topup on all of the 21CN logins today. I really hope this will be the last.
We will be running newer LNS code on the other LNS, probably later today, and moving people over to it. This will have the cruder fix for this issue in place. We'll go back to the drawing board and work out how to fix it properly for a later deployment. We'll see if there is any way we can reproduce the cause on the bench as well.
I hope the explanation is useful.
Please do contact me on IRC to discuss in more detail.
Adrian
Director.
Thursday, July 02, 2009