Friday, July 03, 2009

[info] bug hunting

We are still trying to sort a network issue affecting 21CN services. We have clear evidence of some sort of network storm on the LAN in Telehouse, but we have been unable to identify the source. We suspect a switch issue, and we are planning to replace the switches with some shiny new Allied Telesis models. However, our equipment should not fail in light of this occasional nuisance. We have worked through a number of techniques for addressing the problem, which is fundamentally the same as handling a denial of service attack. Each fix so far has simply pushed the problem on to the next bit of code. In the past we have managed attacks well, and we should be able to sort this.

We have the development team (including me) working over the weekend, together with the testers on the test LNS, to get to the bottom of this. We are putting in even more diagnostics and even more rate controls at different levels in the system to ensure we can handle the problem without a restart. We are determined to get to the bottom of the issue.
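For anyone wondering what a "rate control" looks like in practice, here is a minimal sketch of one common technique, a token bucket. This is purely illustrative and is not the LNS code: the class name, the `rate` and `burst` parameters and the `count_passed` helper are all assumptions made up for this example. The idea is that normal traffic passes untouched, while a storm is capped at a fixed rate so the code behind it is never overwhelmed.

```python
class TokenBucket:
    """Generic token-bucket rate limiter (illustrative only, not the LNS code)."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = float(rate)      # tokens added per second (assumed parameter)
        self.burst = float(burst)    # maximum tokens that can accumulate
        self.tokens = float(burst)   # start full so a normal burst passes
        self.last = float(now)       # time of the last refill

    def allow(self, now, cost=1.0):
        """Return True if a packet arriving at time `now` may pass."""
        # Refill according to elapsed time, never going backwards in time.
        elapsed = max(0.0, now - self.last)
        self.last = max(self.last, now)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # over the limit: drop (or queue) the packet


def count_passed(bucket, arrival_times):
    """Hypothetical helper: count how many arrivals the bucket lets through."""
    return sum(1 for t in arrival_times if bucket.allow(t))


if __name__ == "__main__":
    bucket = TokenBucket(rate=10, burst=5)
    # Simulate a storm: 100 packets arriving at the same instant.
    print(count_passed(bucket, [0.0] * 100), "of 100 storm packets passed")
```

Applying limits like this at more than one level means that a storm which gets past one layer should still be contained by the next, rather than forcing a restart.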

Having issued a "stability" guarantee from 1st July, we have "paid out" by applying two extra 1GB peak usage top-ups to affected 21CN logins. I am sure this will not seem much to some of you, but this is very costly for us as a surprising number of people pay top-up usage each month so this can equate to many thousands of pounds. We don't really need this incentive - the quality and stability of our service is paramount and we have a reputation to regain for reliability.

This is made even more important by the fact that we hope to move all lines to the new 21CN links to us some time next month. A big task, and one that comes over a year later than we expected. It is essential we have 100% reliable LNSs for our service now and in the future. Do not worry - we will delay this if we do not have stable LNSs.

We do apologise, most sincerely, for the issues. We know that for many this is a matter of seconds of outage every few days. We know a few of you have had minutes of outage (an issue we are taking up with BT, as it should recover instantly; our end does!). We know that even a few seconds is not acceptable.

Testers on the test LNS (simply prefix your login with test- on 21CN) are invaluable in confirming that the beta releases of new LNS code are stable and working. You get all usage counting 50% for billing purposes. Feel free to join in if you wish. You can change your login back at any time.

I'll post follow-ups to this with details of the progress we are making.

Please do bear with us. If you are fed up, please do call or irc or email the specially set up mailing address we posted. If you leave, please do look again in a few weeks. You may find that you should have stayed, and we'll welcome you back.

And thank you to all of our customers. We appreciate your support in difficult times.

5 Follow-up Messages (Posted by AAISP Staff):

AAISP said...

Just to update you - there are two of us working on this at present. We are pretty sure we have it sussed, but we have said that before! We have a lot of checking to do. We'll almost certainly load new code again tonight, so expect a PPP restart between 2am and 3am.

AAISP said...

We still hope to release new code for initial testing later tonight. A lot of checking, code inspection and testing is being done first. We are really keen to get to the bottom of this. We have identified two likely causes for the failure and addressed both, but are still working on this to be sure. Two developers are still working on this even now.

AAISP said...

We have identified a further small number of problems which are likely to have contributed to the recent instability. We want to perform more testing before switching users to the latest software, so we shall not now be updating the main 21CN LNS overnight Sat/Sun. Testing and deployment of new software will continue tomorrow (Sunday).

AAISP said...

Further work overnight means we now have a new version of code. There is some more checking going on today to be sure we have not missed anything. In the meantime this is being loaded on the test LNS now and will be put in live use later today, with lines moving over to it overnight.

It is looking like the fix we did to handle high packet load was OK all along and that the problem was a slightly different one which we happened to pick up separately on Friday anyway. The symptoms were the same. This explains why it continued.

We have also identified a couple of very, very unlikely race conditions which we have not actually seen happen, but which have now been addressed.

Whilst we expect the new code to be stable, we are being cautious. The guarantee still stands.

I'll update later today with progress.

Thank you all for your patience.

AAISP said...

We are now running new code on b.gormless, and lines that reconnect will go on to this LNS. We will switch lines over around 2am.