This is the Message Centre for Jim Lynn

Service Unavailable

Post 1

Jim Lynn

Sometimes, programming is like fighting the Hydra - cut off one head, and two more take its place.*

Following the server upgrade, and the tweaks we made to support the traffic from 606 and others, the service has been performing pretty well. Most of the time. But there were occasions when we'd start giving 'Service Unavailable' messages or, before that, when it would just slow down to a crawl. The big problem has been that there isn't one single reason for all these faults. But here's roughly what we've found in the last few weeks (basically since the upgrade).

Problem 1: Search sometimes goes mad. We use the full-text search engine supplied with SQL Server. (I was about to write 'built in to SQL Server' but that wouldn't be entirely accurate.) It's OK. We've grown used to its foibles, and massaged it a bit to get the best results out of it, but it's not good. Before the servers were upgraded, it used to randomly decide to corrupt the indexes when there was a lot of disk activity going on, and on the old database server, we regularly reached the maximum disk bandwidth, especially when backing up. This was one of the primary drivers in the server upgrade.

Since the upgrade, the index hasn't been corrupted, but now we're seeing something that we hadn't seen on the old hardware. Occasionally, and randomly, a search query would cause the search engine to go into some kind of loop, consuming all the CPU on that thread. This wasn't immediately catastrophic, because the database server has four hyperthreaded CPUs, so it could cope. But if we were unlucky enough to have this happen three or four times (and as I said, there doesn't appear to be any particular query that makes it happen) eventually it would make the database server lock up until the search process was killed. This left the webservers unable to serve pages and the site dead. In these situations we just have to kill and restart search to fix it. So we now do that automatically, just in case.

The next problem we noticed was in our startup sequence. When the web server application starts up, it has to read a lot of static information from the database before it can start processing queries. This code used to be inefficient, and was often running multiple threads which were all reading the static information. We fixed the problems there so that only the first request will initialise the static data, and until it finishes, all other queries will get that strange Ripleyserver Config File error that comes out in html (I'm fixing that). Once the first request has finished, all subsequent requests can run as normal.

But our startup still wasn't right. If we ever had to restart the servers, we'd see the number of running requests quickly go higher and higher, and not drop down, often going over 100 requests processing. Once it gets to this many requests, the machine is spending so much time switching between requests that it can't cope, and it never recovers. So we put in monitoring which would restart the server if it got to 100 requests pending. This helped, but it still took a long time (several minutes) for the servers to settle down. There were a couple of reasons for this. First, when we start up, we don't have any pooled connections to SSO, so we have to suffer the strange 5 second delays that connecting to SSO entails. These mount up, leading to lots of requests, leading (once again) to slow performance.
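The pooling idea itself is simple enough to sketch (again in illustrative Python; the connect function here is hypothetical, standing in for the slow SSO handshake):

```python
import queue

class ConnectionPool:
    """Fixed-size pool: pay the slow connect cost once, up front."""

    def __init__(self, size, connect):
        self._idle = queue.Queue()
        for _ in range(size):
            # Each connect() is expensive (think: that 5 second SSO delay),
            # so we do them all at startup and then only ever reuse them.
            self._idle.put(connect())

    def acquire(self, timeout=None):
        # Blocks if all connections are in use, bounding concurrency too.
        return self._idle.get(timeout=timeout)

    def release(self, conn):
        self._idle.put(conn)
```

A request then acquires a connection, talks to SSO, and releases it, instead of paying the connect cost every time.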

A worse problem is the way we use XSLT to handle the different skins on different sites. Our XSLT stylesheets are pretty complex - getting on for 1MB of text. In order to transform XML into the HTML pages we serve, we have to:
1: take that 1MB of text
2: build an XML document from it
3: Create an XML Template from that document
4: Create an XSL processor from the template
5: use the processor to transform the XML.

Steps 1 to 3 take a non-trivial length of time, and we were finding that it became much worse when many templates were being created at once, which is what happens when the server starts up. We cache the template created in step three, so when we have them all cached, the transforms are very quick. But during startup, it became clear that there was a lot more work going on than should strictly be necessary. I found an article describing just these kinds of problems, which said that these actions can lock up under stress, but would eventually settle down, which is what we were seeing. Trouble is, it would take a few minutes for each server to settle down. And this would happen whenever we updated our stylesheets (which, with so many sites, happens quite regularly).
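The caching itself boils down to a pattern like this (a Python sketch, not our actual code; compile_stylesheet is a stand-in for steps 1 to 3):

```python
import threading

_templates = {}
_templates_lock = threading.Lock()

def compile_stylesheet(path):
    # Stand-in for steps 1-3: load the ~1MB of text, parse it into an XML
    # document, and build the reusable template from it. This is the
    # expensive part we only want to do once per stylesheet.
    return ("compiled", path)

def get_template(path):
    # Holding the lock across the compile deliberately serialises template
    # building, so a cold start can't kick off dozens of compiles at once.
    with _templates_lock:
        tmpl = _templates.get(path)
        if tmpl is None:
            tmpl = compile_stylesheet(path)
            _templates[path] = tmpl
        return tmpl
```

Steps 4 and 5 (creating a processor from the cached template and running the transform) are cheap, so they happen per request.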

So, in desperation, I decided to try the newer version of the XSLT component. We'd stuck with version 3 for years now, because in tests we found we were able to make version 4 crash completely given a particularly complex stylesheet. But version 4 is supposed to be much faster, so I thought it was time to try the latest version. I built two versions of the app, one with version 3, one with version 4. Then I tested it on a spare server, under simulated load. Version 3 would, as normal, take a long time (three or four minutes typically) before it settled down into the normal quiet pattern, which is what we tend to see on live. Then I tested the version 4 one, and I thought I must have made some mistake in the testing, because it settled down within 15 seconds (and during that time didn't seem to give any Service Unavailable errors). But I tested it again, with the old and the new version, and it was definitely right.

So we're now testing this code on four of our web servers, and tomorrow I'll look at the error logs. If they've had fewer SA errors than the unpatched ones, then we'll put that code on all of them, and hopefully things will become a lot more reliable.

Until the next two bugs show up which were hidden by this one.

* Literary allusion courtesy of Marvel Comics. What do you mean the Greeks used it first?


Service Unavailable

Post 2

Frankie Roberto

Hey, thanks Jim for such a detailed description of the processes you've been going through. I hope I never have to delve that deep into server processes...

I have been working with XSLT for some time now though (using PHP's Sablotron extension), and will be using it in my new job* (under ASP.net, or something). From memory, you're using a Microsoft XSLT parser, right?

I didn't quite understand the process you described:

1: take that 1MB of text
2: build an XML document from it
3: Create an XML Template from that document
4: Create an XSL processor from the template
5: use the processor to transform the XML.

In PHP, the process I use seems to go like this:

1. Create XML (using PHP with data from MySQL queries)
2. Create an XSLT processor (using PHP's xslt_create().)
3. Process XML, using XML and XSLT file (using xslt_process().)

Is this the same process? Am I missing something?

Frankie

* I'll be able to announce my job from Monday, when I've started. It does involve hitchhikers though...


Service Unavailable

Post 3

Jim Lynn

I forgot another problem, the one which was actually causing 90% of the recent Service Unavailable errors.

These would suddenly affect all seven web servers at the same time. They'd all give Service Unavailable errors, but when I checked our application logs, the servers weren't actually processing any requests. They were just sitting there, waiting to do something, and all the while batting back every request with the Service Unavailable error (which in this case should imply 'Server too busy').

These had started shortly after we'd set up a limit of 20 connections per server, meaning that if the server were already handling 20 requests, it would refuse any more with the SA error. We picked 20 because that matched the number of pooled SSO connections we had. And under normal use, even at our busiest times, we never saw the number of requests rise much higher than 10. So it should all have been fine. The reason for limiting connections in the first place was to help with the startup problem. Limiting to 20 connections meant the server was never able to get completely swamped by requests while starting up, so it settled down quicker.

But now we were getting these strange periods, usually a minute or so, where all servers gave SA errors. Yet they clearly weren't processing any requests. Our application logs would show one request completing, then a long gap, then requests come back in. It was most mysterious. We weren't hanging up, our database wasn't hanging up and yet we were too busy.

I thought perhaps it was a bug in IIS's connection limiting, but that wouldn't explain why it affected all servers at once.

Then I noticed that after this period of inactivity, there always seemed to be some requests in the log which had apparently taken a long time to complete - some as long as 30 or 40 seconds. So I checked the application logs to see why those particular requests were taking so long, and I discovered a very strange thing: our logs showed those requests taking a fraction of a second, yet the IIS log said they took 40 seconds.

Then I realised what must have been happening. Once our application has built the page to send back to the user, we pass that page back to IIS. Then we exit and wait for another request. So as far as we're concerned, there's no request pending. But IIS is still trying to send the result back to the client, so it keeps the connection open until it has finished. So if the client (in our case this is the BBC's head-end proxy servers) has stopped accepting data, IIS waits politely until it is ready. But for that time, it still has one connection open. If more than 20 of these connections are waiting for the proxies, then we'd start giving Service Unavailable errors.

So, another tweak we've made to this version is to limit the number of connections inside our application itself, rather than in IIS. This means we don't care if some connections are waiting for clients to receive data as long as the application itself has a steady stream of requests coming through.

And we've also upped the limit to 50 per server. We don't get close to this kind of load, but having a higher limit will help us cope when other things slow down.

Which reminds me of the other thing that can lead to Service Unavailable problems. We have to talk to the Single-Sign-On database to check that you are who you say you are. Now that we've got connection pooling for that, this is usually a very quick process, but sometimes (yesterday afternoon was an example) the SSO database starts running really slowly. We don't yet know why this is - the team that manage that server are still investigating - but it means that every page request we get is slowed down for several seconds or more by waiting for SSO. This meant, given the 20 connection limit, that we'd be giving SA errors. Sadly, there's no sensible way to work around this particular issue, so we have to hope they find out why it slows down.

And a final addendum to the post - I've just checked the servers to see how they're doing, and the first thing I noticed is that the ones running the fixed version are taking up a quarter of the memory of the old version. So the new XSLT component seems to be a massive improvement on the old one. Fingers crossed.


Service Unavailable

Post 4

Jim Lynn

"I didn't quite understand the process you described:

1: take that 1MB of text
2: build an XML document from it
3: Create an XML Template from that document
4: Create an XSL processor from the template
5: use the processor to transform the XML.

In PHP, the process I use seems to go like this:

1. Create XML (using PHP with data from MySQL queries)
2. Create an XSLT processor (using PHP's xslt_create().)
3. Process XML, using XML and XSLT file (using xslt_process().)

Is this the same process? Am I missing something?"

It's the same process. We can do it that way, but creating the intermediate template and caching that makes the process much quicker.

When we first launched the Ripley codebase, we were doing it the way you do it - parsing the stylesheet on every request. For the traffic we were getting, we didn't really notice the delays, but when we moved to the BBC we started caching the templates - it made a big difference at the time.


Service Unavailable

Post 5

Jim Lynn

That Hydra's back again. The new version of the XSLT component has a limit to the amount of recursion you can do. And since recursion is the only way to do loops in XSLT, any loops that hit that limit will fail. This was causing some long threads not to display because of the way we render the paging blobs.
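The usual workaround for a capped recursion depth is to recurse over halves of the range instead of one item at a time, turning depth n into depth log n. The templates in question are XSLT, not Python, but the shape of the problem and the fix look like this:

```python
def count_linear(n):
    # Linear recursion, depth n: the shape of a naive XSLT loop template.
    # With a depth cap, long loops like a big paging blob blow up.
    return 0 if n == 0 else 1 + count_linear(n - 1)

def count_halved(lo, hi):
    # Divide-and-conquer recursion, depth O(log n): the standard workaround
    # when the processor limits how deep recursion can go.
    if lo > hi:
        return 0
    if lo == hi:
        return 1
    mid = (lo + hi) // 2
    return count_halved(lo, mid) + count_halved(mid + 1, hi)
```

A range of 100,000 items needs recursion about 17 deep in the halved version, instead of 100,000 deep in the linear one.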

I've reverted to the old code (cue more Service Unavailable errors) until I can fix the XSLT so it works again.


Service Unavailable

Post 6

Jim Lynn

OK, let's try that again.

The new code is back up. I've fixed the blob rendering so it doesn't break. We'll see how it copes.


Service Unavailable

Post 7

Jim Lynn

That was interesting. I just got a Service Unavailable. Which shouldn't now be possible. So I checked the graphs, and the number of connections to the web servers was increasing dramatically, but the number of application requests being processed was staying flat. So obviously, whichever head-end server was causing problems was doing its thing just now.

And the reason I got Service Unavailable was that I had neglected to remove the 20 connection limit on one of the servers. All the others were fine, and didn't give a single error. So that, at least, proves that one part of our fix is working.


Service Unavailable

Post 8

Frankie Roberto

"It's the same process. We can do it that way, but creating the intermediate template and caching that makes the process much quicker."

Ah, I see, thanks. I'm surprised that the Sablotron extension hasn't knocked my host's server out (I've had a few traffic peaks), but there's never been a problem. The Movable Type comment script, on the other hand, has caused infinite problems...

Frankie
P.S This thread looks much better in points of view (apart from the missing 'reply' buttons). Can you add the quoting feature to h2g2 templates?


Service Unavailable

Post 9

Jim Lynn

I'd like to see the quoting stuff work, but it's a matter of resources. Jim and Natalie are the best people to ask about skin changes, as they have to ask for them in the first place. I only get involved with emergency fixes, usually.


Service Unavailable

Post 10

Jim Lynn

Just to keep this thread up to date, yesterday evening saw a total failure of the site for fifteen minutes (the time it took for Peta to notice, ring me, and for me to log in to diagnose the problem). This time it was giving 'Not Implemented' errors.

It was, once again, our old friend Search. One or more searches had caused the search service to lock up, which caused a knock-on effect to the rest of SQL Server, eventually blocking all requests to the database server. Stopping the search service made everything spring into life immediately.

Unfortunately, I can no longer reproduce this problem. I could do it last night, but the same combination of circumstances now works properly. So we're left with something which can happen randomly, at any time. Which I hate.

I've already set up a task to kill the search service once an hour. Perhaps it would be safer to kill it more regularly than that.


Service Unavailable

Post 11

Traveller in Time Reporting Bugs -o-o- Broken the chain of Pliny -o-o- Hired

Traveller in Time not searching
"Could you limit the 'Search' to use one processor?"


Service Unavailable

Post 12

Jim Lynn

In this case, it *was* only using one processor. But it was still locking up the database.


Service Unavailable

Post 13

Galaxy Babe - eclectic editor

The site seems to be behaving perfectly today.

Not had one "Service Unavailable".


Service Unavailable

Post 14

Jim Lynn

I'm glad to hear it.


Service Unavailable

Post 15

Whisky

A question for you Jim, from a complete programming novice...

Does software ever become so complicated that it becomes inefficient to keep fixing bugs (i.e. fixing one bug can create (or highlight) another somewhere else in the programming)?

Do you ever get to the stage when you've got to give up and start again from scratch or can you continue to develop one piece of software indefinitely?


Service Unavailable

Post 16

Jim Lynn

Giving up and starting from scratch introduces a whole new set of ways to fail. The thing about a pre-existing system is the gradual accumulation of bug fixes. A lot of the time, they won't even be documented, just fixed in the code.

We've rewritten the h2g2 codebase from scratch once already, and in doing so introduced plenty of bugs. Some of the programmers would tend to reimplement what they thought the system was doing, rather than examining the existing code and ensuring that *all* logic was replicated. So some features were missing.

There are often compelling reasons for starting from scratch - usually because you want to move to a different technology or platform - but to assume that doing so will always be a good thing is dangerous. There's always a significant cost, and the likelihood of significant problems during the migration process is extremely high.

We've been running the same codebase now for five years. It was reasonably well designed to begin with, so development hasn't been a matter of bodging things. In theory, you could continue development indefinitely, but tools change, become unsupported. In the web arena, even your base technology might become unsupported in time. So sometimes you might have little choice.

But starting from scratch just because you think the code is getting a little bit unwieldy is probably a bad idea.

