Journal Entries

Season's greetings

It's that time of year again, when we celebrate the birth of the saviour who came to us from the heavens, and who laid down his life for us.

But enough about Doctor Who.

My page is now adorned with this year's Christmas picture. We've been a bit slack this year, so it's very late. We use this picture to make cards, and rather than try getting our ageing colour printer to produce acceptable results, I thought I'd try getting prints made from a digital picture at one of the local photo printers. I tried Bonusprint, but the (rather harried looking) young man there said they would have to send them out to be processed, and wouldn't guarantee to get them back until Thursday. So we tried Boots, where we just put the CD into their machine, picked the right picture, chose 20 prints and that was all we needed to do. The prints were ready in an hour, and we were able to spend the rest of the afternoon making the cards, with plenty of help from the children. It's nice when high-street technology matches your expectations.

Happy Christmas everyone.

Discuss this Journal entry [10]

Latest reply: Dec 22, 2005

Welcome to Hollywood

I'm posting this from the Hollywood Roosevelt Hotel, on Hollywood Boulevard. Sadly, it's not because my starring turn in the Hitchhiker's movie was recognised by a talent scout. I'm here working. I'm going to the Microsoft Professional Developers Conference (PDC) for the next week, hoping to find out about all the new stuff we could use for DNA in the future.

My entry into the USA was interesting. The man on the Immigration desk saw the copy of Last Chance to See I was reading and asked me what it was about. When I explained, he looked very serious, and asked me if I knew about the North American Sealion. I said I wasn't aware it was endangered, so he told me very solemnly that he was going to tell me something about how we treat sealions, and that it would shock me. He then described the culling of seal cubs, something I knew about but hadn't somehow connected with sealions. He then said that this was one of the things that caused him to become a Wiccan. Which is, frankly, absolutely the last thing I would have expected to hear from a US Border Guard. But a pleasingly odd welcome to the US.

The flight had the compulsory celebrity. Paul Weller, off of The Jam, was there, and frankly, the blond highlights were a mistake. But it was good to see he had to queue with the rest of us.

Strange coincidences abounded. My seat number was 42H. Bernadette suggested that the check-in clerk might have recognised my Marvin T-Shirt, but she didn't seem to have.

And as I was waiting to board, someone walked past wearing an h2g2 T-Shirt. One of the old ones, too. I saw him again at the luggage claim, and I had to comment, at which point he said 'You're Jim Lynn.' It was IanG who I'd worked with at Computer Concepts, and who was here for the PDC too.

So, I've just had a couple of hours wandering around outside the hotel. Grauman's Chinese Theatre and the Kodak Theatre are just across the road, and you can see the Hollywood sign in the distance. I'd expected some anonymous hotel in the middle of nowhere, but this hotel was where the first Academy Awards were held, so I really am in the middle of Hollywood.

Now it's time for bed. It's only 8pm, but it feels so much later...

Discuss this Journal entry [7]

Latest reply: Sep 11, 2005

Service Unavailable

Sometimes, programming is like fighting the Hydra - cut off one head, and two more take its place.*

Following the server upgrade, and the tweaks we made to support the traffic from 606 and others, the service has been performing pretty well. Most of the time. But there were occasions when we'd start giving 'Service Unavailable' messages or, before that, when it would just slow down to a crawl. The big problem has been there isn't one single reason for all these faults. But here's roughly what we've found in the last few weeks (basically since the upgrade).

Problem 1: Search sometimes goes mad. We use the full-text search engine supplied with SQL Server. (I was about to write 'built in to SQL Server' but that wouldn't be entirely accurate.) It's workable - we've grown used to its foibles, and massaged it a bit to get the best results out of it - but it's not good. Before the servers were upgraded, it used to randomly decide to corrupt the indexes when there was a lot of disk activity going on, and on the old database server we regularly reached the maximum disk bandwidth, especially when backing up. This was one of the primary drivers of the server upgrade.

Since the upgrade, the index hasn't been corrupted, but now we're seeing something we hadn't seen on the old hardware. Occasionally, and randomly, a search query would cause the search engine to go into some kind of loop, consuming all the CPU on that thread. This wasn't immediately catastrophic, because the database server has four hyperthreaded CPUs, so it could cope. But if we were unlucky enough to have this happen three or four times (and as I said, there doesn't appear to be any particular query that makes it happen), eventually it would make the database server lock up until the search process was killed. This leads to the webservers not being able to serve pages and a dead site. In these situations we just have to kill and restart search to fix it. So we now do that automatically, just in case.
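That automatic kill-and-restart amounts to a simple watchdog policy. Here's a minimal sketch of the idea in Python - the function names, the 90% threshold, and the injected callables are all invented for illustration; the post doesn't show the real monitoring script:

```python
# Hypothetical threshold -- sustained CPU use that suggests a runaway query.
CPU_LIMIT_PERCENT = 90

def restart_search_if_stuck(get_cpu_percent, stop_search, start_search):
    """Kill and restart the search service if it looks wedged.

    The three callables are injected so the policy can be exercised
    without a real service behind it.
    """
    if get_cpu_percent() >= CPU_LIMIT_PERCENT:
        stop_search()    # kill the runaway search process
        start_search()   # bring it back with fresh state
        return True      # report that a restart happened
    return False
```

Run periodically, this trades a few seconds of search downtime for never having the whole database server lock up.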

The next problem we noticed was in our startup sequence. When the web server application starts up, it has to read a lot of static information from the database before it can start processing requests. This code used to be inefficient, often running multiple threads which were all reading the static information. We fixed that so that only the first request initialises the static data; until it finishes, all other requests get that strange Ripleyserver Config File error that comes out in HTML (I'm fixing that). Once the first request has finished, all subsequent requests run as normal.
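The "first request loads, everyone else gets an error until it's done" pattern can be sketched like this - a Python illustration only, with invented names (the error class stands in for that Ripleyserver Config File error, not DNA's real code):

```python
import threading

class NotReadyError(RuntimeError):
    """Stands in for the config error served while static data loads."""

class StaticCache:
    """Load static data exactly once; the first caller does the work,
    concurrent callers fail fast instead of duplicating the reads."""

    def __init__(self, loader):
        self._loader = loader          # the expensive database reads
        self._lock = threading.Lock()
        self._data = None
        self._loading = False

    def get(self):
        if self._data is not None:
            return self._data                  # normal path once initialised
        with self._lock:
            if self._data is not None:
                return self._data              # another thread just finished
            if self._loading:
                raise NotReadyError("static data still loading")
            self._loading = True               # this request becomes the loader
        try:
            data = self._loader()
            with self._lock:
                self._data = data
        finally:
            with self._lock:
                self._loading = False          # allow a retry if loading failed
        return self._data
```

The key design choice is failing fast rather than blocking: a hundred requests queuing behind one slow initialiser is exactly the pile-up described below.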

But our startup still wasn't right. If we ever had to restart the servers, we'd see the number of running requests quickly climb higher and higher without dropping back, often going over 100 requests processing at once. Once it gets to that many, the machine is spending so much time switching between requests that it can't cope, and it never recovers. So we put in monitoring which restarts the server if it reaches 100 pending requests. This helped, but it still took a long time (several minutes) for the servers to settle down. There were a couple of reasons for this. First, when we start up, we don't have any pooled connections to SSO, so we have to suffer the strange 5-second delays that connecting to SSO entails. These mount up, leading to lots of requests, leading (once again) to slow performance.
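The pending-request monitor is conceptually just a thread-safe counter with a threshold. A minimal sketch, with invented names (only the 100-request threshold comes from the text):

```python
import threading

class RequestMonitor:
    """Track in-flight requests; signal a restart past the threshold."""

    RESTART_THRESHOLD = 100   # the limit mentioned in the post

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = 0

    def request_started(self):
        with self._lock:
            self._pending += 1

    def request_finished(self):
        with self._lock:
            self._pending -= 1

    def needs_restart(self):
        with self._lock:
            return self._pending >= self.RESTART_THRESHOLD
```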

A worse problem is the way we use XSLT to handle the different skins on different sites. Our XSLT stylesheets are pretty complex - getting on for 1MB of text. In order to transform XML into the HTML pages we serve, we have to:
1: Take that 1MB of text
2: Build an XML document from it
3: Create an XML template from that document
4: Create an XSL processor from the template
5: Use the processor to transform the XML.

Steps 1 to 3 take a non-trivial length of time, and we were finding that it became much worse when many templates were being created at once, which is what happens when the server starts up. We cache the template created in step 3, so once we have them all cached, the transforms are very quick. But during startup, it became clear that there was a lot more work going on than should strictly be necessary. I found an article describing just these kinds of problems, which said that these actions can lock up under stress but will eventually settle down, which is what we were seeing. Trouble is, it would take a few minutes for each server to settle down. And this would happen whenever we updated our stylesheets (which, with so many sites, happens quite regularly).

So, in desperation, I decided to try the newer version of the XSLT component. We'd stuck with version 3 for years, because in tests we'd found we could make version 4 crash completely given a particularly complex stylesheet. But version 4 is supposed to be much faster, so I thought it was time to try the latest version. I built two versions of the app, one with version 3, one with version 4, then tested them on a spare server under simulated load. Version 3 would, as normal, take a long time (three or four minutes typically) before it settled down into the normal quiet pattern we tend to see on live. Then I tested the version 4 build, and I thought I must have made some mistake in the testing, because it settled down within 15 seconds (and during that time didn't seem to give any Service Unavailable errors). But I tested again, with both the old and the new version, and the result was definitely right.

So we're now testing this code on four of our web servers, and tomorrow I'll look at the error logs. If they've had fewer 'Service Unavailable' errors than the unpatched ones, we'll put that code on all of them, and hopefully things will become a lot more reliable.

Until the next two bugs show up which were hidden by this one.

* Literary allusion courtesy of Marvel Comics. What do you mean the Greeks used it first?

Discuss this Journal entry [16]

Latest reply: Aug 10, 2005

Unleash

When I worked at Computer Concepts, and we wrote applications for the Acorn Archimedes, people would often remark on their speed and ease of use, and wonder what techniques we used to achieve these results.

Someone (possibly Sean, U7, although it might have been Tim, U1) suggested we write an article in our in-house magazine revealing that our secret was that we used the secret, undocumented 'unleash' command in the ARM processor which unleashes the raw power of the ARM. But we never did.

I feel slightly as if we've just used the 'unleash' switch on the DNA servers. I don't know if anyone's noticed, but in the last few days, the servers have been running really, really fast. I compared logs from yesterday to logs from two weeks ago, and there's been a threefold improvement in response times - average response time has dropped from 2.13 seconds to 0.77 seconds. And, somewhat paradoxically, over the same period the number of requests has tripled (due to all the new messageboards moving to the DNA servers).

There's a slightly frustrating aspect to this, though. It goes back to when we moved to using Single Sign-On (SSO). For SSO we have to talk to the membership database to recognise you each time you fetch a page, which requires a connection to the SSO database. When SSO launched, we had 'connection pooling' code, which kept a pool of these connections and reused them as necessary, but on activating the service, the SSO database promptly fell over because it couldn't cope with all the connections the DNA servers were creating and holding open. So, just to get the site working again, we disabled the connection pooling so that we'd only create connections when we needed them. This solved the problem, and the site was back up and running.

However, I kept noticing that the connections to SSO would occasionally take a long time - several seconds. Not all the time, but enough to be annoying. But so far, it hasn't been a problem.

Recently, however, we've been worrying a lot about performance, so we put in some more monitoring, and discovered something rather odd. The graphs were showing a strangely regular pattern - a flat line for a time, then a spike, as requests were building up on the server, then a sudden drop as all those requests finish almost simultaneously.

We then put together a test program which simply made and broke connections over and over again, displaying how long each took. And it became clear that there was a regular delay: every 16 seconds, it would wait about 5 seconds before completing the connection. Every 16 seconds.
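A probe like that needs nothing more than a timing loop. Here's a minimal sketch of the shape of it - Python, with the connection factory injected, since the real program talked to the actual SSO service:

```python
import time

def time_connections(connect, attempts=20, pause=1.0):
    """Repeatedly make and break a connection, recording how long each took.

    'connect' is a placeholder for whatever opens the real connection;
    anything returning an object with a close() method will do.
    """
    timings = []
    for _ in range(attempts):
        start = time.monotonic()
        conn = connect()                            # make the connection...
        timings.append(time.monotonic() - start)
        conn.close()                                # ...and immediately break it
        time.sleep(pause)
    return timings

def flag_slow(timings, threshold=2.0):
    """Indices of suspiciously slow attempts (the ~5-second spikes)."""
    return [i for i, t in enumerate(timings) if t >= threshold]
```

Plotting `flag_slow` against attempt number is what would make a strict every-16-seconds pattern jump out.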

Now, this was slightly worrying. We were days away from moving the BBC's biggest and most fearsome messageboard onto DNA - 606, a sport board which, going by its current traffic, was likely to double our load. This, on top of the SSO delays, was threatening to cripple us.

So we went back to the connection pooling code we'd written originally, made sure it still made sense, and set the pool size to something reasonable so as not to overwhelm SSO (they'd increased their maximum number of connections several times since launch, so we were happy we wouldn't have a repeat of the first time).
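The essence of a bounded connection pool - reuse idle connections, never exceed a cap - can be sketched in a few lines. This is an illustration in Python, not DNA's actual C++ code, and every name in it is invented:

```python
import queue
import threading

class ConnectionPool:
    """Minimal bounded pool: reuse idle connections, cap the total.

    'factory' creates a new connection; the semaphore guarantees no more
    than max_size connections are ever live at once, which is what keeps
    the backing service (SSO, in this case) from being overwhelmed.
    """

    def __init__(self, factory, max_size):
        self._factory = factory
        self._idle = queue.Queue()
        self._slots = threading.Semaphore(max_size)

    def acquire(self):
        self._slots.acquire()                  # block if max_size checked out
        try:
            return self._idle.get_nowait()     # reuse an idle connection
        except queue.Empty:
            return self._factory()             # or open a fresh one

    def release(self, conn):
        self._idle.put(conn)                   # keep it open for reuse
        self._slots.release()
```

Reuse is what kills the per-request connection cost: after warm-up, `acquire` almost always returns an already-open connection instead of paying the multi-second connect delay.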

I tested the change on the staging server and it worked OK, so, very late at night (when server load is light), I put the changes live.

The result was dramatic. Suddenly, pages were returning immediately. All of them. Where there used to be delays, now pretty much every page was coming back immediately.

The graphs were showing that all the spikes we'd been seeing were gone. Which was nice. But that was late at night with quite low traffic. Would it still behave that way in the morning? Would it somehow be worse when the relentless hordes of footballdom descended later in the day?

It's been fine ever since. We watched Who's Online hit 1005 on Wednesday with no noticeable effect on speed. Then it got close to 1400 yesterday and only slightly less today, still with the servers zipping along.

I don't want to jinx it, since 606 haven't fully moved across yet, so there might be more load to come, but right now, it's looking quite promising. Even at our busiest time, the database server is only running somewhere between 5 and 10% CPU capacity. And all this is without applying all the database optimisations which (according to my tests) could speed up the database by a factor of two or three.

Discuss this Journal entry [8]

Latest reply: Jul 29, 2005

Emails of the rich and famous

It's one of the odder consequences of having worked with Douglas that occasionally I come into contact with actual, famous people. Now, being cool, calm and collected, I try not to gibber and drool uncontrollably, while muttering 'I'm your number one fan...' In fact, I'm usually coherent and reasonable, while my brain is silently screaming 'OMG! That's Him! That bloke off of Telly!'

So, when I received an email today from John Lloyd, TV Producer, old friend of Douglas Adams, creator of QI (A2181782), co-writer of 'The Meaning of Liff', co-writer of episodes 5 and 6 of the Hitchhiker's radio show (creator of the Haggunennons), co-creator of Spitting Image (A3700234) and Not the Nine O'Clock News (A401851), I couldn't help feeling a little excited. Perhaps he'd got wind of the amazing sit-com concept I've been mulling over? Sadly, no. In fact, he was after some help in a local project he and his wife are involved in - trying to raise money to build a stage in their local Village Hall. They've entered a local radio competition (you can see the finalists here: http://www.foxfm.co.uk/Article.asp?id=85146 - theirs is West Hendred Village Hall) and their success depends on people supporting their entry, so if you're local and think it's a worthwhile project (and I can attest that having a stage is a massive boon to the ways such a hall can be used) why not support them? All the details are on the web page.

If nothing else, it proves that famous people are real people too. Which I always find reassuring.

Now, about that sit-com...

Discuss this Journal entry [6]

Latest reply: May 19, 2005


Jim Lynn

Researcher U6
