This is a Journal entry by Jim Lynn
I've got a theory...
Jim Lynn Started conversation Dec 12, 2002
...it could be bunnies^H^H^H^H^H^H^H spiders.
I've just got the latest log files from the servers, and have started looking for the reason our servers are under such unexpected load when our weekly stats don't show anything out of the ordinary. As often happens, around the time the server fails there are plenty of requests from search engine spiders (Google being particularly busy). That would explain why our figures aren't increasing while our load is, but it wouldn't explain why it's worse since the upgrade (which it most definitely is).
Then, at about 2am this morning, I was talking to Bernadette about it, and she gave me the clue I needed to explain why the load has increased so much.
It's the legacy posts. All 700,000 of them. They all reappeared in one go when we upgraded, and so all the search engines suddenly have a ton of new links to follow in order to grab the whole site. Hence, much more spider activity, leading to servers getting overloaded.
Is there a solution? Probably. In the short-term, I can block access to the site to all offending robots. This would mean that during this time, Google would no longer be spidering the site, so we might drop off the radar.
A better solution is to do a special skin for spiders - if we detect a spider's user-agent, we can deliver a 'pared down' skin which only displays the content, and doesn't link to (for example) forums, or the autogenerated pages like Who's online. So personal spaces and articles (and the frontpage) would appear in the search engine results, but the tens of thousands of forum pages would never be linked to, so the spiders would never fetch them.
This way, our principal content pages would still appear in the search engines, but we wouldn't need to be spidered half as much.
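The spider-skin idea above boils down to a simple User-Agent check at request time. A minimal sketch of that check, in Python; the bot names, skin names, and the `choose_skin` interface are all illustrative assumptions, not h2g2's actual server code:

```python
# Hypothetical sketch: serve a pared-down "skin" to search-engine
# spiders by inspecting the User-Agent header. Bot substrings and
# skin names are assumptions for illustration only.

KNOWN_SPIDERS = ("googlebot", "slurp", "msnbot", "teoma")

def is_spider(user_agent: str) -> bool:
    """Return True if the User-Agent looks like a known crawler."""
    ua = (user_agent or "").lower()
    return any(bot in ua for bot in KNOWN_SPIDERS)

def choose_skin(user_agent: str) -> str:
    """Pick the content-only skin for spiders, the normal one otherwise."""
    return "spider-plain" if is_spider(user_agent) else "classic"

print(choose_skin("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # spider-plain
print(choose_skin("Mozilla/4.0 (Windows NT 5.0)"))             # classic
```

The point of doing it server-side is that the spider skin can simply omit links to forums and auto-generated pages, so a crawler never discovers those URLs in the first place, rather than fetching them and being told not to index them.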
We'll see how the short term fix helps over Christmas, anyway.
I've got a theory...
Bogie Posted Dec 12, 2002
Have you tried writing a robot exclusion file:
http://www.robotstxt.org/wc/norobots.html
It worked wonders when my institution added extra material to our teaching servers. Google (and the other offenders) were simply blocked from cataloguing certain pages. You could try blocking anything which begins with an F#### but still let the spiders catalogue all the A#### and C#### pages.
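For what it's worth, the scheme above would look something like this as a robots.txt file; the assumption (mine, not necessarily h2g2's actual URL layout) is that forum pages share a common /F path prefix:

```
# Hypothetical robots.txt sketch: block crawlers from forum pages
# while leaving articles and categories crawlable. The /F prefix is
# an assumption about how the forum URLs are laid out.
User-agent: *
Disallow: /F
```

One caveat: the original robots.txt standard only matches URL path prefixes, so this only works if the pages to be excluded share a common prefix; finer-grained filtering than that isn't part of the basic standard.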
Just a thought!
B.
I've got a theory...
Jim Lynn Posted Dec 12, 2002
The trouble with the noindex meta tag is that the server still has to serve the page. What I need to do is allow the spiders to look at article pages, and follow links between those, but not links to forums or auto-generated pages.
robots.txt might be an option - I'd need to enquire with our search team, but I'm not sure the filtering can be as sophisticated as I'd need it to be. It would be nice if it would, though.
I've got a theory...
Frankie Roberto Posted Dec 14, 2002
Did we ever find a fix for the problem of robots indexing pages in all the skins? i.e. you do a search on BBCi and h2g2 comes up three times, once for each skin.
I've got a theory...
Z Posted Apr 14, 2004
So did we ever come up with a solution to this then? I've noticed that for a while h2g2 hasn't come up on Google, when it used to be among the first two or three results.
I've got a theory...
Jim Lynn Posted Apr 14, 2004
We're still in Google - we don't block them any more. We don't necessarily get as good page ranking, though.
I've got a theory...
Jim Lynn Posted Apr 14, 2004
Probably because far fewer pages are indexed, so there are far fewer links. Or Google might have tweaked their ranking system. Difficult to say.