This is a Journal entry by Jim Lynn
I've got a theory...
Jim Lynn Started conversation Dec 12, 2002
...it could be bunnies^H^H^H^H^H^H^H spiders.
I've just got the latest log files from the servers, and have started looking for the reason our servers are under such unexpected load when our weekly stats don't show anything out of the ordinary. As often happens, around the time the server fails there are plenty of requests from search engine spiders (Google being particularly busy). That would explain why our figures aren't increasing while our load is, but it wouldn't explain why it's worse since the upgrade (which it most definitely is).
Then, at about 2am this morning, I was talking to Bernadette about it, and she gave me the clue I needed to explain why the load has increased so much.
It's the legacy posts. All 700,000 of them. They all reappeared in one go when we upgraded, and so all the search engines suddenly have a ton of new links to follow in order to grab the whole site. Hence, much more spider activity, leading to servers getting overloaded.
Is there a solution? Probably. In the short-term, I can block access to the site to all offending robots. This would mean that during this time, Google would no longer be spidering the site, so we might drop off the radar.
A better solution is to do a special skin for spiders - if we detect a spider's user-agent, we can deliver a 'pared down' skin which only displays the content, and doesn't link to (for example) forums, or the autogenerated pages like Who's online. So personal spaces and articles (and the frontpage) would appear in the search engine results, but the tens of thousands of forum pages would never be linked to, so the spiders would never fetch them.
This way, our principal content pages would still appear in the search engines, but we wouldn't need to be spidered half as much.
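The spider-skin idea above boils down to a simple User-Agent check at request time. A minimal sketch of that check, in Python; the bot names, skin names, and the `choose_skin` interface are all illustrative assumptions, not h2g2's actual server code:

```python
# Hypothetical sketch: serve a pared-down "skin" to search-engine
# spiders by inspecting the User-Agent header. Bot substrings and
# skin names are assumptions for illustration only.

KNOWN_SPIDERS = ("googlebot", "slurp", "msnbot", "teoma")

def is_spider(user_agent: str) -> bool:
    """Return True if the User-Agent looks like a known crawler."""
    ua = (user_agent or "").lower()
    return any(bot in ua for bot in KNOWN_SPIDERS)

def choose_skin(user_agent: str) -> str:
    """Pick the content-only skin for spiders, the normal one otherwise."""
    return "spider-plain" if is_spider(user_agent) else "classic"

print(choose_skin("Mozilla/5.0 (compatible; Googlebot/2.1)"))  # spider-plain
print(choose_skin("Mozilla/4.0 (Windows NT 5.0)"))             # classic
```

The point of doing it server-side is that the spider skin can simply omit links to forums and auto-generated pages, so a crawler never discovers those URLs in the first place, rather than fetching them and being told not to index them.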
We'll see how the short term fix helps over Christmas, anyway.
I've got a theory...
Bogie Posted Dec 12, 2002
Have you tried writing a robot exclusion file:
http://www.robotstxt.org/wc/norobots.html
It worked wonders when my institution added extra material to our teaching servers. Google (and the other offenders) were simply blocked from cataloguing certain pages. You could try blocking anything which begins with an F#### but still let the spiders catalogue all the A#### and C#### pages.
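For what it's worth, the scheme above would look something like this as a robots.txt file; the assumption (mine, not necessarily h2g2's actual URL layout) is that forum pages share a common /F path prefix:

```
# Hypothetical robots.txt sketch: block crawlers from forum pages
# while leaving articles and categories crawlable. The /F prefix is
# an assumption about how the forum URLs are laid out.
User-agent: *
Disallow: /F
```

One caveat: the original robots.txt standard only matches URL path prefixes, so this only works if the pages to be excluded share a common prefix; finer-grained filtering than that isn't part of the basic standard.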
Just a thought!
B.
I've got a theory...
Jim Lynn Posted Dec 12, 2002
The trouble with the noindex meta tag is that the server still has to serve the page. What I need to do is allow the spiders to look at article pages, and follow links between those, but not links to forums or auto-generated pages.
robots.txt might be an option - I'd need to enquire with our search team, but I'm not sure the filtering can be as sophisticated as I'd need it to be. It would be nice if it would, though.
I've got a theory...
Frankie Roberto Posted Dec 14, 2002
Did we ever find a fix for the problem of robots indexing pages in all the skins? i.e. you do a search on BBCi and h2g2 comes up three times, once for each skin.
I've got a theory...
Z Posted Apr 14, 2004
So did we ever come up with a solution to this then? I've noticed that for a while h2g2 hasn't come up on Google, when it used to be among the first two or three results.
I've got a theory...
Jim Lynn Posted Apr 14, 2004
We're still in Google - we don't block them any more. We don't necessarily get as good page ranking, though.
I've got a theory...
Jim Lynn Posted Apr 14, 2004
Probably because far fewer pages are indexed, so there are far fewer links. Or Google might have tweaked their ranking system. Difficult to say.