A Conversation for A Guide to Using Search Engines

The Grub project

Post 1

Peet (the Pedantic Punctuation Policeman, Muse of Lateral Programming Ideas, Eggcups-Spurtle-and-Spoonswinner, BBC Cheese Namer & Zaphodista)

An exciting development is "Grub's Distributed Web Crawling Project". Funded by LookSmart, it is a distributed application in a similar vein to the "SETI@home" client, but instead of using your spare processor time it uses your spare bandwidth to crawl URLs in batches of 500. It notifies the central server which URLs have changed since they were last crawled and forwards a snapshot of each page's HTML content. The idea is that, when fully deployed, it will be able to crawl and refresh ***every page on the web, every day***! The upshot of this would be an accurate search engine with no "expired" links.
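For anyone curious what that crawl-and-report cycle might look like, here is a minimal sketch in Python. It is not Grub's actual client or protocol: the names `crawl_batch`, `content_hash` and the injected `fetch` callable are hypothetical, and using a SHA-256 digest to decide whether a page has changed is an assumption made purely for illustration. The only detail taken from the description above is the batch size of 500.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Tuple

BATCH_SIZE = 500  # batch size mentioned in the post above


def content_hash(html: str) -> str:
    """Return a digest of the page content (assumed change-detection method)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def crawl_batch(
    urls: Iterable[str],
    previous_hashes: Dict[str, str],
    fetch: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Fetch up to BATCH_SIZE URLs and return (url, html) for changed pages.

    `fetch` is a hypothetical callable that returns the HTML for a URL.
    `previous_hashes` is updated in place with the latest digests.
    """
    changed = []
    for url in list(urls)[:BATCH_SIZE]:
        html = fetch(url)
        digest = content_hash(html)
        # A page counts as changed if it is new or its digest differs.
        if previous_hashes.get(url) != digest:
            changed.append((url, html))
            previous_hashes[url] = digest
    return changed
```

In this sketch only the list returned by `crawl_batch` would be sent on to a central server, so unchanged pages cost a fetch but no upload.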

Check out http://www.grub.org for details; if you have a 24/7 flat-fee broadband connection, you might consider running it overnight. It makes a good match with SETI@home for that sort of use: Grub uses a lot of bandwidth but very little CPU, while SETI@home (http://setiathome.ssl.berkeley.edu/) uses a lot of CPU but virtually no bandwidth.



Post 2

Martin Belam

We had a bit of a chat about this over on Collective [F99145?thread=268091].

The trouble with Grub is that the spider that actually collects the pages is pretty badly behaved. It doesn't obey robots.txt properly (robots.txt is the file webmasters use to tell search engines what they can and can't index), and it doesn't seem to have a proper way to avoid hammering sites by repeatedly requesting pages from them over a short period of time.
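To make those two complaints concrete, here is a rough sketch in Python of what a better-behaved spider would check before each request. It says nothing about how Grub itself is written. The robots.txt handling uses the standard library's `urllib.robotparser`; the `USER_AGENT` string, the `load_rules` helper, the `PolitenessTracker` class and the 5-second default delay are all hypothetical names and values chosen for illustration.

```python
import urllib.robotparser
from typing import Dict
from urllib.parse import urlparse

USER_AGENT = "example-crawler"  # hypothetical user-agent string


def load_rules(robots_txt: str) -> urllib.robotparser.RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class PolitenessTracker:
    """Tracks the last request time per host to enforce a minimum delay."""

    def __init__(self, min_delay: float = 5.0) -> None:
        self.min_delay = min_delay  # seconds between hits on one host (assumed)
        self._last_request: Dict[str, float] = {}

    def seconds_to_wait(self, url: str, now: float) -> float:
        """Return how long to wait before requesting `url` at time `now`."""
        last = self._last_request.get(urlparse(url).netloc)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record_request(self, url: str, now: float) -> None:
        """Note that a request to `url`'s host was made at time `now`."""
        self._last_request[urlparse(url).netloc] = now
```

A crawler built this way would skip any URL for which `rules.can_fetch(USER_AGENT, url)` is false, and would sleep for `seconds_to_wait(...)` before each request so that no single site gets hit repeatedly in quick succession.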



Post 3

Peet (the Pedantic Punctuation Policeman, Muse of Lateral Programming Ideas, Eggcups-Spurtle-and-Spoonswinner, BBC Cheese Namer & Zaphodista)

True. It's in beta at the moment... There seems to be a lot of coding activity going on, though, so the most anti-social behaviour is being weeded out first. If nobody signs up, they'll never have any incentive to improve it...

