
A   W E B M A S T E R ' S   V O I C E

One of the frustrations of being a print journalist is that you can never fit everything you discovered in researching a piece into the space available, and inevitably whatever you have written starts to go out of date as soon as it is finished. My article on the difficulties that search engines have in keeping up with the web was particularly problematic - I ended up writing 25 percent more than could fit into the magazine, yet there was still plenty left to say, and within days of finishing the article I heard about new technologies and techniques that could solve some of the problems I discussed.

 Fortunately, this web site gives me as much room as I need to fill in the gaps and can be updated as new information comes in, either from the people I talked to or from you, the public.

 There are two basic problems with covering the breadth of information available on the internet using a search engine. The first is the sheer size and speed of growth of the web, which has meant that hitherto only a small and decreasing proportion of it has been indexed by many search engines. As it turns out, though the three search engine representatives I spoke with all appeared to have more or less abandoned the notion of indexing the whole web, the last one I spoke with (too late to be included) was more optimistic.

 Eric Brewer, co-founder of Inktomi, the technology company behind Wired's HotBot search engine, believes that the way it handles its database of the web, splitting it between dozens of ordinary computers instead of a few very large machines, will make it easier for HotBot to index a larger proportion of the web. It already indexes 55 million pages (most of the other major search engines cover about 30 million) and is about to add "significantly" to the size of its database. HotBot also claims to be able to refresh its database at the rate of 10 million pages a day. But even if HotBot manages to index a substantial majority of the "visible web" and keep up with its phenomenal growth, a large amount of data will remain un-indexed because of the way the web has changed.

 As I discussed in my original article, quite a lot of information accessible through the web is not easily readable by a general web index - it may be behind a password barrier for various reasons (as our own web site was until April) or in a database which is only accessible by typing in a query. At present, search engine providers are quite relaxed about this - they argue that it is their responsibility to direct you to the right site for your interest, not necessarily the exact page you are looking for. They are just the first step towards finding the data you seek. As you will see in "A Webmaster Searches", however, some of the best stuff will probably never be indexed by conventional search engines. In the coming weeks I hope to publish comments on this issue from you and from the people I talked to in researching this piece. I also plan to offer you some hope for the future by outlining some of the approaches researchers are examining to make the internet easier to index. Stay tuned!

A webmaster searches:

To show you the kinds of tricks you can use, here is an example of the ways I use the internet to find what I need. In this case, I'm looking for biographical information on Colette, the early 20th Century French author (and one of my favorites).

 The fastest and probably the best way to start is to go to resources you already know - some of the largest sites are not indexed or only superficially indexed in search engines, so going straight to them is the only way to see if they contain the information you seek. In this case I know that Amazon.com and BarnesandNoble.com are two of the largest booksellers and that the book review databases of the New York Times and the Times are both online. As it turned out, the major book stores had no additional information about Colette except a list of her books and her full name (Sidonie-Gabrielle Colette). The New York Times, however, had reviews of five books by or about her, including a review of a biography which itself provides a brief outline of her life. Even after the extensive searching that followed, this was the best I found, and it would almost certainly not have been indexed by any of the search engines or directories. Even if they don't offer the answer themselves, bigger sites usually have an index of links to other, similar sites in their subject area.

 My next step, unless what I seek is obscure, is always Yahoo UK - readers in other countries should of course consult their own local Yahoos. Its database of sites is created by humans instead of computers, so although it is necessarily much smaller than a computer-generated index it is also much better organized. If you find a relevant site in its database you can easily find the categories where similar sites are listed - and because some sites are also rated, it can help you find the best ones quickly.

Sometimes, however (as with my search for Colette), it turns up nothing or not enough. Then it's time to consult the search engines. There are dozens of sites around which allow you to consult several at once, though few will automatically translate any Boolean terms (Colette AND French AND Author) so that they are sent to each engine in its own format, or process the results so that you don't see the same result repeated once for each search engine. MetaFind (one of the few that covers HotBot) and Metacrawler are the two which look most promising.
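The two jobs a good multi-engine search site has to do, translating a query into each engine's own format and removing repeated results, can be sketched roughly as follows. This is only an illustration: the engine names and their query syntaxes are hypothetical stand-ins, not the real formats used by any particular engine, and the URL normalization is deliberately simple.

```python
from urllib.parse import urlsplit

# Hypothetical syntaxes for "all of these words must appear".
# Real engines each have their own rules; these are assumptions.
ENGINE_SYNTAX = {
    "engine_a": lambda terms: " AND ".join(terms),
    "engine_b": lambda terms: " ".join("+" + t for t in terms),
}


def translate(terms, engine):
    """Render a list of required terms in one engine's query format."""
    return ENGINE_SYNTAX[engine](terms)


def normalize(url):
    """Reduce a URL to a comparison key: lower-case host plus path,
    ignoring the scheme and any trailing slash."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


def merge_results(results_by_engine):
    """Combine result lists, keeping only the first copy of each page."""
    seen = set()
    merged = []
    for urls in results_by_engine.values():
        for url in urls:
            key = normalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(url)
    return merged
```

Calling translate(["Colette", "French", "author"], "engine_b") gives the query in the second engine's made-up format, and merge_results collapses pages that two engines both return, so you read each one only once.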

 I still tend to stick to a single search engine for all but the most demanding searches, as each has its own unique capabilities and quirks which you can only use (or work around) if you practice regularly. Until recently the one I used most often was AltaVista. When it was launched it had the largest web database and a powerful set of searching options, it didn't take ads, and its graphics could be "turned off" by bookmarking the right page. Unfortunately its database has stagnated, and when Infoseek launched Ultraseek - with a somewhat larger and much more up-to-date database - I began to use that instead. When HotBot first launched I found its highly graphical interface unappealing, but its designers have since slimmed down the search page, and now that I know about its large database and expansion plans I intend to get to grips with it.

 No matter how good the search engine you use, you will still find it next to useless unless you compose your search carefully. You might think that with a highly specific technical word you can get a useful result from a single term, but even looking for "phthalate" in HotBot gives more than 2,000 matches. Amazingly, most people still use one or two search terms, which is far too few to narrow down a database of millions of documents to the ten or twenty most useful ones. It doesn't help that (according to Search Engine Watch) AltaVista, Excite, Lycos and WebCrawler will by default search for pages containing any of your search words rather than all of them, though they will usually list documents that match all the words first.
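The difference between matching any word and matching all words is easy to see on a toy collection. The sketch below is an assumption-laden illustration, not how any real engine works: the four "documents" are invented, words are split on spaces only, and ranking is simply the number of query words a document contains.

```python
def tokens(text):
    """Split text into a set of lower-case words (whitespace only)."""
    return set(text.lower().split())


def match_any(query, docs):
    """Documents containing ANY query word, best matches listed first."""
    words = tokens(query)
    hits = [d for d in docs if words & tokens(d)]
    # More matched query words means a higher position in the list.
    return sorted(hits, key=lambda d: len(words & tokens(d)), reverse=True)


def match_all(query, docs):
    """Documents containing ALL query words (a Boolean AND)."""
    words = tokens(query)
    return [d for d in docs if words <= tokens(d)]


# Invented sample documents, for illustration only.
DOCS = [
    "Colette Smith wins local baking prize",
    "short biography of the French author Colette",
    "French cooking for beginners",
    "guide to phthalate plasticisers",
]

any_hits = match_any("Colette French author", DOCS)
all_hits = match_all("Colette French author", DOCS)
```

Here the "any" search returns three of the four documents, with the biography at the top, while the "all" search returns the biography alone. On a database of millions of pages the same effect is what separates a usable list from thousands of irrelevant matches.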

 Search engines recognise this problem and have come up with a number of solutions - Excite offers a "more like this" function which lets you indicate which page in a search looks promising and tries to find similar ones in its database. AltaVista's "LiveTopics" tries to suggest related keywords which you might want to include or exclude from your search. In essence all these are trying to do is help you come up with more and better keywords to refine your search - something which you should be doing anyway. My best advice is to choose a good search engine and learn to use advanced queries. When I searched HotBot for "Colette" I found 10,758 pages and the first ten pages were totally irrelevant. When I looked for "Colette AND author AND French" I found 655 pages (though again the first thirty pages contained little of value). Once I entered Colette's full name - "Colette AND Sidonie-Gabrielle" - I narrowed it down to just 28 matches, most of which were of at least some use - including two brief biographies and a birthdate. But it didn't find the more detailed and useful information available on the New York Times' book review pages because its spider isn't allowed in to search them.
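I don't know exactly how the New York Times keeps spiders out of its book review pages (a registration barrier would have the same effect), but one widely used mechanism is a robots.txt file, which well-behaved spiders read before fetching anything. The sketch below uses Python's standard urllib.robotparser module; the robots.txt contents, the spider name and the example.com addresses are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every spider is asked to stay out of /books/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /books/
"""


def spider_may_fetch(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A made-up spider checks two made-up pages on the same site.
review_ok = spider_may_fetch(
    ROBOTS_TXT, "ExampleSpider", "http://example.com/books/review.html")
home_ok = spider_may_fetch(
    ROBOTS_TXT, "ExampleSpider", "http://example.com/index.html")
```

With these rules the spider may index the home page but not anything under /books/, so pages there never reach the search engine's database however good your query is.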


Find here further comments and suggestions on internet-related issues, from our webmaster and readers:

Can you believe the internet? - How can you find information you trust online? I share a few of my own rules of thumb and the readers share theirs 
© Copyright New Scientist, IPC Magazines Limited 1997