INDEXING
Created | Updated Oct 20, 2010
Lots of people have very wrong ideas about indexes and how they are produced. Here I am talking about back-of-the-book or document indexes, although you can also get indexes for journals, newspapers, photographs, and electronic media such as web documents, CDs and so on. I've been prompted to write this by a letter I've just seen in a computer magazine's help section. The writer wanted to know how he could get a word-processing program to index a text automatically. He claimed to be involved in the publishing industry, but possibly that was just a way of saying he delivers newspapers before school. The reply didn't question the usefulness of automatic indexing and proceeded to explain how one could produce a list of all the words used in the text, remove any words not required to be indexed (such as "a", "and", "the", "is"), and then get the word-processing software to index every occurrence of the words on the list.
What an Index is
What is being produced in the above example is called a concordance. In fact the writer who answered the letter actually said this but did not dwell on how a concordance differs from a "normal" index, although it is obviously an index of sorts. Concordances have been used in studying the Bible and in the analysis of literary texts. However, a book contains information, and the index is a guide to help the user find the information he wants. (Obviously we're talking about non-fiction and reference books here.) A concordance is no good for this, because the mere fact that a word occurs on a page does not mean that the page tells the reader anything about it. For example, a biography of a novelist may say that he travelled by car from A to B to meet his mistress. The subsequent encounter may well be worth indexing, but a reader seeing "cars" in the index and going to the indicated page is likely to be disappointed at finding no information about cars. So at the very least we need to be very selective about which words we index.
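The procedure described in the magazine reply can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular word processor does it: the function name, the short stop list and the page number 41 are all made up for the example.

```python
import re
from collections import defaultdict

# Hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "by", "from", "his", "he", "of"}

def concordance(pages):
    """Map every non-stop word to the sorted page numbers where it occurs.

    `pages` is a dict of page number -> page text.
    """
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in sorted(index.items())}

# Page number 41 is invented for the example.
pages = {41: "He travelled by car from A to B to meet his mistress."}
result = concordance(pages)
print(result["car"])  # [41]
```

The word "car" duly gets an entry pointing at page 41, even though the page says nothing about cars. That is exactly the weakness of a concordance used as an index.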
On the other side of the coin, entries in the index may not occur in the text at all. A page of a biography may refer to the subject's adultery without ever using that word. It would be entered in the index if the indexer thought that was the word that readers might choose to look up. The indexer might have to choose between "adultery", "love affairs", "liaisons" and so on, depending on his judgement of the readership. Of course they don't have to be mutually exclusive, and all three entries may occur.
A book on the life of Dick Turpin may give information about his horse on several separate pages, calling it a "horse" on one page, a "steed" on another, and "Black Bess" on a third. A simple concordance would index this as four separate items ("horse", "steed", "Black" and "Bess"), and a reader looking up "horse" would not find the information on the page where it is referred to as "steed". And "Black" and "Bess" would possibly be separated by many lines rather than recognized as two parts of a single concept. A human indexer would probably index all three bits of information under "Black Bess", with perhaps "see Black Bess" under "horses", "steed" being rejected as a word that a reader wanting information about horses would be unlikely to use. The opposite situation could arise where the same word occurs in a text with two different meanings, e.g. "jet" (the stone) and "jet" (the plane).
All this shows that the first task of an indexer is to select the terms to be used and to control the vocabulary. Automatic machine indexing, if it is to be really useful, demands a much higher level of sophistication than the simple sort of program that was being referred to at the start of this article. But note that the vast amount of material on the web requires that we develop some sort of automated technique to access it.
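To make the idea of a controlled vocabulary concrete, here is a minimal sketch in Python. It assumes the human indexer has already read the text and decided which mentions matter and which term each one belongs under; the software only records those decisions. The mapping tables, the function name and the page numbers are all hypothetical.

```python
from collections import defaultdict

# Hypothetical decisions made by a human indexer for a Dick Turpin biography:
# each wording found in the text is mapped to the preferred index term.
PREFERRED_TERM = {
    "horse": "Black Bess",
    "steed": "Black Bess",
    "Black Bess": "Black Bess",
}

# Terms a reader might look up that only point to the preferred term.
SEE_REFERENCES = {"horses": "Black Bess"}

def build_index(mentions):
    """Build index lines from (wording, page) pairs chosen by the indexer."""
    pages_by_term = defaultdict(set)
    for wording, page in mentions:
        pages_by_term[PREFERRED_TERM[wording]].add(page)
    lines = [
        f"{term}, {', '.join(str(p) for p in sorted(pages))}"
        for term, pages in pages_by_term.items()
    ]
    lines += [f"{term}, see {target}" for term, target in SEE_REFERENCES.items()]
    return sorted(lines, key=str.lower)

# Page numbers are invented for the example.
mentions = [("horse", 12), ("steed", 57), ("Black Bess", 103)]
index_lines = build_index(mentions)
for line in index_lines:
    print(line)
# Black Bess, 12, 57, 103
# horses, see Black Bess
```

Note that nothing here is automatic in the sense the letter writer wanted. The judgement about what to index and what to call it still comes from a person.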
Having chosen the terms to be indexed, we have the relatively mechanical task of ordering them, but even here there are many details to be taken into account. The indexing facilities in a program such as Word are not really adequate. Generally the order is alphabetic, but there is a choice between letter-by-letter and word-by-word. In the former, spaces are ignored; in the latter, a space has a lower value than (comes before) any letter, e.g.
word-by-word    letter-by-letter
fig apples      fig apples
fig trees       fights
fights          figs
figs            fig trees
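The two orderings can be expressed as sort keys. The following Python sketch is one simple way to do it under the rules just described (spaces ignored for letter-by-letter, a space sorting before any letter for word-by-word); the function names are my own, and it ignores punctuation, numbers and accents entirely.

```python
def word_by_word_key(entry):
    """Compare entries one word at a time, so a space sorts before any letter."""
    return tuple(entry.lower().split())

def letter_by_letter_key(entry):
    """Ignore spaces and compare the remaining letters as one string."""
    return entry.lower().replace(" ", "")

entries = ["figs", "fights", "fig trees", "fig apples"]

word_order = sorted(entries, key=word_by_word_key)
letter_order = sorted(entries, key=letter_by_letter_key)

print(word_order)    # ['fig apples', 'fig trees', 'fights', 'figs']
print(letter_order)  # ['fig apples', 'fights', 'figs', 'fig trees']
```

Word-by-word keeps all the "fig ..." entries together before "fights", while letter-by-letter lets "fights" and "figs" slip in between them.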
We also have to be clear how we deal with numbers (Roman and Arabic), hyphens and other punctuation, special characters such as slash and ampersand, accented or foreign characters, personal names, and geographic names. Try to index just a few pages of a computer or technical manual, a medical, legal, or historical text and you are likely to come across some such problem.
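As a rough illustration of how such decisions end up in software, the Python sketch below builds a word-by-word sort key under one hypothetical set of house rules: accents are ignored, an ampersand files as "and", and a hyphen is treated either as a space or as nothing. These rules and the parameter names are assumptions chosen for the example; real indexing conventions differ from publisher to publisher, and numbers and personal names would need further rules of their own.

```python
import unicodedata

def sort_key(entry, hyphen_as_space=True, ampersand_as="and"):
    """Return a word-by-word sort key under one hypothetical set of house rules."""
    # Decompose accented letters and drop the combining marks, so "É" files as "E".
    decomposed = unicodedata.normalize("NFKD", entry)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # File an ampersand as a spelled-out word.
    text = text.replace("&", f" {ampersand_as} ")
    # Treat a hyphen either as a word break or as if it were not there.
    text = text.replace("-", " " if hyphen_as_space else "")
    # Drop any remaining punctuation, keeping letters, digits and spaces.
    text = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return tuple(text.split())

print(sort_key("Zola, Émile"))                    # ('zola', 'emile')
print(sort_key("e-mail"))                         # ('e', 'mail')
print(sort_key("e-mail", hyphen_as_space=False))  # ('email',)
print(sort_key("R&D"))                            # ('r', 'and', 'd')
```

Changing a single rule, such as how hyphens are treated, can move an entry to a different place in the index, which is why these choices need to be settled before the sorting starts.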
INDEXING SOFTWARE
Once the indexer has chosen the terms and made decisions about the points above, he or she can then turn to software to help produce the index. Indexing software does exist which is flexible enough to produce indexes in a variety of formats and to let the indexer set preferences for exactly how the ordering is to be implemented. Some links are given below.