
Creating book indexes in the age of ebooks

If you’ve bought a print copy of The Economy in the last few months, you won’t have noticed that the index at the back is special. We don’t want you to notice anything unusual about it. To us bookmakers, however, it represents a game-changing approach to creating indexes:

You can use the same index in the free website edition, where there are clickable links at each entry instead of page numbers.
No one had to list the page numbers in the print edition manually. When we update the book, the page numbers in the index are generated automatically.
We now compile the index while the book is being produced – not at the end of the publishing process, in that awful time-crunch before going to print.

I’d like to explain why that’s special, and how we did it. Read on for as long as you’re curious! Towards the end of this article, the details get more technical.

What makes indexes special?

A good index – the long list of terms and page numbers at the back – is wonderfully useful. A well-crafted index can be interesting to read even when you’re not looking for something specific.

Good indexes are created by skilled, professional indexers. They do not just list words they find in the book. Rather, they create a taxonomy of concepts: a map of ideas in the book. For example, a good indexer might include an entry for ‘democracy’ even if the word ‘democracy’ is never used explicitly. This is why good indexes cannot be generated by computers. A good index is a work of literature in itself.

For decades, the American Society of Indexers has given awards for excellent indexing, to recognise the craft involved. The UK’s Society of Indexers has published The Indexer: the International Journal of Indexing since 1958. Indexing is a big, serious part of book publishing.

How are indexes made traditionally?

Traditionally, the indexer works from a PDF of the book’s final layout. Many indexers will work on paper, marking the pages with coloured pens. Others will annotate a PDF on screen. They’ll read the book closely, developing and reworking their taxonomy, and noting page numbers for each concept manually. Some will use special software to compile their final index, while others will use Word or Google Docs or a spreadsheet.

That works well if the page layout is fixed, but there are challenges:

The indexer only gets to do their work at the end of the production process. Almost always, the project is already running late by that point, and they have to work under tremendous time pressure. This means most indexes are not as good as they could have been.
Their work is only useful for that particular page layout. There’s little point including a list of page numbers in an ebook, where text reflows and each ‘page’ is just the text that fits on the screen at that font size. The same applies to alternative print editions (e.g. a schools’ edition, a large-print edition, or a regional adaptation), where page numbers are different.
When the publisher inevitably makes corrections or updates the book, someone has to manually check whether text reflowed, and update page numbers accordingly. Many boring internships have been spent updating page numbers in indexes.

As a result, indexes have only been available to readers in print (or a PDF equivalent). In ebook and website editions, we have had to rely on search, which is a poor substitute when you’re looking for concepts rather than exact phrases.

Locations, not page numbers

The way forward, then, must lie in replacing page numbers with digital locations. Rather than figuring out page numbers for, say, ‘democracy’, we need to define:

At what point exactly, between one character and the next, is the concept first discussed?
At what point exactly does that discussion end?

If we can mark those locations invisibly in the text of the book, our index can refer to those locations.

Then, when we create a PDF for print, software figures out the page number for a given location, and inserts that number into the index. In reflowable formats (epub, web, app), instead of page numbers, that software inserts clickable links that jump to the relevant locations.
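To make that concrete, here is a minimal Python sketch of the idea. It is not the Electric Book implementation: the function name, the anchor IDs and the `page_of` lookup (a mapping from each marked location to the page it landed on in the PDF) are all assumptions for illustration.

```python
def render_locators(anchor_ids, output_format, page_of=None):
    """Return the locators shown next to one index entry.

    anchor_ids: IDs of the invisible markers for this entry (hypothetical).
    output_format: "pdf" for print, anything else is treated as reflowable.
    page_of: assumed mapping of anchor ID -> page number, known only
             once the PDF has been laid out.
    """
    if output_format == "pdf":
        # Print: look up the page each marker landed on,
        # listing each page once, in ascending order.
        pages = sorted({page_of[anchor] for anchor in anchor_ids})
        return [str(page) for page in pages]

    # Reflowable formats: no pages, so emit numbered links that
    # jump to each marked location instead.
    return [
        f'<a href="#{anchor}">{number}</a>'
        for number, anchor in enumerate(anchor_ids, start=1)
    ]


# The same entry, rendered for two formats.
holmes = ["holmes-sherlock-1", "holmes-sherlock-2"]
print(render_locators(holmes, "pdf", {"holmes-sherlock-1": 12, "holmes-sherlock-2": 47}))
print(render_locators(holmes, "web"))
```

The point of the sketch is that the index entry itself never changes. Only the final rendering step differs between print and reflowable editions.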

There are several advantages to this approach:

We can index early in the production process, rather than waiting for page layout to be finalised.
The index can be useful during the writing process. This especially helps multi-author teams on large books, where contributors need to look up where concepts have already been addressed.
When you make updates or corrections that cause text to shift, you don’t need to check all the page numbers in the index.
The index’s wonderful taxonomy of concepts is now available to readers in all formats.
When you translate a book into another language, you can translate the index at the same time, and have it work immediately.

We call this approach dynamic indexing, because the page numbers change along with the layout.

How we make dynamic indexes

In our workflow, we store book content as plain text. Specifically, we format that text following a convention called markdown. Our Electric Book software then processes that markdown, along with design stylesheets, to generate a PDF, an ebook, and a website, as needed.

When we first started experimenting with dynamic indexing, for every location of a concept we had to manually create:

An HTML anchor tag in the book text, with a unique ID.
A hyperlink in the index pointing to that ID.

A textbook index might include thousands of entries, each with several page numbers (i.e. locations), so this manual approach was too labour-intensive to be feasible. After experimenting over several years, we arrived at a way to accomplish this efficiently. There are now two manual steps:

Mark the locations of concepts in the text in a quick, easy way.
Create a list of those concepts at the back of the book.

Our software then generates the necessary tags and links that render as the index in the finished book.

Marking locations

To mark the location of each indexable concept, we add tags to our markdown. (Technically, we’re using HTML’s comment syntax.) For example, if we were indexing the Sherlock Holmes story A Scandal in Bohemia, we might include an indexing tag before a paragraph about the two main characters. This is what that would look like in our markdown:

<!--index:
Holmes, Sherlock
Adler, Irene
-->

To Sherlock Holmes, she is always THE woman. I have seldom heard him mention her under any other name.

In the final index, the link in web, app and epub formats, and the page number in PDF, will point to the start of the paragraph, ‘To Sherlock Holmes…’.

We can also put tags inside a paragraph, to be even more precise about the location of a concept:

To <!--index: Holmes, Sherlock --> Sherlock Holmes <!--index: Adler, Irene --> she is always THE woman. I have seldom heard him mention her under any other name.
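For the technically curious, here is a rough Python sketch of how tags like these could be turned into anchors with unique IDs. It is an illustration only, not the Electric Book code: the function names, the ID scheme (a slug of the entry plus a counter) and the empty `<a>` anchors are assumptions.

```python
import re
from collections import defaultdict

# Matches both the multi-line and the inline form of the index tag.
INDEX_TAG = re.compile(r"<!--\s*index:(.*?)-->", re.DOTALL)


def slugify(term):
    """Turn 'Holmes, Sherlock' into 'holmes-sherlock' (hypothetical ID scheme)."""
    return re.sub(r"[^a-z0-9]+", "-", term.lower()).strip("-")


def add_anchors(markdown):
    """Replace each index tag with anchors.

    Returns the new text and a dict mapping each index entry
    to the anchor IDs that mark its locations.
    """
    targets = defaultdict(list)

    def replace(match):
        anchors = []
        # One entry per line inside the tag; skip blank lines.
        for line in match.group(1).splitlines():
            term = line.strip()
            if not term:
                continue
            # Number each occurrence so every ID is unique per entry.
            anchor_id = f"{slugify(term)}-{len(targets[term]) + 1}"
            targets[term].append(anchor_id)
            anchors.append(f'<a id="{anchor_id}"></a>')
        return "".join(anchors)

    return INDEX_TAG.sub(replace, markdown), dict(targets)


text = (
    "To <!--index: Holmes, Sherlock --> Sherlock Holmes "
    "<!--index: Adler, Irene --> she is always THE woman."
)
new_text, targets = add_anchors(text)
print(new_text)
print(targets)
```

The `targets` dictionary is what an index page would need in order to build its links. A real implementation would also have to handle entries whose slugs collide, and books split across several files.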

We also have ways to tag nested entries. For example, the index for our Holmes story might list ‘marriage’ as an entry, and then ‘unhappy’ and ‘secret’ as subentries beneath it.

And we have ways to mark the start and end of a concept range. For instance, in a book on French literature, several pages might be dedicated to discussing the work of Albert Camus. The index in print might read, ‘Camus, Albert 25–28, 230–42’. In a reflowable format, where there are no page numbers, those ranges would show as numbered links instead: ‘Camus, Albert 1–2, 3–4’.
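As a small illustration of the print side, the Python sketch below formats page ranges the way the Camus example reads. The abbreviation rule is an assumption chosen to reproduce ‘230–42’: it drops the digits that the end page shares with the start page, but always keeps at least two. House styles differ, and the function names are hypothetical.

```python
def format_range(start, end):
    """Format one page range, e.g. (230, 242) -> '230–42'."""
    first, last = str(start), str(end)
    if start == end:
        return first
    if len(first) == len(last) and len(first) > 2:
        # Drop leading digits shared with the start page,
        # keeping at least the final two digits (assumed rule).
        shared = 0
        while shared < len(first) - 2 and first[shared] == last[shared]:
            shared += 1
        last = last[shared:]
    return f"{first}–{last}"


def format_entry(term, ranges):
    """Join an entry and its page ranges into one index line."""
    return f"{term} " + ", ".join(format_range(s, e) for s, e in ranges)


print(format_entry("Camus, Albert", [(25, 28), (230, 242)]))
```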

If you’re curious about the technology involved, our documentation has more detail.

Creating the index list

We do still want a professional indexer to do the work of designing the taxonomy of concepts. So they will still produce that list, with all its nested entries and ‘see also’ cross-references.

The difference now is that they do not need to include any page numbers as they work. Instead, they add concept tags to the book content.

An advantage of this is that they can start work long before the production process is complete – for instance, as soon as copy-editing is done, and before design and page layout. They can even do that work while someone else works on the design and layout. In one of our current projects, the authors are adding concept tags as they write. That way, the index develops along with the content, and can be refined by a copy-editor or professional indexer.

The future of indexes

A couple of years ago, as readers and publishers worried about the future of print, we also worried about the future of the book index. We wondered: as ebooks and web books reduce the importance of print, could the art and science of good indexing be lost?

We had nothing to worry about. Indexes are a fundamental part of good non-fiction, and now they can be created, efficiently, for all book formats. They remain as useful and important as ever.

Arthur Attwell 25 October 2022

Image by Ian Panelo, Pexels