XML primer

XML is the most used and least understood term in publishing. So here’s our brief primer for publishers.

What is XML?

XML is computer code for labelling content. XML is a generic term; there are many kinds of XML for labelling different kinds of content.

The many kinds of XML have one thing in common: they use tags in elbow-brackets to label content:

<greeting>Hi!</greeting>

They mark up that content, which is where the ‘ML’ comes from: markup language. We’ll get to the ‘X’ in a moment. Computers follow the markup labels – or tags – when processing the content.

Why is this useful?

Because, by using consistent tags, we can teach computers to do something specific with each tag. We can make content machine-readable.

For instance:

<debit>500</debit> could tell a computer to take R500 off my credit card.

In these three lines of XML

<title>The Great Gatsby</title>

<price>$5</price>

<source>famousbooks.com</source>

we have a simple, one-item book catalogue that can be shared with many booksellers’ ordering software.
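To make ‘machine-readable’ concrete, here is a minimal Python sketch of what a bookseller’s ordering software might do with those three lines. It uses Python’s standard xml.etree.ElementTree module. One assumption: the three tags are wrapped in a single <book> tag (a name invented for this example), because a well-formed XML document needs exactly one outermost element.

```python
import xml.etree.ElementTree as ET

# The catalogue from the text, wrapped in a hypothetical <book> root element.
CATALOGUE = """
<book>
  <title>The Great Gatsby</title>
  <price>$5</price>
  <source>famousbooks.com</source>
</book>
"""

def read_book(xml_text):
    """Return a dict mapping each tag name to the text it labels."""
    book = ET.fromstring(xml_text)
    return {child.tag: child.text for child in book}

record = read_book(CATALOGUE)
print(record["title"], "costs", record["price"], "from", record["source"])
```

Because the labels are consistent, the same few lines would work on any other book tagged the same way.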

Once content is machine-readable, we can use computers to share it, remix it, apply different designs, and output it to print, websites, ebooks and promotional material.

Catalogues in XML can be shared with booksellers and distributors automatically.

Book content in XML lets you move quickly when a great new sales channel comes along that requires a new format. For a large number of books, it’s quicker and more accurate to get a computer to convert your XML into the correct format than it is to get a team of computer operators to copy and paste manually.

Why are there many kinds of XML?

Because, for example, a doctor and a book publisher mean something different when they tag something as a <head>. The doctor and the publisher have different vocabularies. There are also vocabularies for weather, financial statements, theology, travel bookings, astronomy, and hundreds of other fields.

Someone hands me an XML document. How do I know what XML vocabulary is being used?

It’s stated at the top of the document. Here, we call a vocabulary an XML namespace.
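As an illustration, here is a small Python sketch (standard library only) that looks at the top of a document and reports its namespace. The sample document is a tiny XHTML file made up for this example; the namespace shown is the real one for XHTML. Treat this as a sketch of the idea rather than a complete tool, since real documents can mix several namespaces.

```python
import xml.etree.ElementTree as ET

# A tiny XHTML document. The xmlns attribute on the outermost tag
# names the vocabulary (namespace) that the tags belong to.
DOCUMENT = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Chapter 1</title></head>
  <body><p>It was a dark and stormy night.</p></body>
</html>"""

def namespace_of(xml_text):
    """Return the namespace of the outermost element, or None if it has none."""
    root = ET.fromstring(xml_text)
    # ElementTree reports namespaced tags as "{namespace}name".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None

print(namespace_of(DOCUMENT))
```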

In ebooks, we almost always use the HTML namespace. But HTML doesn’t have enough tags to describe every feature of a book. For instance, there’s no <dedication> tag.

So before we make the ebook in HTML, we can store the book content in another namespace with a bigger, richer vocabulary, like TEI or DocBook. Then we have hundreds of tags for nearly everything that might ever appear in a book. (That’s a lot of tags for a person to learn.)

Can we automatically translate one vocabulary into another?

In human languages, you can roughly translate words (pig = vark) directly. But the grammar of two languages can be completely different. The same goes for XML namespaces. You might be able to roughly translate the tags, but the structure is much harder to translate. You need very clever software to get even close.

For example, a doctor might record patient notes like this:

<body>

<head>Painful</head>

<torso>

<chest>Pain on left side</chest>

<arms>

<right-arm>No symptoms</right-arm>

<left-arm>Painful, numbness</left-arm>

</arms>

</torso>

<legs>

<right-leg>No symptoms</right-leg>

<left-leg>No symptoms</left-leg>

</legs>

</body>

Notice how the body-part tags appear inside their ‘parent’ body parts (e.g. chest is a sub-section of torso). We say they’re ‘nested’ inside each other, like Russian dolls.
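Nesting is easy for a computer to follow. The Python sketch below reads the doctor’s notes (assuming every tag is properly closed) and prints each tag indented by how deeply it is nested. The function name outline is made up for this example.

```python
import xml.etree.ElementTree as ET

NOTES = """<body>
  <head>Painful</head>
  <torso>
    <chest>Pain on left side</chest>
    <arms>
      <right-arm>No symptoms</right-arm>
      <left-arm>Painful, numbness</left-arm>
    </arms>
  </torso>
  <legs>
    <right-leg>No symptoms</right-leg>
    <left-leg>No symptoms</left-leg>
  </legs>
</body>"""

def outline(element, depth=0):
    """Return (depth, tag) pairs for an element and everything nested in it."""
    rows = [(depth, element.tag)]
    for child in element:
        rows.extend(outline(child, depth + 1))
    return rows

# Print the tags as an indented outline, one level of indent per level of nesting.
for depth, tag in outline(ET.fromstring(NOTES)):
    print("  " * depth + tag)
```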

A vet’s XML tags might use slightly different vocabulary and be structured differently:

<body>

<head>Drooping</head>

<trunk>Swollen</trunk>

<legs>

<front-right>No symptoms</front-right>

<front-left>Numbness</front-left>

<rear-right>No symptoms</rear-right>

<rear-left>No symptoms</rear-left>

</legs>

</body>

So, once you’ve chosen a vocabulary, you need to commit to it for a while. (Or pay a software developer to make a translator. More on that later.)
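To see why a word-for-word translation falls short, here is a rough Python sketch that renames a shortened version of the doctor’s tags into the vet’s. The TAG_MAP dictionary is invented for this example. The tags translate easily, but the structure stays the doctor’s: the front leg is still filed under the trunk, not under the legs where the vet expects it.

```python
import xml.etree.ElementTree as ET

# Hypothetical word-for-word dictionary from the doctor's tags to the vet's.
TAG_MAP = {
    "torso": "trunk",
    "right-arm": "front-right",
    "left-arm": "front-left",
    "right-leg": "rear-right",
    "left-leg": "rear-left",
}

DOCTOR = """<body>
  <torso>
    <arms><left-arm>Painful, numbness</left-arm></arms>
  </torso>
  <legs><left-leg>No symptoms</left-leg></legs>
</body>"""

def rename_tags(xml_text):
    """Rename tags using TAG_MAP, leaving the nesting untouched."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = TAG_MAP.get(element.tag, element.tag)
    return root

translated = rename_tags(DOCTOR)
# The vet expects <front-left> directly inside <legs>, but it is still
# inside <arms> inside <trunk>: the words changed, the grammar did not.
print(translated.find("legs/front-left"))  # prints None
print(translated.find("trunk/arms/front-left").text)
```

A real translator would also have to move elements around, which is where the clever (and expensive) software comes in.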

How do we choose a vocabulary?

First, find out what others in your industry are using. If everyone in your industry used the same vocabulary, there would be loads of cheap software tools, experienced service providers, and online forums to learn from. And when you hire new staff, they’d know your vocabulary already.

Popular vocabularies for books are DocBook, TEI, and XHTML.

Can we create our own vocabulary?

Sure. But that’s like inventing your own language that only your family speaks. It may serve your purposes at home well, and be very specific – but using it with anyone else is problematic.

How do we create the XML?

You can outsource it or assign someone to learn it in-house. Either way, a human being will be marking up the document paragraph by paragraph using XML-editor software. This software often lets you work in two ways:

The code view: This way, you work directly with the code, putting in the <code>tags</code> by hand.
The WYSIWYG view: This way, the text looks something like a word-processor, and you tag things by clicking, dragging, and choosing tags from menus.

You might use powerful XML software like Oxygen. If you have chosen a simple vocabulary like XHTML, you can use one of many cheap or free editors, like Sigil.

It’s very important that someone in-house knows how to write and structure the XML, so that you can spot and fix problems and discuss solutions with suppliers.

How do we put our XML in an ebook?

With more special software tools, each of which does this in a different way, depending on your XML vocabulary, in-house skill, and what you want to produce.

For instance, if your designers are sharp and producing both print and ebooks, you’ll import XML into InDesign, tweak page layout, then export to print PDF and epub format ebook (which is mostly XHTML).

After that, using a free converter like Calibre or Kindlegen, you can convert the epub into mobi format for Kindle.

How do we control how it looks?

This is very important: XML marks up the purpose of content only, and not its appearance. Appearance is controlled by a stylesheet, which is quite separate from your XML. By attaching a stylesheet to an XML document, an ereader knows what to make each element look like.

Stylesheets are written in a language called CSS: cascading style sheets. (Don’t worry about the ‘cascading’ part at this point.)

Here’s an example of XHTML, its stylesheet, and what might appear in an ereader or web browser.

XHTML code:

<h1>Chapter 1</h1>

<p>It was a dark and stormy night.</p>

CSS code:

h1 {
  font-family: sans-serif;
  font-weight: bold;
  color: grey;
}

p {
  font-family: serif;
}

Appearance in ereader:

Chapter 1 (in bold, grey, sans-serif type)

It was a dark and stormy night. (in serif type)

CSS can control an immense array of display features. To see the power of CSS, visit csszengarden.com and epubzengarden.com. These two websites have one set of XHTML files, but several CSS files to choose from. By applying a different CSS file to the same XHTML, you get a completely different look and feel and even layout.
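The same idea can be sketched in a few lines of Python. The content is the XHTML from the example above; the ‘plain’ stylesheet matches the CSS above, and the ‘loud’ stylesheet and the page template are invented for this illustration.

```python
# One piece of XHTML content, taken from the example in the text.
CONTENT = "<h1>Chapter 1</h1>\n<p>It was a dark and stormy night.</p>"

# Two stylesheets: "plain" matches the example in the text, "loud" is made up.
STYLES = {
    "plain": "h1 { font-family: sans-serif; font-weight: bold; color: grey; }\n"
             "p { font-family: serif; }",
    "loud": "h1 { font-family: serif; color: red; text-align: center; }\n"
            "p { font-family: sans-serif; font-size: 120%; }",
}

# A hypothetical page template: the stylesheet and the content are slotted in.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Chapter 1</title><style>
{css}
</style></head>
<body>
{content}
</body>
</html>"""

def build_page(style_name):
    """Combine the unchanged content with the chosen stylesheet."""
    return PAGE.format(css=STYLES[style_name], content=CONTENT)

pages = {name: build_page(name) for name in STYLES}
print(pages["loud"])
```

If you save each page to a file and open it in a web browser, you should see the same words with two different looks. The content was never touched.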

Where to start?

If you want to store book content, start with HTML. If your team’s new to XML, appoint one of them to learn HTML and basic CSS. Then practise by creating epubs.

If you want to start using XML for marketing information (book metadata), start by learning how to create OPDS catalogues. OPDS (Open Publication Distribution System) is a kind of XML for sharing your catalogue with booksellers, distributors and ereaders.
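For a taste of what that involves: OPDS 1.x catalogues are built on Atom, a widely used XML vocabulary for feeds. The Python sketch below builds a single catalogue entry using the standard library. The book identifier and download address are made up, and this is only a fragment for illustration, not a complete or validated OPDS catalogue.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"  # OPDS 1.x catalogues are Atom feeds
ET.register_namespace("", ATOM)       # write Atom tags without a prefix

def catalogue_entry(title, book_id, updated, epub_url):
    """Build one Atom <entry> describing a book and where to get it."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}id").text = book_id
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
    # In OPDS, an "acquisition" link tells the reader where to get the book.
    ET.SubElement(entry, f"{{{ATOM}}}link", {
        "rel": "http://opds-spec.org/acquisition",
        "type": "application/epub+zip",
        "href": epub_url,
    })
    return entry

entry = catalogue_entry(
    "The Great Gatsby",
    "urn:example:great-gatsby",             # hypothetical identifier
    "2012-02-20T00:00:00Z",
    "https://famousbooks.com/gatsby.epub",  # hypothetical download address
)
print(ET.tostring(entry, encoding="unicode"))
```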

The key is this: the team must understand – and see in practice – the distinction between content tagged in markup code and how an ereader or web browser displays that content.

Give your team space to learn and experiment. Once they’re comfortable, they’ll be keen to start learning about more complex XML vocabularies like ONIX and DocBook.

Or you can outsource the whole thing, but that defeats the purpose. If you outsource too early, without learning the nitty-gritty in-house, you will never develop in-house expertise, you will always be at the mercy of your suppliers, and you will be unable to innovate technologically.

What are we aiming for?

The real champions of XML are those teams that have mastered an XML-first workflow. What this means is that from early on in the book-development process, perhaps as early as the writing or the editing, everyone is working in XML.

This is different from an XML-last workflow, where books are created the same way they always have been, and after the book’s gone to print, someone sends the PDF to an ‘encoder’, usually offshore, who manually ‘captures’ the book in an XML form.

Each company has to choose how early in its production process (the workflow) to start working in XML.

Why is early good? Well, XML lets us get computers to manage content. It lets us automate things. What can be automated? Some examples:

Version control: if everyone is working in XML on a server, the computer can keep track of who made what changes and when.
Web-based access for collaborative writing and editing: A computer can let you log into a manuscript from anywhere and work on it at the same time as someone else. There is only one master document.
Testing different designs: By applying different stylesheets to the same XML, you can instantly test different layouts at the click of a button. Importing XML into a carefully created InDesign template can cut 80% off typesetting time.
Real-time content management: By using a content management system, you can see at a glance what titles are in production, where each project is, how much has been written and by whom, and more.

Arthur Attwell 20 February 2012

This information is more than two years old, and may no longer be accurate.