Serving XHTML as application/xhtml+xml

It’s nearly a decade since W3C produced the first XHTML standard. In all that time, very few sites adopting it have gone as far as to serve the preferred MIME type (application/xhtml+xml). This is because it has been difficult to do well, and text/html sort-of works, so most website administrators don’t bother. Here are some tips to make things easier.

First of all, this article isn't about whether XHTML is a Good Thing. Personally, I take it for granted that it is, and merely note in passing that a minority of people disagree. Rather, I'm assuming XHTML has already been adopted, and discuss how to move on from serving it as `text/html` to get the benefit of XML by serving it as `application/xhtml+xml` instead (good introductions: W3C.org: Serving XHTML 1.0; XML.com: The Road to XHTML 2.0 - MIME Types). Also, if you go for the slimmed-down XHTML 1.1, you really ought to use `application/xhtml+xml`. Remember that serving XHTML properly has its perils; you must be aware of them and take action as needed, but it's not actually hard to do.

When I set about researching this topic, I found that only one existing article covers the topic broadly (MIME Types and Content Negotiation) and this only provides an overview. There are two broad approaches used to change the MIME type:

  • Use one of Apache’s features to detect the browser’s capability
  • Use server-side scripting - typically PHP - to detect the browser’s capability

Whilst a lot has been written on using PHP, this article instead is concerned with using Apache’s features. There are some good articles on using mod_rewrite to change the MIME type, and some good articles on using content negotiation for language switching, but none describe in detail using content negotiation for the purpose of switching MIME type. Almost all focus has been on using mod_rewrite. Although this isn’t a problem, arguably Apache’s content negotiation feature is more appropriate. It provides an alternative to using mod_rewrite, and then mod_rewrite can be used for other things it’s good at, without it needing the added complexity of header switching.

Content Negotiation

How does content negotiation work? In summary, your browser’s requests include several ‘Accept...’ header fields to tell the server what it accepts. The server sends back whichever version of the content it thinks the browser would be best able to display.

In fact, Apache supports ‘server-driven’ content negotiation as defined in the HTTP/1.1 specification, fully supporting the Accept, Accept-Language, Accept-Charset and Accept-Encoding request headers. My Firefox browser just made a request including these headers:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.7,cy;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.

For the sake of this particular discussion, we’ll only consider the first of these: Accept. A type listed without a q value has an implied quality of 1.0. So the Accept header states that my Firefox supports text/html with an implied quality of 1.0, application/xhtml+xml with the same implied quality (but listed second), application/xml with the lower quality of 0.9 and finally anything else (*/*) with quality 0.8.
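To make the reading of that header concrete, here is a minimal sketch that splits an Accept header into media types and their quality values, using the default of 1.0 where no q is given. The function name `parse_accept` is my own invention for illustration, and the parsing is deliberately naive (it ignores parameters other than q).

```shell
#!/bin/sh
# Split an Accept header into "type q" pairs, one per line.
# A media range without an explicit q parameter gets the default 1.0.
parse_accept() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F';' '
        {
            mtype = $1
            gsub(/[ \t]/, "", mtype)
            q = "1.0"
            for (i = 2; i <= NF; i++) {
                p = $i
                gsub(/[ \t]/, "", p)
                if (p ~ /^q=/) q = substr(p, 3)
            }
            printf "%s %s\n", mtype, q
        }'
}

# The Firefox header quoted above.
parse_accept 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
```

Run against the Firefox header, this lists text/html and application/xhtml+xml at 1.0, application/xml at 0.9 and */* at 0.8.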

How Do Various Browsers Compare?

If we want to use server-driven content negotiation, we need to be sure it will work in practice. So several browsers were checked and these Accept headers were seen:

| Layout Engine | Browser | Accept |
|---|---|---|
| Amaya | Amaya 11.1 | */*;q=0.1, image/svg+xml, application/mathml+xml, application/xhtml+xml |
| Built-in | Elinks 0.11.1 | */* |
| Gecko | Epiphany 2.22 | text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8 |
| Gecko | Firefox 3.0 | text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8 |
| Trident | Internet Explorer 6 | */* |
| Trident | Internet Explorer 7 | */* |
| Trident | Internet Explorer 8 | */* |
| KHTML | Konqueror 4.2.2 | text/html, image/jpeg;q=0.9, image/png;q=0.9, text/*;q=0.9, image/*;q=0.9, */*;q=0.8 |
| Built-in | Lynx 2.8.5 | text/html, text/plain, text/css, text/sgml, */*;q=0.01 |
| Presto | Opera 9.64 | text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1 |
| WebKit | Chrome 1.0 | text/xml, application/xml, application/xhtml+xml, text/html;q=0.9, text/plain;q=0.8, image/png,*/*;q=0.5 |
| WebKit | Safari 528.16 | application/xml, application/xhtml+xml, text/html;q=0.9, text/plain;q=0.8, image/png,*/*;q=0.5 |

Generally, it’s obvious from the table that the layout engine dominates the accepted types. For example, Epiphany and Firefox are both based on the Gecko engine and send exactly the same Accept header. Likewise Chrome and Safari share the WebKit engine and send almost the same header.

Internet Explorer and Elinks blandly say they support everything. I suppose as a rough approximation this might be nearly true if it is interpreted as ‘having a go’ at any type of content, but it does seem rather unlikely that every kind of content will be handled. Nevertheless the HTTP standard allows it and that’s what these browsers use. Amaya is a bit more honest about the ‘have a go at anything’ stance: it states “*/*;q=0.1” which, in English, means it will have a go at anything but with only a 0.1 (i.e. 10%) preference for unknown content compared with its preferred list of alternatives.

This information tells us IE accepts anything, but Microsoft have stated that IE doesn’t yet officially handle XHTML. So does this prevent us from serving XHTML via content negotiation? Does IE make it hopeless to attempt to select content based only on the ‘Accept’ header because we know IE can’t accept XHTML even though it says it accepts anything and everything?

Well, no, fortunately. Investigation suggests that IE does not actually let us down. In a short while, we’ll look at the results of a compatibility test. But first, having heaved a sigh of relief, let’s consider how we can actually set up Apache to do content negotiation.

Apache Configuration

What we have to do is quite simple. There are three steps: set up our web pages, set some AddType directives, and alter the DirectoryIndex directive. I’m assuming you’re at ease with HTML page creation, but we go one step further: for every HTML page, we also create an XHTML page. This might sound like we’re wasting space, but it won’t if you have a Linux server. You simply make all your pages valid XHTML (important!), and then give each file two different names. Linux makes this easy: you use the ‘ln’ command to make ‘hard’ (or symbolic) links, just like this:

for f in *.html; do ln "$f" "$(basename "$f" .html).xhtml"; done

This command creates a ‘hard’ link called something.xhtml for every file called something.html. You could use symlinks if you prefer (via ‘ln -s’). You could also just copy the files (via ‘cp’ on Linux) but that will use up more disk space. Of course, ‘ln’ could be used either way round: if you start with the XHTML files, you can easily ‘ln’ them to make identical HTML files. Either way, we end up with each XHTML file having an identical HTML file with an identical filename except for the extensions, which are .html and .xhtml.
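If you would like to convince yourself that a hard link really is one file with two names, the following sketch tries the loop in a throwaway directory and compares inode numbers. The file names and the `inode` helper are made up for the illustration; nothing here touches a real document root.

```shell
#!/bin/sh
# Work in a temporary directory so no real site files are touched.
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Two stand-in pages (hypothetical names and content).
printf '<html xmlns="http://www.w3.org/1999/xhtml"/>\n' > index.html
printf '<html xmlns="http://www.w3.org/1999/xhtml"/>\n' > about.html

# Give every .html file a second name ending in .xhtml (hard link).
for f in *.html; do
    ln "$f" "$(basename "$f" .html).xhtml"
done

# Helper: print the inode number of a file.
inode() { ls -i "$1" | awk '{print $1}'; }

# Both names should report the same inode, i.e. the content is stored once.
for f in *.html; do
    x="$(basename "$f" .html).xhtml"
    if [ "$(inode "$f")" = "$(inode "$x")" ]; then
        echo "$f and $x are the same file"
    fi
done
```

Because both names point at the same data, editing index.html in place also changes index.xhtml. Be aware that some editors and upload tools replace a file rather than rewrite it, which silently breaks a hard link; symlinks avoid that problem.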

OK, so let’s have a look at the Apache directives. With the AddType directive, you can map a given filename extension onto the specified content type. We’ll use two of them: the first one maps our XHTML content type, application/xhtml+xml, to all files with filenames ending in .xhtml. The second one maps old-fashioned text/html to files with filenames ending in .html, but it does so with a lowered preference of 70%. This means that our default preference is for the server to serve XHTML, whereas HTML is served as second-best. Here are the two directives:

AddType application/xhtml+xml .xhtml
AddType text/html;q=0.7 .html
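To see how these two server-side preferences combine with what the browser sends, here is a rough sketch of the arithmetic. It assumes the choice boils down to multiplying the browser's q value by the server's quality for each variant and taking the highest product; Apache's real algorithm has further special cases and tie-breakers that are left out. The function name `negotiate` is hypothetical.

```shell
#!/bin/sh
# Simplified sketch of server-driven negotiation between the two variants.
# Assumption: score = browser q * server quality, highest score wins.
negotiate() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F';' '
        BEGIN {
            # Server-side qualities taken from the two AddType directives.
            qs["application/xhtml+xml"] = 1.0
            qs["text/html"] = 0.7
        }
        {
            range = $1
            gsub(/[ \t]/, "", range)
            q = 1.0
            for (i = 2; i <= NF; i++) {
                p = $i
                gsub(/[ \t]/, "", p)
                if (p ~ /^q=/) q = substr(p, 3) + 0
            }
            accept[range] = q
        }
        END {
            best = ""; best_score = 0
            for (v in qs) {
                wild = v
                sub(/\/.*/, "/*", wild)
                # Most specific matching range wins: exact, then type/*, then */*.
                if (v in accept) q = accept[v]
                else if (wild in accept) q = accept[wild]
                else if ("*/*" in accept) q = accept["*/*"]
                else q = 0
                score = q * qs[v]
                if (score > best_score) { best = v; best_score = score }
            }
            print best
        }'
}

negotiate 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'  # Firefox
negotiate 'text/html, text/plain, text/css, text/sgml, */*;q=0.01'            # Lynx
negotiate '*/*'                                                               # Internet Explorer
```

For the Firefox header the XHTML variant scores 1.0 against 0.7 for HTML, so XHTML is chosen. For Lynx, XHTML only matches */* at 0.01, so HTML wins with 0.7. A bare */* gives both variants the same browser q, which leaves the server's own preference for XHTML to decide.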

Our other Apache directive makes a small adjustment to the directory indexing so that Apache can choose what to do when the user asks for a URL ending with ‘/’. We want either the index.xhtml or index.html file to be chosen according to the content negotiation. Apache makes this simple using DirectoryIndex:

DirectoryIndex index

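is all you need to configure Apache to do just this. You may have seen the DirectoryIndex directive used with a list of alternatives; that would work, but it’s simpler and clearer just to specify ‘index’. Note that resolving ‘index’ to index.xhtml or index.html relies on MultiViews being enabled for the directory, as it is in the configuration in the appendix.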

OK, so now we have three Apache directives set up and all our pages exist in both HTML and XHTML form. Now for some testing.
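Before reaching for real browsers, a quick check from the command line can show which MIME type the server picks for a given Accept header. This is only a sketch: `served_type` is a name I made up, and the host in the example calls is the one from the appendix, so substitute your own.

```shell
#!/bin/sh
# Print the Content-Type the server returns for a URL ($1)
# when the request carries the given Accept header ($2).
served_type() {
    curl -s -o /dev/null -w '%{content_type}\n' -H "Accept: $2" "$1"
}

# Example calls (uncomment and point them at your own server):
# served_type http://sibdev11/ 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
# served_type http://sibdev11/ 'text/html, text/plain, text/css, text/sgml, */*;q=0.01'
```

With the configuration described above, I would expect the first call to report application/xhtml+xml and the second text/html, but check it against your own setup.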

The next table shows the same browsers we looked at earlier, but this time with the outcome of a compatibility test.

| Browser | Accept | Renders application/xhtml+xml |
|---|---|---|
| Amaya 11.1 | */*;q=0.1, image/svg+xml, application/mathml+xml, application/xhtml+xml | yes (incorrect layout) |
| Elinks 0.11.1 | */* | yes, only as text |
| Epiphany 2.22 | text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8 | yes |
| Firefox 3.0 | text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8 | yes |
| Internet Explorer 6 | */* | yes (incorrect layout) |
| Internet Explorer 7 | */* | yes (slightly incorrect layout) |
| Internet Explorer 8 | */* | yes |
| Konqueror 4.2.2 | text/html, image/jpeg;q=0.9, image/png;q=0.9, text/*;q=0.9, image/*;q=0.9, */*;q=0.8 | yes |
| Lynx 2.8.5 | text/html, text/plain, text/css, text/sgml, */*;q=0.01 | yes, only as text |
| Opera 9.64 | text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1 | yes |
| Chrome 1.0 | text/xml, application/xml, application/xhtml+xml, text/html;q=0.9, text/plain;q=0.8, image/png,*/*;q=0.5 | yes |
| Safari 528.16 | application/xml, application/xhtml+xml, text/html;q=0.9, text/plain;q=0.8, image/png,*/*;q=0.5 | yes |

The right-hand column shows my observations. Until I ran this test and saw the results, I had been worrying that Internet Explorer would somehow let me down. We know it doesn’t support XHTML fully, unlike the Gecko and WebKit browsers, and we also know it doesn’t tell the server what it wants very clearly: */* is the best it can do. So there was reason for concern that any attempt to do content negotiation would fail.

But I was pleased to learn that IE happily receives XHTML and displays the pages just as normal (there are a few layout and CSS problems - but that’s a different story). Of the other browsers, only Lynx explicitly doesn’t handle XHTML - Apache negotiates the content seamlessly and we see the HTML page displayed correctly (albeit as text only in Lynx’s case).

Conclusion

This investigation of server-side content negotiation has shown that there is a practical way to configure Apache to serve XHTML using its preferred application/xhtml+xml MIME type. All major browsers behave well and the solution is relatively painless to apply. There is no need to use mod_rewrite - this is handy if you’re already using mod_rewrite for some other purpose.

Footnote - A Boost for Performance

I heartily recommend Firefox with the Firebug and YSlow plugins. I’ve learnt a lot from YSlow about making my websites load more slickly. I can also recommend Charles Proxy as a diagnostic tool for undertaking this sort of investigation.

Appendix - The Final Configuration

Below I’ve listed a virtual host configuration illustrating the points made above.

<VirtualHost *:80>
  ServerName      sibdev11
  #ServerAlias  ...  ...  ...
  DocumentRoot   /home/websites/sibdev/htdocs
  ErrorLog       /var/log/apache2/sibdev-error.log
  CustomLog      /var/log/apache2/sibdev-access.log combined
  LogLevel       info

  # performance tweaks
  FileETag none
  ExpiresActive On
  ExpiresDefault                 "access plus 30 days"
  ExpiresByType text/css         "access plus 24 hours"
  ExpiresByType text/javascript  "access plus 24 hours"
  ExpiresByType text/html        "access plus 24 hours"

  # mime control
  AddType application/xhtml+xml .xhtml
  AddType text/html;q=0.7 .html
  DirectoryIndex index

  # compression
  AddOutputFilterByType DEFLATE application/json application/javascript
  AddOutputFilterByType DEFLATE application/xhtml+xml application/xml-dtd
  AddOutputFilterByType DEFLATE image/svg+xml text/css text/javascript text/html
  <Directory  "/home/websites/sibdev/htdocs">
    Options MultiViews FollowSymLinks
    AllowOverride None
    Allow from All
  </Directory>
</VirtualHost>
