By Robert Crews
Let's use XSLT to transform one XHTML document into another. It seems like a task XSLT was made for, but we'll learn the solution is not straightforward. Our goal is to create an XSLT stylesheet to replace the header. Our initial HTML document contains two divs, one for the header and one for the content:
<html lang="en">
  <head>
    <title>Sample Document</title>
  </head>
  <body>
    <div id="HEADER">
      <p>This is the <a href="../index.html">navigation</a> section.</p>
    </div>
    <div id="CONTENT">
      <h1>Sample Document</h1>
      <p>This is a sample document</p>
    </div>
  </body>
</html>
We'll start with a stylesheet that copies the input to the output:
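<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Copy each element and attribute, then recurse into attributes and children. -->
  <!-- Text nodes are copied by XSLT's built-in template rule. -->
  <xsl:template match="*|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  <!-- Comments and processing instructions have no children, so an empty xsl:copy copies them whole. -->
  <xsl:template match="processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>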
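As a point of comparison, the same copy-everything behavior is commonly written as a single identity template that matches node(), rather than handling elements, comments, and processing instructions separately. The sketch below shows that common idiom; it is not the stylesheet used in the rest of this walkthrough, though either form should copy this sample document the same way.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: node() covers elements, text, comments, -->
  <!-- and processing instructions; @* covers attributes. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```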
Then we'll check our progress with Saxon:
rcrews$ java com.icl.saxon.StyleSheet -t -o out.html in.html xsl.xsl
SAXON 6.5.4 from Michael Kay
Java version 1.4.2_07
Preparation time: 263 milliseconds
Processing file:/Users/rcrews/Desktop/in.html
Building tree for file:/Users/rcrews/Desktop/in.html using class com.icl.saxon.tinytree.TinyBuilder
Tree built in 48 milliseconds
Execution time: 158 milliseconds
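We got what we expected. The new meta element comes from the HTML output method, which an XSLT 1.0 processor selects by default when the root element of the result is html in no namespace. Take it as Saxon's way of reminding us to explicitly state the file's character encoding: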
<html lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Sample Document</title>
  </head>
  <body>
    <div id="HEADER">
      <p>This is the <a href="../index.html">navigation</a> section.</p>
    </div>
    <div id="CONTENT">
      <h1>Sample Document</h1>
      <p>This is a sample document</p>
    </div>
  </body>
</html>
We'll add an XSLT template to match and change the header:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Replace the header div with new content. -->
  <xsl:template match="div[@id='HEADER']">
    <div id="HEADER">
      <p>This is the <em>improved</em> header.</p>
    </div>
  </xsl:template>
  <xsl:template match="*|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
This transformation produces the following:
<html lang="en">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Sample Document</title>
  </head>
  <body>
    <div id="HEADER">
      <p>This is the <em>improved</em> header.</p>
    </div>
    <div id="CONTENT">
      <h1>Sample Document</h1>
      <p>This is a sample document</p>
    </div>
  </body>
</html>
So far, so good. You might think we're done, but our input file is not really correct XHTML. Valid XHTML requires a document type declaration, and correct XHTML requires a proper namespace attribute:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Sample Document</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  </head>
  <body>
    <div id="HEADER">
      <p>This is the <a href="../index.html">navigation</a> section.</p>
    </div>
    <div id="CONTENT">
      <h1>Sample Document</h1>
      <p>This is a sample document</p>
    </div>
  </body>
</html>
Since our task was to produce XHTML, we'll need to add an output element to ensure XML output and specify the XHTML doctype:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Serialize as XML and emit the XHTML 1.0 Strict doctype. -->
  <xsl:output method="xml"
      doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
  <xsl:template match="div[@id='HEADER']">
    <div id="HEADER">
      <p>This is the <em>improved</em> header.</p>
    </div>
  </xsl:template>
  <xsl:template match="*|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
Here's the result:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Sample Document</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <div id="HEADER">
      <p>This is the <a href="../index.html" shape="rect">navigation</a> section.</p>
    </div>
    <div id="CONTENT">
      <h1>Sample Document</h1>
      <p>This is a sample document</p>
    </div>
  </body>
</html>
The document type declaration is correct, and so is the namespace attribute. I can easily remove the XML declaration with the omit-xml-declaration attribute of the output element. The shape attribute and its value come from the XHTML 1.0 DTD:
<!ATTLIST a
%attrs;
%focus;
charset %Charset; #IMPLIED
type %ContentType; #IMPLIED
name NMTOKEN #IMPLIED
href %URI; #IMPLIED
hreflang %LanguageCode; #IMPLIED
rel %LinkTypes; #IMPLIED
rev %LinkTypes; #IMPLIED
shape %Shape; "rect"
coords %Coords; #IMPLIED
>
This is correct, and it reminds me that my XML parser is reading the DTD every time I transform a document. DTDs are typically read from the SYSTEM URL in the document type declaration of the source document. For this project, that means requesting and receiving all four files that make up the XHTML 1.0 DTD during every transformation. It's not difficult to configure your environment to read from local files.
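One common way to read from local files is an OASIS XML catalog that maps the XHTML public identifiers to local copies. The sketch below assumes the four DTD files have been saved under a local dtd/ directory (a hypothetical location) and that your XML parser is configured with a catalog-aware entity resolver. How that resolver is wired into Saxon 6.5.4 depends on your parser setup and is not shown here.

```xml
<?xml version="1.0"?>
<!-- Hypothetical catalog.xml: maps XHTML 1.0 Strict public identifiers -->
<!-- to local copies. The dtd/ paths are assumptions for illustration. -->
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//W3C//DTD XHTML 1.0 Strict//EN"
          uri="dtd/xhtml1-strict.dtd"/>
  <public publicId="-//W3C//ENTITIES Latin 1 for XHTML//EN"
          uri="dtd/xhtml-lat1.ent"/>
  <public publicId="-//W3C//ENTITIES Symbols for XHTML//EN"
          uri="dtd/xhtml-symbol.ent"/>
  <public publicId="-//W3C//ENTITIES Special for XHTML//EN"
          uri="dtd/xhtml-special.ent"/>
</catalog>
```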
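The real problem, however, is that the header wasn't changed. Obviously I forgot to account for the XHTML namespace: now that the input declares it, an unprefixed div in a match pattern, which means a div in no namespace, no longer matches anything. These changes should fix the problem: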
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" omit-xml-declaration="yes"
      doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
  <xsl:template match="xhtml:div[@id='HEADER']">
    <div id="HEADER">
      <p>This is the <em>improved</em> header.</p>
    </div>
  </xsl:template>
  <xsl:template match="*|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
Here's the result:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Sample Document</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <div xmlns="" xmlns:xhtml="http://www.w3.org/1999/xhtml" id="HEADER"><p>This is the <em>improved</em> header.</p></div>
    <div id="CONTENT">
      <h1>Sample Document</h1>
      <p>This is a sample document</p>
    </div>
  </body>
</html>
This looks better. The content seems correct. There are some extra div attributes… Wait a second. The default namespace of the new div is now the empty string, not http://www.w3.org/1999/xhtml as it is for the other elements. Saxon added a namespace declaration for http://www.w3.org/1999/xhtml, even giving it the same prefix that I had set in the stylesheet; however, it didn't apply the xhtml prefix to any of the elements. This output might render OK in a browser, but the file has been effectively destroyed for subsequent transformations. It will be very difficult now to select, for example, the em element in the header div because it's not in the same namespace as the rest of the document. Clearly a subtle but horrible mistake has been made. Let's try to fix this with a namespace alias:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" omit-xml-declaration="yes"
      doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
  <xsl:namespace-alias stylesheet-prefix="#default" result-prefix="xhtml"/>
  <xsl:template match="xhtml:div[@id='HEADER']">
    <div id="HEADER">
      <p>This is the <em>improved</em> header.</p>
    </div>
  </xsl:template>
  <xsl:template match="*|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
We get this output:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Sample Document</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <div xmlns:xhtml="http://www.w3.org/1999/xhtml" id="HEADER"><p>This is the <em>improved</em> header.</p></div>
    <div id="CONTENT">
      <h1>Sample Document</h1>
      <p>This is a sample document</p>
    </div>
  </body>
</html>
This looks better. In this version, the default namespace for the div hasn't been changed. It looks like there are no more namespace problems. The extraneous xmlns:xhtml attribute looks odd, but although the xhtml prefix is being declared, it's not being used. Even if it were, it's defined to be the same namespace we're using for the rest of the document. What's the harm? I see. The output is no longer valid XHTML. The xmlns:xhtml attribute is not valid for the div element:
/Users/rcrews/Desktop/out.html... Not valid. No declaration for attribute xmlns:xhtml of element div
Maybe we should just exclude the xhtml namespace prefix from the output with the exclude-result-prefixes attribute of the stylesheet element:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xhtml="http://www.w3.org/1999/xhtml" exclude-result-prefixes="xhtml">
That doesn't do anything. Maybe we need to exclude the default namespace?
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xhtml="http://www.w3.org/1999/xhtml" exclude-result-prefixes="#default">
We're grasping at straws. Why is that attribute showing up? It's not doing anything. Why can't it just not be there? It's probably there because Saxon thinks it might need that namespace declared. I'll bet it has something to do with white space. What if I use element and attribute elements rather than the "shortcut" literal HTML output? Let's do this:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" omit-xml-declaration="yes"
      doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
  <xsl:namespace-alias stylesheet-prefix="#default" result-prefix="xhtml"/>
  <xsl:template match="xhtml:div[@id='HEADER']">
    <xsl:element name="div">
      <xsl:attribute name="id">
        <xsl:text>HEADER</xsl:text>
      </xsl:attribute>
      <xsl:element name="p">
        <xsl:text>This is the </xsl:text>
        <xsl:element name="em">
          <xsl:text>improved</xsl:text>
        </xsl:element>
        <xsl:text> header.</xsl:text>
      </xsl:element>
    </xsl:element>
  </xsl:template>
  <xsl:template match="*|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
Here's the result:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Sample Document</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <div xmlns="" id="HEADER"><p>This is the <em>improved</em> header.</p></div>
    <div id="CONTENT">
      <h1>Sample Document</h1>
      <p>This is a sample document</p>
    </div>
  </body>
</html>
We've been here before! But the nice thing about being where you started is you know the terrain:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" omit-xml-declaration="yes"
      doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
  <xsl:namespace-alias stylesheet-prefix="#default" result-prefix="xhtml"/>
  <xsl:template match="xhtml:div[@id='HEADER']">
    <!-- Each constructed element now names its namespace explicitly. -->
    <xsl:element name="div" namespace="http://www.w3.org/1999/xhtml">
      <xsl:attribute name="id">
        <xsl:text>HEADER</xsl:text>
      </xsl:attribute>
      <xsl:element name="p" namespace="http://www.w3.org/1999/xhtml">
        <xsl:text>This is the </xsl:text>
        <xsl:element name="em" namespace="http://www.w3.org/1999/xhtml">
          <xsl:text>improved</xsl:text>
        </xsl:element>
        <xsl:text> header.</xsl:text>
      </xsl:element>
    </xsl:element>
  </xsl:template>
  <xsl:template match="*|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
These changes find success:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Sample Document</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
  </head>
  <body>
    <div id="HEADER"><p>This is the <em>improved</em> header.</p></div>
    <div id="CONTENT">
      <h1>Sample Document</h1>
      <p>This is a sample document</p>
    </div>
  </body>
</html>
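Because serialized output hides inherited namespaces, it can help to check them directly rather than by eye. The following diagnostic stylesheet is a sketch and was not part of the original workflow. Run against out.html, it prints each element's name followed by its namespace URI in braces. An element that had fallen out of the XHTML namespace, like the div in the earlier attempts, would show empty braces.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- Print one line per element: its name, then its namespace URI. -->
  <xsl:template match="/">
    <xsl:for-each select="//*">
      <xsl:value-of select="name()"/>
      <xsl:text> {</xsl:text>
      <xsl:value-of select="namespace-uri()"/>
      <xsl:text>}&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```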
Turns out the namespace-alias element was a red herring:
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" omit-xml-declaration="yes"
      doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
  <!-- <xsl:namespace-alias stylesheet-prefix="#default" result-prefix="xhtml"/> -->
  <xsl:template match="xhtml:div[@id='HEADER']">
    <xsl:element name="div" namespace="http://www.w3.org/1999/xhtml">
      <xsl:attribute name="id">
        <xsl:text>HEADER</xsl:text>
      </xsl:attribute>
      <xsl:element name="p" namespace="http://www.w3.org/1999/xhtml">
        <xsl:text>This is the </xsl:text>
        <xsl:element name="em" namespace="http://www.w3.org/1999/xhtml">
          <xsl:text>improved</xsl:text>
        </xsl:element>
        <xsl:text> header.</xsl:text>
      </xsl:element>
    </xsl:element>
  </xsl:template>
  <xsl:template match="*|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
Removing it yields the same results. The ultimate solution for converting XHTML to XHTML with XSLT is to build your result-tree changes with XSLT element instructions and to explicitly identify the namespace of every element.
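Explicitly identifying the namespace does not necessarily require xsl:element. As a sketch that was not run as part of this walkthrough, literal result elements can also be placed in the XHTML namespace by declaring it as the stylesheet's default namespace. The xhtml prefix is still needed in match patterns, because XPath 1.0 treats unprefixed names as being in no namespace. The exclude-result-prefixes attribute is there to keep the otherwise unused prefix declaration out of the output.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    xmlns="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xhtml">
  <xsl:output method="xml" omit-xml-declaration="yes"
      doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
      doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
  <!-- The prefix is required here: unprefixed names in patterns mean no namespace. -->
  <xsl:template match="xhtml:div[@id='HEADER']">
    <!-- These literal result elements pick up the default (XHTML) namespace. -->
    <div id="HEADER">
      <p>This is the <em>improved</em> header.</p>
    </div>
  </xsl:template>
  <xsl:template match="*|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
```

Whether this reads better than nested xsl:element instructions is a matter of taste. It keeps the replacement markup looking like the XHTML it produces, at the cost of relying on a namespace declaration at the top of the stylesheet.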