Categorized: Markup
XHTML vs HTML
If XHTML served as text/html is really just invalid HTML that renders predictably, then why does a document with an HTML 4.01 Strict doctype still validate with XHTML-style self-closing tags (<img src="#" />)?
Update (Oct 25, 2006): I like this!
Comments
Because the forward slash is a technically valid way of closing the IMG tag in HTML. If you validate such a page with the W3 validator and “Show Parse Tree” turned on, you’ll see a literal “>” after it.
The fact that no major browser will ever parse like the validator is one of those wonderful tag soup details. :)
So, when the author of that article says:
That isn’t technically correct, because most XHTML-style markup will validate as HTML. The only self-closing tags that I’ve found that HTML validation isn’t happy with are meta tags. Maybe self-closing tags and other XML-style markup just aren’t as “proper” for plain HTML? I don’t know.
The reason that <img src="#" /> is accepted as HTML is that the forward slash is parsed as an attribute name with no value. As the forward slash is not a valid attribute name, it is ignored. This trick only works for elements that HTML declares as empty (<img> and <br> being the most common). To be clear, in HTML the trailing forward slash in no way indicates the closing of a tag.
Trey, there are many instances in which XML-style "self-closing" tags will break when parsed using an HTML parser. The <script /> example is mentioned in the WebKit blog post, and reference is made to the fact that a construct such as "<p><b />Test</p>" will result in a bold word when parsed as HTML but non-bold when parsed as XHTML.
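To make the <b /> case concrete, here is a rough Python sketch that feeds the same snippet to an XML parser and to a deliberately simplified tag-soup parser. The BrowserLikeParser class, the VOID_ELEMENTS subset, and both helper functions are made up for illustration. Python's html.parser is not a browser engine, so the "ignore the trailing slash" rule is an assumption added by hand to mimic the behaviour described above.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

SNIPPET = "<p><b />Test</p>"

# Elements that never have content in HTML (a subset, enough for this sketch).
VOID_ELEMENTS = {"br", "img", "hr", "meta", "link", "input"}


class BrowserLikeParser(HTMLParser):
    """Hypothetical helper: records which elements are open for each text node."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.text_contexts = []  # list of (text, tuple of open element names)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Assumption: mimic tag-soup handling, where the trailing slash is
        # ignored and <b /> is treated as a plain opening <b>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in self.open_elements:
            # Close everything up to and including the matching element.
            while self.open_elements.pop() != tag:
                pass

    def handle_data(self, data):
        self.text_contexts.append((data, tuple(self.open_elements)))


def html_context(markup):
    """Return each text node with the elements open around it (tag-soup view)."""
    parser = BrowserLikeParser()
    parser.feed(markup)
    parser.close()
    return parser.text_contexts


def xml_context(markup):
    """Return where the text sits relative to <b> when parsed as XML."""
    root = ET.fromstring(markup)
    b = root.find("b")
    return {"b.text": b.text, "b.tail": b.tail}


if __name__ == "__main__":
    print(html_context(SNIPPET))  # [('Test', ('p', 'b'))] -> text is inside <b>
    print(xml_context(SNIPPET))   # {'b.text': None, 'b.tail': 'Test'} -> text follows an empty <b>
```

In the XML view the b element is empty and "Test" comes after it, so nothing is bold. In the tag-soup view the b element is still open when "Test" arrives, which is the bold rendering mentioned above.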
I can assure you that Maciej is quite correct in stating that XHTML served as text/html leans very heavily on HTML parsers’ error handling.
[sigh, can you please remove my initial comment with literal HTML tags in it? It'd be good if the comment form had some indication of the format it expects input in ;-)]
If the trailing slash of a self-closing tag isn’t a valid attribute name, why does the W3C validator let it go without so much as a warning?
Try validating the following HTML document in the W3C validator with the “Show parse tree” option enabled. It shows that it parses everything after the forward-slash as being the content of the tag. This explains the extra > that Robert Marshall mentions seeing when validating an XML-style “self-closing” tag. Compare how the W3C validator parses this with how web browsers render it: the web browsers that I have tested all render “Text” as bold, indicating that the style attribute was parsed as part of the P element.
Test document:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Test Page</title>
</head>
<body>
<p /="a" style="font-weight: bold;">Text</p>
<p>More text</p>
</body>
</html>
But in any realistic use, it doesn’t matter. Nobody writes markup like that, do they? I understand what you’re saying, but that kind of thing is unlikely to matter if the person writing the markup has any sense at all.
And again I come to the same conclusion that I’ve had for quite some time:
It doesn’t matter as long as you use a strict doctype.
@Trey
The validator lets the slash “/” go because it has been updated to be aware of the XHTML syntax. The issue is all overblown FUD. It doesn’t matter whether the doctype is strict or transitional. Serving XHTML with the MIME type text/html is part of the W3C recommendations.
What I mean is that if you’re going to use the “best” markup, it should be strict–whether X or not. I officially don’t care about the issue of X or not, as long as it’s strict. Presentational markup is for squares.
Rob, the validator doesn’t let the forward slash go. It parses the HTML document according to the HTML standard, which mandates using the SGML rules for tag parsing. If you take a look at the parse tree produced by the W3C validator for a “Valid HTML 4.01 Strict” document and compare it to how web browsers render it, you will notice that they disagree on how a forward slash among the attributes is handled. Web browsers ignore it in order to allow “HTML-compatible” XHTML to be parsed correctly, while the W3C validator treats the forward slash as the end of the tag, and everything after it becomes the content of the element. “HTML-compatible” XHTML relies on this non-standard handling of forward slashes in browsers. If the W3C validator had been updated as you claim, the issue with the extra “>” at the end of <img /> that Robert Marshall notes in the first comment would not exist.
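As a rough illustration of the validator's side of this, here is a small Python sketch of the rule as described above: the first forward slash outside a quoted attribute value ends the start tag, and whatever follows is treated as content. The function name split_sgml_start_tag is made up for this example. It is a heavily simplified stand-in, not the validator's actual SGML parser, and it assumes attribute values are quoted.

```python
def split_sgml_start_tag(markup):
    """Split markup that begins with a start tag into (tag_text, rest).

    Simplified rule, assumed for illustration: inside a start tag, the first
    "/" or ">" found outside a quoted attribute value ends the tag.
    """
    if not markup.startswith("<"):
        raise ValueError("markup must begin with a start tag")
    quote = None  # the quote character we are currently inside, if any
    for i, ch in enumerate(markup[1:], start=1):
        if quote:
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch in "/>":
            # Either delimiter closes the start tag; the rest is content.
            return markup[1:i].strip(), markup[i + 1:]
    raise ValueError("unterminated start tag")


if __name__ == "__main__":
    # The leftover ">" is the literal character seen in the parse tree.
    print(split_sgml_start_tag('<img src="#" />'))
    # -> ('img src="#"', '>')

    # Everything after the slash becomes content of the P element.
    print(split_sgml_start_tag('<p /="a" style="font-weight: bold;">Text</p>'))
    # -> ('p', '="a" style="font-weight: bold;">Text</p>')
```

Under this reading the style attribute in the test document is never seen as an attribute at all, whereas browsers skip the slash and apply it.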