XHTML vs HTML

Posted by Trey on October 01, 2006

If XHTML served as text/html is really just invalid HTML that renders predictably, then why does a document with an HTML 4.01 Strict doctype still validate with XHTML-style self-closing tags (<img src="#" />)?

Update (Oct 25, 2006): I like this!

Comments

  1. Robert Marshall Sun, 01 Oct 2006 09:59:08 PDT

    Because the forward slash is a technically valid way of closing the IMG tag in HTML. If you validate such a page with the W3C validator and “Show Parse Tree” turned on, you’ll see a literal “>” after it.

    The fact that no major browser will ever parse like the validator is one of those wonderful tag soup details. :)

  2. Trey Sun, 01 Oct 2006 21:44:04 PDT

    So, when the author of that article says:

    the vast majority of supposedly XHTML documents on the internet are served as text/html. Which means they are not XHTML at all, but actually invalid HTML that’s getting by on the error handling of HTML parsers.

    That isn’t technically correct, because most XHTML-style markup will validate as HTML. The only self-closing tag that I’ve found that HTML validation isn’t happy with is the meta tag. Maybe self-closing tags and other XML-style markup just aren’t as “proper” for plain HTML? I don’t know.

  3. Mark Rowe Mon, 02 Oct 2006 07:31:09 PDT

    The reason that <img src="#" /> is parsed as valid HTML is that the forward slash is parsed as an attribute name with no value. As the forward slash is not a valid attribute name, it is ignored. This trick only works for elements which HTML declares as self-closing tags (<img> and <br> being the most common). To be clear, in HTML the trailing forward slash in no way indicates the closing of a tag.

    Trey, there are many instances in which XML-style "self-closing" tags will break when parsed using an HTML parser. The <script /> example is mentioned in the WebKit blog post, and reference is made to the fact that a construct such as "<p><b />Test</p>" will result in a bold word when parsed as HTML but non-bold when parsed as XHTML.

    I can assure you that Maciej is quite correct in stating that XHTML served as text/html is leaning very heavily on HTML parsers' error handling.

    [sigh, can you please remove my initial comment with literal HTML tags in it? It'd be good if the comment form had some indication of the format it expects input in ;-)]
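
For anyone who wants to check the XML half of the <p><b />Test</p> example from the comment above, here is a minimal Python sketch. It assumes the standard library's xml.etree.ElementTree as a stand-in for an XHTML (XML) parser, and the variable names are made up for illustration. It only shows the XML reading, where the word ends up outside the empty b element; it does not model how a browser's HTML parser treats the same markup.

```python
import xml.etree.ElementTree as ET

# The fragment from the comment above, parsed as XML (the XHTML reading).
fragment = "<p><b />Test</p>"
p = ET.fromstring(fragment)
b = p.find("b")

# In ElementTree, .text is the text inside an element and .tail is the
# text that follows its end tag within the parent.
inside_b = b.text   # nothing is inside the empty <b />
after_b = b.tail    # "Test" is a sibling of <b />, so it is not bold

print("inside <b>:", repr(inside_b))
print("after <b>: ", repr(after_b))
```

Under the XML rules the b element is closed immediately, so the text belongs to the paragraph. The HTML behaviour described in the comment, where the slash is ignored and b stays open, has to be checked in an actual browser.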

  4. Trey Mon, 02 Oct 2006 11:15:33 PDT

    If the forward slash isn’t a valid attribute name, why does the W3C validator let it go without so much as a warning?

  5. Mark Rowe Mon, 02 Oct 2006 19:42:22 PDT

    Try validating the following HTML document in the W3C validator with the “Show parse tree” option enabled. It shows that it parses everything after the forward slash as being the content of the tag. This explains the extra > that Robert Marshall mentions seeing when validating an XML-style “self-closing” tag. Compare how the W3C validator parses this with how web browsers render it: the web browsers that I have tested all render “Text” as bold, indicating that the style attribute was parsed as part of the P element.

    Test document:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
      <head>
        <title>Test Page</title>
      </head>
      <body>
        <p /="a" style="font-weight: bold;">Text</p>
        <p>More text</p>
      </body>
    </html>
    
  6. Trey Tue, 03 Oct 2006 00:36:01 PDT

    But in any realistic use, it doesn’t matter. Nobody writes markup like that, do they? I understand what you’re saying, but that kind of thing is unlikely to matter if the person writing the markup has any sense at all.

    And again I come to the same conclusion that I’ve had for quite some time:

    It doesn’t matter as long as you use a strict doctype.

  7. Rob Burns Wed, 04 Oct 2006 17:27:01 PDT

    @Trey

    The validator lets the slash “/” go because it has been updated to be aware of the XHTML syntax. The issue is all overblown FUD. It doesn’t matter whether the doctype is strict or transitional. It is part of W3C recommendations to serve XHTML as MIME type text/html.

  8. Trey Wed, 04 Oct 2006 18:11:11 PDT

    What I mean is that if you’re going to use the “best” markup, it should be strict–whether X or not. I officially don’t care about the issue of X or not, as long as it’s strict. Presentational markup is for squares.

  9. Mark Rowe Wed, 04 Oct 2006 19:57:57 PDT

    Rob, the validator doesn’t let the forward slash go. It parses the HTML document according to the HTML standard, which mandates using the SGML rules for tag parsing. If you take a look at the parse tree produced by the W3C validator for a “Valid HTML 4.01 Strict” document and compare it to how web browsers render it, you will notice that they disagree on how a forward slash among the attributes is handled. Web browsers ignore it in order to allow “HTML-compatible” XHTML to be parsed correctly, while the W3C validator treats the forward slash as the end of the tag, and everything after it becomes the content of the element. “HTML-compatible” XHTML relies on this non-standard handling of forward slashes in browsers. If the W3C validator had been updated as you claim, the issue with the extra “>” at the end of <img /> that Robert Marshall notes in the first comment would not exist.
