Coding with Jesse

Who will read your Semantic HTML?

January 3rd, 2007

I've talked about Semantic HTML before, and so have many other people. But the one thing I find missing in these discussions is an explanation of why we should use Semantic HTML, or more specifically, who or what it will be that later reads your Semantic HTML to extract meaning from it.

Semantic HTML is all about adding meaning to your document by using appropriate HTML elements. It's a great concept. Why waste time and space with unnecessary <div> and <span> elements when other more meaningful elements are available like <h1> or <label>?

Certainly, Semantic HTML has many practical benefits. <label> has usability and accessibility benefits, like allowing people to click some text to check a checkbox, or allowing people with screen readers to understand what a text input is for. Headers like <h1> give a document structure by allowing hierarchical naming of the sections of a document. <title> lets you give a document a title. And of course, <a> lets you create hyperlinks which tie web pages together.

All these are very clear and understandable benefits and ways of using Semantic HTML. I find there are other benefits to using a variety of elements, like being able to work with the HTML and CSS of a document more easily. If a document were made completely with <div>s, you'd spend a lot of time giving things class names and IDs unnecessarily, and you'd spend a lot of time trying to figure out which </div> goes with which <div>.

Now what really bugs me is when people start arguing about the semantic meaning of a certain element. Recently there was a discussion on Snook about the Use of ADDRESS Element. Now, I understand where such discussions are coming from. The W3C HTML specs do try to define what each element is supposed to be used for. But if such a definition isn't totally clear, then you really can't use the element for anything. And you know a definition is vague when a dozen semantic and standards enthusiasts can't quite agree. And if these people can't agree, then what about the millions of other people who make web pages and haven't even heard of the W3C?

I believe it comes down to practicalities. For example, with the <address> element, you could argue that the element should only be used for web page contact information because the specs imply this. But what will this allow us to do? Is someone going to make a tool that looks for <address> elements on a page and uses them to let you contact the owner of the page? I doubt it, except maybe spam email harvesters. And even if there were, the tool would find itself to be nearly useless since so many web pages are likely using <address> for a wide variety of purposes that have little to do with web page contact info.

And what about other elements that don't even have a specific use like ordered, unordered and definition lists? I find it hard to imagine a scenario where the semantic meaning implicit in a list of items can be utilized. It's possible Google Sets uses lists this way, but chances are it mostly uses comma-separated words. And maybe the definitions feature of Google could use definition lists, but most of the results come from sites that don't use definition lists at all.

Well this brings me to the point I wanted to make. We need to think about who or what it is that will actually be extracting the meaning we're adding to our documents by using Semantic HTML. And basically I can think of three groups:

  1. Web developers

    Yourself or others that will actually be reading and working with the HTML you produce. For this purpose, class names, IDs and elements all add semantic meaning or at least readability to a document. This makes it easier to work with the HTML, and understand what each element represents structurally.

  2. Search engine spiders and other bots

    These are tools that read a large number of web pages and try to extract some meaning from them. Search engines understand that text in titles, meta tags, links and headers is special. Technorati's Microformats Search is another great example of semantics being utilized.

  3. Web browsers, screen readers and other clients

    These understand what many of the different elements are for and allow the visitor to interact with these elements in a unique way, like with a checkbox or link. Also, the client can communicate semantics to a visitor by displaying elements a certain way, like numbering the items of an ordered list. However, this semantic communication can be messed with by using CSS. An ordered list with list-style: none and list items floated will communicate no semantics to a user of a visual web browser.
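To make the second group a little more concrete, here is a minimal sketch of what a bot might do with semantic elements. It is only an illustration, not how any real search engine works: the class name `SemanticExtractor`, the `extract` helper and the sample markup are all made up for this example, and it uses nothing but Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class SemanticExtractor(HTMLParser):
    """Collects the text of <title>, <h1>-<h6> and <a> elements (hypothetical bot)."""

    WATCHED = {"title", "h1", "h2", "h3", "h4", "h5", "h6", "a"}

    def __init__(self):
        super().__init__()
        self.found = []   # (tag, text, href-or-None), in the order elements close
        self._open = []   # [tag, text_parts, href] for each watched element still open

    def handle_starttag(self, tag, attrs):
        if tag in self.WATCHED:
            href = dict(attrs).get("href") if tag == "a" else None
            self._open.append([tag, [], href])

    def handle_data(self, data):
        # Text counts towards every watched element that is still open,
        # so a link inside a heading contributes to both.
        for entry in self._open:
            entry[1].append(data)

    def handle_endtag(self, tag):
        if self._open and self._open[-1][0] == tag:
            tag, parts, href = self._open.pop()
            text = " ".join("".join(parts).split())  # collapse whitespace
            self.found.append((tag, text, href))


def extract(html):
    parser = SemanticExtractor()
    parser.feed(html)
    parser.close()
    return parser.found


# Made-up sample page: the site name sits in a <div>, the rest uses semantic elements.
SAMPLE = """
<html><head><title>Who will read your Semantic HTML?</title></head>
<body>
  <div class="header">Coding with Jesse</div>
  <h1>Who will read your <a href="/tags/semantic">Semantic HTML</a>?</h1>
  <p>Some body text in a paragraph.</p>
  <h2>Comments</h2>
</body></html>
"""

if __name__ == "__main__":
    for tag, text, href in extract(SAMPLE):
        print(tag, repr(text), href)
```

The text inside the `<div class="header">` never shows up in the result, because this sketch has no way of knowing that a class called "header" means anything. The title, headings and link are picked up purely because of the elements used.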

In terms of these three groups of web page users, try to think of what difference it will make if a <dd> gets used for a blog post body instead of strictly a word definition. If the semantics can't be used or even considered to really mean anything, then can they even be considered semantic?


Comments

1 . Jonathan Snook at 2007-01-04T02:35:11.000Z

The reason why much of this debate occurs is because we want (need?) a consensus. It's the chicken vs egg thing. We need to establish a base from which quality tools can be built on top of. This is why microformats are taking off. Who cares if I use a class called "telephone" or "tel" or "classA"? They all do the same thing ... until tools can extract reliable data, but it's not reliable until there's a consensus.

So, web standards establish a baseline. Microformats establish a baseline. Then, tools can take advantage of them. Then, you can automatically book an event in Upcoming.org from a date in another web page. Then, you can add a contact to Outlook with the vcard data embedded in the page.
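As a rough illustration of why the agreed-upon name matters, here is a small sketch of a tool that reads contact data from a page. It assumes the hCard convention of a root element with class "vcard" containing "fn" and "tel" properties; the `HCardReader` class, the `read_hcards` helper and the sample markup are hypothetical, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are kept off the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class HCardReader(HTMLParser):
    """Reads 'fn' and 'tel' from elements nested inside class="vcard"."""

    PROPERTIES = ("fn", "tel")

    def __init__(self):
        super().__init__()
        self.cards = []    # one dict per class="vcard" element found
        self._stack = []   # (role, text_parts) for each open non-void element

    def _inside_card(self):
        return any(role == "vcard" for role, _ in self._stack)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        classes = (dict(attrs).get("class") or "").split()
        role = None
        if self._inside_card():
            # Only the agreed-upon property names are recognised.
            role = next((c for c in classes if c in self.PROPERTIES), None)
        elif "vcard" in classes:
            role = "vcard"
            self.cards.append({})
        self._stack.append((role, []))

    def handle_data(self, data):
        for role, parts in self._stack:
            if role in self.PROPERTIES:
                parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS or not self._stack:
            return
        role, parts = self._stack.pop()
        if role in self.PROPERTIES:
            self.cards[-1][role] = " ".join("".join(parts).split())


def read_hcards(html):
    reader = HCardReader()
    reader.feed(html)
    reader.close()
    return reader.cards


# Made-up markup: the first block follows the convention, the second does not.
SAMPLE = """
<div class="vcard">
  <span class="fn">Jane Example</span>,
  phone <span class="tel">+1 555 0100</span>
</div>
<div class="contact">
  <span class="telephone">+1 555 0199</span>
</div>
"""

if __name__ == "__main__":
    print(read_hcards(SAMPLE))
```

The second block looks just as sensible to a human, but a class called "telephone" inside a class called "contact" is invisible to this reader. Only the markup that follows the shared convention yields data.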

I think address is particularly maligned because the element's name seems to evoke so much meaning that one would think it obvious what it should do.

2 . Jason Barnabe at 2007-01-04T05:06:09.000Z

Another advantage is sane rendering when a stylesheet is not applied. Headers look like headers, lists look like lists, etc.

I think you're being a bit close-minded when it comes to possible uses of semantic HTML to bots. For example, a theoretical "Contact the webmaster" bot or extension doesn't need to have set data to make use of an address element - a possible algorithm could be
1. Look for anchors with a mailto: href in an address element
2. Look for text like *@*.* in an address element
3. Look for anchors with a mailto: href elsewhere
4. Look for text like *@*.* elsewhere
So for this bot/extension, the address element is certainly useful for it to reduce "false positives", but address elements used for other purposes and a lack of address elements don't trip it up. I'd find it surprising if Google Sets and Google Definitions *didn't* make use of semantic data in this manner. Even if W3C came out and said "put e-mail addresses in the address element", you'd still have to deal with all the same issues.

I don't see any downsides to being as semantic as possible other than possibly having to override the default CSS. With so many upsides, both theoretical and practical, why wouldn't you be as semantic as possible?
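The four-step lookup described in this comment can be sketched roughly as follows. This is only one possible reading of it: the `ContactFinder` class and `find_contact` function are invented names, the regular expression is a deliberately loose stand-in for "*@*.*" rather than a real address validator, and the example.com addresses are placeholders.

```python
import re
from html.parser import HTMLParser

# Loose pattern in the spirit of "*@*.*"; it does not validate addresses.
EMAIL_RE = re.compile(r"[^\s@<>]+@[^\s@<>]+\.[^\s@<>]+")


class ContactFinder(HTMLParser):
    """Sorts candidate e-mail addresses into the four steps of the algorithm."""

    def __init__(self):
        super().__init__()
        self.address_depth = 0
        # buckets[0]: mailto: links inside <address>
        # buckets[1]: e-mail-like text inside <address>
        # buckets[2]: mailto: links elsewhere
        # buckets[3]: e-mail-like text elsewhere
        self.buckets = [[], [], [], []]

    def handle_starttag(self, tag, attrs):
        if tag == "address":
            self.address_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().startswith("mailto:"):
                # Drop any "?subject=..." part after the address.
                email = href[len("mailto:"):].split("?")[0]
                self.buckets[0 if self.address_depth else 2].append(email)

    def handle_endtag(self, tag):
        if tag == "address" and self.address_depth:
            self.address_depth -= 1

    def handle_data(self, data):
        step = 1 if self.address_depth else 3
        self.buckets[step].extend(EMAIL_RE.findall(data))


def find_contact(html):
    """Returns the best candidate address, or None if the page has none."""
    finder = ContactFinder()
    finder.feed(html)
    finder.close()
    for bucket in finder.buckets:  # steps 1 to 4, in priority order
        if bucket:
            return bucket[0]
    return None


if __name__ == "__main__":
    page = ('<p>Write to sales@example.com</p>'
            '<address><a href="mailto:webmaster@example.com">Webmaster</a></address>')
    print(find_contact(page))
```

A page that uses <address> for a street address, or not at all, simply falls through to the later steps, which is the point being made: the element narrows the search when it is present and costs nothing when it is not.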

3 . Emil Stenström at 2007-01-04T22:18:58.000Z

I tend to look at things a little differently. I believe websites should be written for humans, not robots. Robots can be given info in other ways: link to a .vcard file instead of pushing it in with strange classnames.

As a web developer I think the biggest reason to use semantic HTML is to "do things the right way". In programming you don't repeat yourself in your code; you extract methods and call them. With CSS you don't define the style over and over again in the HTML; you extract it to a separate file and link it. Semantic HTML is a lot like (declarative) programming, and I think it should be compared to that.

4 . Jonathan Snook at 2007-01-05T01:40:55.000Z

Emil: Using vCard as an example specifically, the problem is that a browser can't render a vCard and they may not have an application that understands vCard. It can, however, render HTML just fine. So, a microformat still creates something that is flexible and usable by browsers and users but adds a consistent layer that allows applications to make use of it, too. And it saves you from duplicating contact information in the page AND in a vCard (D.R.Y.!).

5 . Keith Alexander at 2007-05-08T12:04:00.000Z

Jonathan: Emil has a valid point. In the excitement over the possibilities of aggregating microformats, it often seems to be forgotten that vcard and ical files are already published in large numbers on the web, and have very good tools for creating them.

In many cases, it makes more sense to use a script to generate html from the vcard or ical, and link to the original file, than it does to try to start with the html and generate a machine readable format. You/your client can use existing calendar and address book apps, and you don't have to compromise between accessibility and information loss (ie: the abbr[@title] hack).

That said, Emil, I don't understand the humans vs. machines dichotomy. Web pages are necessarily processed by machines for the value of humans, so what's wrong with increasing the value to humans by making it easier for machines where you can?

6 . lewis litanzios at 2008-03-28T00:16:31.000Z

slightly off topic considering the way the comments, albeit very interesting, are going i know, but i always wondered whether it mattered how you name your classes/IDs?

is 'camelCase' any different to using an 'under_score' in terms of how machines will interpret your semantic conventions? i feel more comfortable using camelCase these days, but do get slightly jealous of under_scores when i see them sometimes, for some strange reason (don't ask me why). i've gone off using hyphens since learning XML best practices.

i did think about blogging this myself, but it did occur to me it would be rather a short post. i think there's already been enough written on semantics recently to be jumping on the wagon.

thanks for raising this issue jesse.

ps. first i've heard of google sets - could this be used for generating meta keywords?

pps. do you 'ping' to technorati (http://technorati.com/ping) out of interest?

7 . Jesse Skinner at 2008-03-29T18:43:37.000Z

@lewis - Class names are only semantic in terms of communicating with other designers/developers working on the code. The only time machines/bots really care about class names is when dealing with microformats or other pre-defined meanings, and in that case the format is also unimportant as long as it's documented. Google, for example, doesn't search/index class names.

ps. sure, try it out. Google also has a keyword selector tool.

pps. I used to manually, but I get so little traffic from technorati (a few hits a month max) that I don't usually bother. My blog code is handrolled and I haven't bothered to build a ping tool.

8 . lewis litanzios at 2008-03-29T19:20:45.000Z

safe jesse, thanks for the heads up :)

9 . Cooper at 2008-09-03T02:48:22.000Z

While on the topic of microformats, I think they are going in the right direction but only apply to certain types of data such as address, geo, and so forth... With this in mind I think class/id naming needs to be standardized as well. For example, <div id="wrapper"> means absolutely nothing. Maybe, if wrapper was changed to <div id="page-content"> this would be more semantic. We need to entice a new movement in regards to a new semantic web.
@Snook - What are your thoughts about the semantics of naming conventions?

10 . brian at 2009-02-09T22:47:49.000Z

is the div really out of place on a page? It means there is a logical division in the page and therefore if you have two columns for instance, each column should be inside a div because they are separated from each other.

11 . Lewis Litanzios at 2009-02-09T23:33:45.000Z

2009 and now semantic naming is VERY important I feel.

I have a list on my wall now of semantic naming conventions I put together from a number of articles around the web the other day. The best and most comprehensive being this: http://www.stuffandnonsense.co.uk/archives/whats_in_a_name_pt2.html by Andy Clarke.

CSS signatures, Microformats, and even clients asking for this s**t now, so I'm rolling with it. Plus if you use SNCs it comes off like you're an organised f**ker too, not to mention it's great for human readability when you start pilfering jQuery functions (Jesse will agree with me here no doubt).

Excuse my swearing, this is practically the last sentence I will write before bed today :|

12 . Aaron at 2010-03-11T15:43:41.000Z

One thing that always bugs me is when I see people using block-level elements wrapped inside of DIV tags.

Something like this:
<div class="heading">
<h2>The heading</h2>
</div>

or worse yet:
<div class="heading">
<img src="theheadingimg.jpg" />
</div>

While we can all nitpick about microformatting and argue about the merits of what should go inside of an address tag, can we at least all agree that inventing new CSS classes to duplicate the function of perfectly usable HTML tags is ridiculous?