05.26.05

Structure on the Web: A Survey

Posted in General at 11:49 pm by Todd

Okay, having made a long-winded setup in previous posts, I want to delve into the real substance of the matter. If you accept the idea that adding structure (read: semantics) to content on the web will open up grand new possibilities, making content more accessible and useful, the question is: what’s the best approach? What follows is a brief survey of the various alternatives currently under development and discussion.

RDF & The Semantic Web

I touched on the Semantic Web in the previous post. From the perspective of a theoretical consumer of information, the Semantic Web is the gold standard of structured information representation. Alas, in reality, there are no commonly available tools for doing much with Semantic Web content. Piggy Bank is an interesting demonstration piece, but it’s not really clear what the next step is here.

On the authoring side, similarly, producing RDF content for the Semantic Web is a pain. Tools would help, but to date there has been a dearth of those, which is surprising given how long Semantic Web efforts have been underway.

Structured Blogging

Structured Blogging is another approach to authoring content with structure a la the Semantic Web. Proposed by PubSub and motivated by Bob Wyman’s excellent post on the macro value of structured content, Structured Blogging relies on a syntactic sleight of hand to embed well-formed XML content into standard HTML pages. PubSub has defined a set of vocabularies for describing blog entries, events, and reviews (the Holy Trinity of structured content?) and, perhaps more importantly, developed a WordPress plug-in to enable the creation and editing of content of any of these three types.

So how does Structured Blogging measure up? Authoring content is a snap — provided you’re using WordPress. If you’re not using WordPress, you’ll need to use WordPress. Or get someone to develop a plug-in for your authoring tool. Because (like RDF) you will not want to code this by hand.

On the consumption side, I haven’t found software that lets you do anything interesting with SB content. Since Structured Blogging content is well-defined XML using namespaces, it’s easily converted to RDF, for what it’s worth. (Nothing, as near as I can tell.)

MicroFormats

Of the formats surveyed, MicroFormats appear to have generated the greatest enthusiasm among the developer crowd (except perhaps amongst the hard-core RDF’ers, who, I gather, view them as too watered-down and weak). Sponsored by Technorati, MicroFormats take the simple approach of using enhanced markup to indicate semantic structure in XHTML documents. In doing so, MicroFormats leverage existing software that processes (X)HTML: in the worst case, unaware processors simply pass MicroFormats like a bird passes a seed. (No, that’s not an editorial comment.)

But, you might fairly ask, if using MicroFormats means that you embed semantic content directly into XHTML, how do you keep the machine-readable part separate from the displayable part? There’s a two-part answer. First, if you use MicroFormats, you’ll use CSS to control the presentation of content to human readers. For the most part, this means using CSS to hide those portions of the MicroFormat that aren’t display-worthy but are required for machine consumption. Second, for cases where the machine requires data in a form that’s inappropriate for display, MicroFormats use a hack based on the <abbr> tag: the machine-readable value goes in the tag’s title attribute, and the human-friendly text goes in its body.

MicroFormats are pretty easy to author; folks with decent HTML skills can conceivably do it by hand. And there are already a couple of simple tools that allow an author to fill out form fields, press a button, and get the requisite markup.

On the consumption side, there’s been some grousing that extracting semantic content from XHTML is hard, but frankly I don’t understand what the difficulty is. A parser is a parser is a parser. When people mess up their markup, parsers will fail, and that’s okay.
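To make the point concrete, here’s a minimal sketch in Python of what consuming a MicroFormat might look like. Everything in it is an assumption for illustration: the event markup is made up, the class names (vevent, summary, dtstart, location) follow the hCalendar convention, and EventExtractor and extract_event are hypothetical names of my own, not part of any published toolkit. It also shows the <abbr> trick, where the machine-readable value is taken from the title attribute and the display text is ignored.

```python
from html.parser import HTMLParser

# Made-up sample markup using hCalendar-style class names (an assumption
# for illustration, not taken from any real page).
SAMPLE = """
<div class="vevent">
  <span class="summary">Structure on the Web meetup</span>
  <abbr class="dtstart" title="2005-05-26T23:49:00">May 26, 11:49 pm</abbr>
  <span class="location">Palo Alto</span>
</div>
"""

# The class names this sketch treats as fields of interest.
FIELDS = {"summary", "dtstart", "location"}


class EventExtractor(HTMLParser):
    """Hypothetical helper: collects FIELDS from flat, un-nested markup."""

    def __init__(self):
        super().__init__()
        self.event = {}
        self._field = None  # field whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        matched = classes & FIELDS
        if not matched:
            return
        field = matched.pop()
        if tag == "abbr" and attrs.get("title"):
            # The <abbr> hack: the machine value lives in the title
            # attribute; the element body is for human readers only.
            self.event[field] = attrs["title"]
        else:
            self._field = field
            self.event[field] = ""

    def handle_data(self, data):
        if self._field is not None:
            self.event[self._field] += data

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field, so
        # markup nested inside a field would cut its text short.
        self._field = None


def extract_event(markup):
    """Return a dict of the fields found in the given markup."""
    parser = EventExtractor()
    parser.feed(markup)
    parser.close()
    return {name: value.strip() for name, value in parser.event.items()}


event = extract_event(SAMPLE)
print(event)
```

A real consumer would have to cope with nested markup and multiple events per page, but the core of the job is reading class attributes, which is about as ordinary as parsing gets.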

Danny Ayers has done some work to create a bridge between MicroFormats and RDF. Perhaps this will allay the concerns of the RDF folks who view MicroFormats as being too loose.

Inferred Semantics

The last category of semantic representation I’ll mention is actually unsemantic representation: it’s the idea (debunked in a previous post) that someone with a lot of spare CPU cycles on their hands will consume plain ol’ non-semantic content and calculate semantics using advanced textual-analysis techniques. This approach has the advantage that it doesn’t require anyone to change their content. But it sure does dump a lot of work in the laps of consumers. In this respect, it’s the antithesis of RDF.

Conclusion

It’s hard to say where any of this is going. For the moment, it seems like there’s a chicken/egg conundrum: how do you persuade content authors to expend extra energy so their content can sing in some new consumer tool or service? How do you drive acceptance of a new tool or service before there’s any content to populate it? I don’t have an answer to that question. But if you’re promulgating one of the above approaches to getting semantic content on the web, try this on for size, courtesy of Voidstar:

So here’s some advice for potential standard architects. To get your standard implemented, you need

  1. A written non-ambiguous doc in the style of an RFC with lots of examples
  2. Example Apps and proofs of concept for both the publisher and the subscriber that answers a real need
  3. Toolkit libraries in all the major languages and environments. perl *and* php. Java *and* C#
  4. A community of evangelists who go out and spread the word and put effort into persuading likely publishers and subscribers to support your new standard.

Is that the ticket? It may be.

3 Comments

  1. Danny said,

    May 28, 2005 at 2:45 am

    Hi Todd,
    Nice post. But I think we’re nearly out of the chicken/egg situation. On the producer side, don’t forget there’s RSS, which, even if it’s RSS 2.0, is structured enough to be consumable by RDF-based apps. The GRDDL approach to getting RDF from XHTML content (basically running XSLT on them) does offer an easy route to RDF for microformats and (I believe, I’ve not tried it yet) the structured blogging material. There are other kinds of content too - you know every piece of material to come out of Adobe tools (Photoshop etc) now contains embedded RDF?

    On the consumer side, yep, there is still a lack of big cool applications. But again, the RSS systems point to what can be done with a single vocabulary, the sky’s the limit when you bring in others, and take advantage of the data integration RDF supports. Thing is, once you have a system based on a flexible model (like RDF), it makes adding new stuff a whole lot easier than building on more rigid data models like SQL DBs or XML docs.

    Re. the lots-of-cycles textual analysis point - I don’t think this is necessarily antithetical to the structured approach. If you can do statistical/machine learning analysis with your free cycles, you can integrate the results of those analyses with the rest of your structured data. It’s easy enough to represent things like similarity measures or whatever in RDF.

    What I’m hoping will really get things moving in the next year or so is the SPARQL Protocol and RDF Query Language. In many cases it’s relatively easy to take data from the Web and extract/create RDF from it. So you can easily fill an RDF store with this stuff. Doing interesting things from there was, until recently, quite code-intensive. But SPARQL provides a comparatively easy way of making queries on the stuff - to anyone that’s used SQL it’s obvious how to use it, except the difference is that the store can contain all kinds of relations in the data, not just the ones set up in advance. Very rapid/agile development is possible.

  2. Danny said,

    May 28, 2005 at 2:49 am

    PS. A (free) tool that combines text analysis with structure data is Aduna AutoFocus: http://aduna.biz/products/autofocus

  3. Chao's Blog said,

    May 31, 2005 at 10:09 pm

    Structure on the Web

    I’ve been grazing at Todd Agulnick’s excellent blog about organizing and structuring the web and especially about the “lower case” semantic web. There’s a great intro entry in which he surveys 4 different leading efforts to add structure to conten…
