06.07.05

Writing Microformat Parsers

Posted in Microformats at 2:16 pm by Todd

The embedded microformat example from my previous post got me thinking about different approaches to writing a parser to consume microformats.

A month ago, there were some comments about different parsing approaches here. In thinking about the embedded example above, it’s clear that different approaches can lead to different results (beyond, of course, differences in performance).

You might, for instance, take the approach of using a regular expression to identify a node that indicates the start of a target microformat as Tantek suggested. Once you’ve identified an XML node that contains a microformat, you still have a couple of options.

In one approach, you simply query the subtree rooted at the identified start node, looking for values that are interesting. This is simple code to write but potentially gets you the wrong result: in the embedded example above, for instance, your query might return the wrong “url.”
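A rough sketch of how the query approach can go wrong. The markup here is a hypothetical stand-in for the embedded example (an hCard sitting inside an hCalendar event), and the helper names are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: an hCalendar event whose location holds an hCard.
SAMPLE = """
<div class="vevent">
  <span class="summary">Web 2.0 Conference</span>
  <div class="location vcard">
    <a class="url fn" href="http://hotel.example.com/">Hotel Example</a>
  </div>
  <a class="url" href="http://conference.example.com/">Event page</a>
</div>
"""

def has_class(elem, name):
    """True if `name` is one of the space-separated class values."""
    return name in elem.get("class", "").split()

def first_url_by_query(root):
    """Naive approach: first element in the subtree with class 'url'."""
    for elem in root.iter():  # document order, nesting ignored
        if has_class(elem, "url"):
            return elem.get("href")
    return None

event = ET.fromstring(SAMPLE)
print(first_url_by_query(event))  # http://hotel.example.com/
```

The query has no notion of which microformat a matching node belongs to, so the hCard’s url is reported as the event’s url simply because it comes first in the document.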

The other approach, which is a bit more complex because it requires implementing a state machine, actually traverses the subtree looking for interesting bits. If it gets to something it doesn’t understand (or, more likely, something it isn’t interested in), it keeps going. If it does find something that it’s interested in, it dives in to extract the relevant value.

This is where there’s a subtle interaction with the embedding example cited above. If the hCard embedded in the hCalendar microformat is bound to a known property of hCalendar (in the example, hCard is bound to hCalendar:location), then my parser will probably not get confused about the hCard:url property because it has enough state to know that it’s processing a known hCalendar property. Thus my hCalendar parser doesn’t really have to know much of anything about hCard, which is a bit of a relief.

If, however, the hCard is not bound to any of hCalendar’s properties — it’s merely inside it but not explicitly embedded in some known property of hCalendar — then I’ve got a potential problem. Either I have to know about hCard’s definition or I’m going to misinterpret hCard’s url as an hCalendar url.
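The two cases can be sketched with a small traversal that stops descending once it recognises an hCalendar property. This is only an illustration under assumptions: the markup is hypothetical, the property set is a small subset chosen for the example, and a real parser would need more state than this.

```python
import xml.etree.ElementTree as ET

# Illustrative subset of hCalendar property names, not the full list.
HCAL_PROPERTIES = {"summary", "location", "url", "dtstart", "dtend"}

# Hypothetical markup: the hCard is bound to hCalendar's location property.
BOUND = """
<div class="vevent">
  <span class="summary">Web 2.0 Conference</span>
  <div class="location vcard">
    <a class="url fn" href="http://hotel.example.com/">Hotel Example</a>
  </div>
  <a class="url" href="http://conference.example.com/">Event page</a>
</div>
"""

# Same event, but the hCard merely sits inside the vevent.
UNBOUND = BOUND.replace('class="location vcard"', 'class="vcard"')

def text_of(elem):
    """All text under elem, with whitespace collapsed."""
    return " ".join("".join(elem.itertext()).split())

def parse_event(root):
    found = {}

    def walk(elem):
        for child in elem:
            names = child.get("class", "").split()
            props = [n for n in names if n in HCAL_PROPERTIES]
            if not props:
                walk(child)  # nothing recognised here: keep going
                continue
            for prop in props:
                value = child.get("href") if prop == "url" else text_of(child)
                found.setdefault(prop, value)  # first value wins
            # A recognised property owns its subtree, so do not descend.

    walk(root)
    return found

bound = parse_event(ET.fromstring(BOUND))
unbound = parse_event(ET.fromstring(UNBOUND))
print(bound["url"])    # http://conference.example.com/
print(unbound["url"])  # http://hotel.example.com/
```

In the bound case the walk never looks inside the location element, so it needs no knowledge of hCard. In the unbound case nothing stops it from descending into the hCard, and the hCard’s url is taken for the event’s.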

But I wonder: why would someone embed an hCard inside an hCalendar without binding it to (i.e., embedding it inside) one of hCalendar’s properties? What would such embedding mean? If there’s no real reason to do this (because it doesn’t really mean anything) then the problem fairly evaporates, I think.

Is this right?

3 Comments »

  1. brian said,

    June 9, 2005 at 12:32 pm

    There are loads of questions posed; I hope this answer will help clarify some of them.

    The tool I have been working on, X2V, uses XSLT to make the transformations. The benefit of this is that ANY program or script, client or server application, can consume this XSLT file and make the transformation.

    At the moment there is no XMDP to fully describe the microformat, so this is based on the descriptions provided on the wiki at Technorati. That said, some of this might change.

    The first thing my XSLT looks for is any tag with the class=”vcard” or class=”vevent”. From there it will ONLY look for property elements that are children of that element. So it is not exactly a regular expression, but it is also not exactly a state machine. In your previous example, my X2V would find both class=”url” in the hCal and in the hCard because they both meet the criteria of being children of a vevent.

    The reason X2V is not ‘other microformat aware’, and so cannot avoid parsing both urls, is that the potential list of microformats is open-ended. It would be difficult to determine whether a value is a microformat or just a CSS style. For example:

    <a href=”url me ex-ref”>suda.co.uk</a>

    That contains an XFN property, an hCa* property, and an ex-ref… is ex-ref a css style or some doc-book microformat reference the parser does not yet know about? Should it ignore, skip, break, or use it?

    One additional thing X2V does is allow you to link to a document fragment. So instead of getting every vevent from a page, you can reference a document fragment and get a subset of the vevents on a page.

  2. Todd said,

    June 10, 2005 at 3:40 pm

    Brian,

    Thanks for clarifying. The concept behind X2V is definitely cool; transforming hCalendar and hCard back into their base formats (iCal and vCard respectively) is perhaps the first (only?) publicly available consumer-side application of these microformats.

    Can you offer any insight into the experience of producing the XSLT files? It looks a bit more involved (almost 1500 lines of XSLT for each transformation) than, say, Perl or Python, but it’s nice to be able to run XSLT essentially anywhere, server or client.

  3. brian said,

    June 10, 2005 at 5:34 pm

    The XSLT file is pretty big, and probably will just keep growing. This is what happens with an XML template language: you can’t do the fancy obfuscated one-line programs like you can in Perl. Also, XSLT is not a language built for mathematical operations, so to simply add one hour to a timestamp requires loads of if/then/else to check for hours rolling over days, which roll over months, which roll over years, which could be a leap-year-day… etc. Then you need to pad zeros to make sure it is still a valid UTC timestamp.
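For contrast, a sketch of the same operation in a language with date arithmetic built in. It assumes the timestamp is in iCalendar’s UTC form (YYYYMMDDTHHMMSSZ); the function name is made up for the example and none of this is taken from X2V:

```python
from datetime import datetime, timedelta

# iCalendar UTC date-time form, e.g. 20050609T123200Z (assumed input format).
ICAL_UTC = "%Y%m%dT%H%M%SZ"

def add_hours(stamp, hours=1):
    """Add hours to a UTC timestamp; rollover and zero padding come for free."""
    moment = datetime.strptime(stamp, ICAL_UTC)
    return (moment + timedelta(hours=hours)).strftime(ICAL_UTC)

print(add_hours("20040229T233000Z"))  # 20040301T003000Z (leap day rolls over)
print(add_hours("20051231T233000Z"))  # 20060101T003000Z (year rolls over)
```

Every branch described above (hour, day, month, year, leap day, padding) is handled inside the date library, which is what has to be written by hand in XSLT.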

    Another template that needed to be built was the ability to escape commas. In iCal/vCard you need to escape commas as \, so there is a recursive template that finds a comma, pops the string before it, adds \, and repeats on the rest of the string.
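The shape of that recursion can be sketched in a few lines of Python. This mirrors the description only; it is not the X2V template itself, and the function name is hypothetical:

```python
def escape_commas(text):
    """Escape each comma as \\, by recursing on the text after the first comma."""
    head, comma, rest = text.partition(",")
    if not comma:
        return text  # base case: no comma left
    return head + "\\," + escape_commas(rest)

print(escape_commas("Bristol, UK, Europe"))  # Bristol\, UK\, Europe
```

In Python one would normally just call text.replace(",", "\\,"); the recursive form is shown because XSLT 1.0 has no such general string replacement, which is why a recursive template is needed there.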

    Also, most of the properties have several potential places in the HTML where the encoding can be placed: hrefs, srcs, titles, alts, node values. Each needs to be checked and output accordingly.

    Another place that takes A LOT of code is property attributes, such as TEL:type=home,work,pref. XSLT variables are not assigned the way they are in most programming languages: they are assigned when you enter/leave the templates. It is a very different way of thinking, so the templates have to do lots of checking to see what they have found and haven’t. Then, to compound things, you need to add a comma between each value, so you also need to check whether there IS another value, without an array of the ones you have already found, and add or don’t add a comma depending on whether this is the last value. Try to walk through the code and you’ll see what I mean.

    Finally, there are just a lot of properties that need to be checked for; compounded with all the places it needs to look, the XSLT just gets BIG. A lot of the templates are copies of each other, possibly with minor tweaks here and there. I’m not the greatest XSLT programmer, so there are probably some optimizations that can be done, but I’m pretty happy with the overall product so far. The next big task is to take relative URLs (from LOGO and other image references) and convert them to absolute URLs. That will be another big template!