Whenever you publish HTML on the web, there is a possibility that someone, somewhere, will think that the information on your page is so interesting or valuable that they want to republish a particular part of it elsewhere.
So if you publish a list of calendar events on a webpage, there may well be someone out there who sees the benefit of making that list more widely available, for example by including the dates in another, more comprehensive aggregating calendar. (A great example of this is Jon Udell’s Elm City project; in an academic context, see Jim Groom’s Aggregating Google Calendars.)
In the case of calendar dates, if you’re feeling helpful, you can publish the calendar events using a syndication format such as iCal. If you’re feeling unhelpful, you can write a mess of HTML with no particular structure, putting dates all over the web page in a variety of formats and using a variety of HTML markup. And somewhere in between, you could publish the information in a semantically meaningful way, where the HTML structure can be used to identify the different components of an event record (event name, date, location, and so on).
Why would you do this? Well, if semantics are included in the page structure, CSS styling can reveal that meaning in an appropriate presentational way, which makes life easier for the web page designer. And depending on how you semantically mark up your web page, browser add-ons might be able to exploit that structure to provide additional user functionality, such as adding selected calendar dates on a web page to a personal calendar.
Semantics can be added to calendar information in a web page in an informal way as tabulated data, where separate columns in an HTML table might identify things like the event name, date, location, etc. Semantics can also be associated with each element in an event record using a standardised markup convention such as the hCalendar microformat.
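To make the idea of a markup convention a little more concrete, here is a minimal Python sketch of what reading hCalendar style markup might look like. The class names (vevent, summary, dtstart, dtend, location) come from the hCalendar microformat; the sample HTML, the event details and the helper names are invented for illustration, and the parser only copes with simple, well-nested markup.

```python
from html.parser import HTMLParser

# Made-up sample markup: one table cell marked up as an hCalendar event.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="vevent">
      <span class="summary">Example lecture</span>
      <abbr class="dtstart" title="2009-09-28">28 Sept</abbr>
      <span class="location">Room 1</span>
    </td>
  </tr>
</table>
"""

FIELDS = {"summary", "dtstart", "dtend", "location"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HCalendarParser(HTMLParser):
    """Collects one dict per element carrying the 'vevent' class."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._event = None   # dict for the event currently being read
        self._depth = 0      # open-tag depth inside the current event
        self._field = None   # property whose text is being collected

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        if self._event is None:
            if "vevent" in classes:
                self._event, self._depth = {}, 1
            return
        self._depth += 1
        found = FIELDS & classes
        if found:
            name = found.pop()
            if tag == "abbr" and attrs.get("title"):
                # abbr pattern: the machine-readable value sits in title
                self._event[name] = attrs["title"]
            else:
                self._field = name
                self._event[name] = ""

    def handle_data(self, data):
        if self._event is not None and self._field:
            self._event[self._field] += data

    def handle_endtag(self, tag):
        if self._event is None or tag in VOID_TAGS:
            return
        self._field = None
        self._depth -= 1
        if self._depth == 0:
            # The element that opened the event has just closed.
            cleaned = {k: v.strip() for k, v in self._event.items()}
            self.events.append(cleaned)
            self._event = None


def extract_events(html):
    parser = HTMLParser.__new__(HCalendarParser)
    parser.__init__()
    parser.feed(html)
    return parser.events


print(extract_events(SAMPLE_HTML))
```

The point is simply that because the class names are agreed in advance, the same few lines of scraping code should work on any page that follows the convention.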
A good example of microformats in action can be found on the University of Bath Semester Timetable:
Here’s how each cell (each ‘event’) is represented in the table using hCalendar:
Because hCalendar is a recognised format (by some people at least!), several tools already exist for scraping it efficiently from a web page. One example is Brian Suda’s X2V service, “a BETA implementation of an XSLT file to transform and hCa* encoded XHTML file into the corresponding vCard/iCalendar file”.
Generating the iCal feed at the click of a button gives me something I can subscribe to in my desktop calendar:
And here it is:
Brian’s approach is based on the use of XSLT to extract the microformatted data from the page and re-present it in another format. Essentially, using microformats ‘in page’ allows pre-defined screenscraping utilities to effectively implement an API on top of the page that exposes particular data contained within it. The W3C GRDDL Recommendation generalises this sort of approach.
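As a rough illustration of the second half of that process (this is not X2V's XSLT, just a plain Python stand-in), the sketch below turns already-extracted event records into a minimal iCalendar document. The dictionary keys, the PRODID string and the UID scheme are all assumptions made for the example, and the dates are assumed to be whole days in YYYY-MM-DD form.

```python
from datetime import datetime, timezone


def escape_text(value):
    """Escape the characters that iCalendar TEXT values treat specially."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))


def to_icalendar(events, prodid="-//example//hcal-sketch//EN", now=None):
    """Render event dicts (summary, dtstart, optional location) as iCalendar."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    lines = ["BEGIN:VCALENDAR", "VERSION:2.0", "PRODID:" + prodid]
    for i, event in enumerate(events):
        lines += [
            "BEGIN:VEVENT",
            # Hypothetical UID scheme; a real feed needs stable unique ids.
            f"UID:{i}-{stamp}@example.invalid",
            "DTSTAMP:" + stamp,
            "DTSTART;VALUE=DATE:" + event["dtstart"].replace("-", ""),
            "SUMMARY:" + escape_text(event["summary"]),
        ]
        if event.get("location"):
            lines.append("LOCATION:" + escape_text(event["location"]))
        lines.append("END:VEVENT")
    lines.append("END:VCALENDAR")
    # iCalendar content lines are separated by CRLF.
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    sample = [{"summary": "Example lecture", "dtstart": "2009-09-28",
               "location": "Room 1"}]
    print(to_icalendar(sample))
```

Saving the output with a .ics extension should give a file that a desktop calendar can import, although long lines are not folded here as a fuller implementation would do.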
Another, more recent take on scraping conventionally (or consistently) marked up information from web pages is Yahoo’s YQL. For some time, Yahoo have been providing tools and utilities for scraping structured data from webpages so that it can be used to augment search results listings (Yahoo SearchMonkey), but YQL takes this a whole step further.
YQL offers a SQL-like query language that provides a search query console over the web, as well as over individual pages on the web. So for example, we can scrape all the microformatted entries from a webpage using a query of the form:
Here’s the result:
A RESTful URI can be constructed to run this query and return the results as XML or JSON, which can then be used elsewhere.
(Note that YQL can be used equally well to scrape more loosely structured pages – XPath and CSS selector statements can both be used in a YQL query to extract the part of the page you want to get hold of.)
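To give a flavour of what an XPath-style selection does, here is a small local sketch using Python's standard library rather than YQL itself. The page content and the table id are made up, and ElementTree only supports a limited subset of XPath and needs well-formed XHTML, so this is an illustration of the idea rather than a general scraper.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page with no microformats, just a plain table.
PAGE = """<html><body>
<h1>Events</h1>
<table id="events">
  <tr><th>Event</th><th>Date</th></tr>
  <tr><td>Open day</td><td>2009-10-03</td></tr>
  <tr><td>Exam week</td><td>2010-01-18</td></tr>
</table>
</body></html>"""


def table_rows(xhtml, table_id):
    """Return the text of each data row in the table with the given id."""
    root = ET.fromstring(xhtml)
    # Path expression: any table with a matching id, then its tr children.
    rows = root.findall(f".//table[@id='{table_id}']/tr")
    return [
        [cell.text for cell in row.findall("td")]
        for row in rows
        if row.find("td") is not None  # skip the header row of th cells
    ]


if __name__ == "__main__":
    for event, date in table_rows(PAGE, "events"):
        print(date, event)
```

Here the meaning of each column lives in the scraper (first cell is the event, second is the date) rather than in the page, which is exactly the burden that microformats take away.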
As well as microformats, YQL also treats a wide range of other content as queryable “datatables on the web”, and provides a way for developers to define their own datatable interfaces to their own web pages, or equally to pages on third party sites.