Exporting Markdown and XML From Google Docs

Just over a year ago, we started production of a new OU course using Google docs as the medium within which we’d share draft course materials. This was something of an experiment to see whether the online social document medium encouraged sharing and discussion of ideas, resources, ongoing feedback and comment of the work in progress amongst the course team, rather than, or at least, in addition to, the traditional handover of significant chunks of the course at set handover dates. (In case you’re wondering, it didn’t…)

The question we’re now faced with is, how do we get the content that’s in Google docs into the next stage of the OU document workflow.

The actual HTML based course materials that appear in the VLE (or the ebook versions of them, etc) are generated automatically from an “OU Structured Content” XML document. The XML documents are prepared using the <oXygen/> XML editor, extended with an OU structured content framework that includes the requisite DTDs/schema files, hooks for rendering, previewing and publishing into the OU environment and so on. (Details about the available tags can be found here: tag guide [OU internal link].)

Whilst the preferred authoring route is presumably that authors use the <oXygen/> editor from the start, the guidance also suggests that many authors use Word (and an appropriate OU style sheet) and then copy and paste content over into the XML editor, at which point some amount of tidying and retagging may be required.

As Google docs doesn’t seem to support the addition of custom style elements or tags (users are limited to customising the visual style of provided style elements), we need to find another way of getting content from Google docs and into <oXygen/>. One approach would be to grab a copy of the Google doc into Microsoft Word, apply an OU template to mark up the content using appropriate custom style elements, and then copy the content over to the XML editor, upon which point it will probably need further tidying.

Another approach is to try to export the data as an XML document. Looking around, I found a Google Apps script script (sic) that allows you to export the content in a Google doc as markdown. Whilst markdown documents don’t have the same tree-like document structure as XML, it did provide an example of how to parse a Google docs document. My first attempt at a script to export a Google doc in the OU XML format can be found here: export Google doc as OU SC-XML. (Note: the script is subject to change; now I have a basic operational/functional spec, I can start to try to tidy up the code and try to parse out more structure…)

Having got a minimal exporter working, the question now arises as to where effort needs to be spent next. The exporter produces a minimal form of OU-XML that sometimes validates and (in early testing) sometimes doesn’t. (If the script is working properly it should produce output that always validate as XML; then it should produce output that always validate as OU-XML). Should time be spent improving the script to produce better XML, or can we live with the fact that the exporter gets the document some way in to <oXygen/>, but further work will be required to fix a few validation breaks?

Another issue that arises is how rich a form of OU-XML we try to export. When working in Microsoft Word environment, a document style can be defined using elements that map onto the elements in OU-XML. When working in Google docs, we need to define a convention that the parser can respond to.

At the moment, the parser is sensitive to:

  • HEADING1: treated as Session;
  • HEADING2: treated as Section;
  • HEADING3: treated as SubSection;
  • LIST_ITEM: numbered and unnumbered lists are treated as BulletedList; sublists to depth 1 are supported as BulletedSubsidiaryList
  • coloured text: treated as AuthorComment; at the moment this may incorrectly grab title elements as such;
  • INLINE_IMAGE: images are rendered as relatively referenced Figure elements with an empty Description element. A copy of the image is locally stored. (Note: INLINE_DRAWING elements are unsupported – there’s no way of exporting them; maybe I should export an empty Figure with a Description saying there’s a missing INLINE_DRAWING?)
  • TABLE: rendered as Table with empty TableHead;
  • bold, italic, LinkUrl: rendered as b, i and a tags respectively;
  • font.COURIER_NEW: rendered as ComputerCode.
  • By convention, we should be able to detect and parse things like activities, exercises and SAQs. Some mechanism needs to be supported for identifying the block elements in such cases. For example, one convention might take the form:

    Exercise N

    Discussion

    End

    The ^Exercise and ^End$ elements denote the block; heading style (eg HEADING4 for the ^Exercise (and perhaps HEADING5 for the ^End$?) could further aid detection?

    Another approach would be to use horizontal lines to denote the start and stop of a block. For example:


    SAQ

    My Answer

    where represents a horizontal line and denotes the start and end of a block. Again, heading styles within the block could either identify or reinforce a particular block element type.

    Rendering a preview of the OU-XML as it would appear in the OU VLE is possible by uploading the OU-XML file, or a zip file containing it and related assets, to an OU URL that sits behind an OU authentication care. The <oXygen/> editor handles previews by using the default web client – your default browser – to post a selected XML document to the appropriate OU upload/preview URL. Unfortunately, the functions that allow http POST operations from Google app script run on Google servers, which means that we can’t just create a button in Google docs that would post and XML export of the current Google doc to the authentication-required OU URL. (I don’t know if this would be possible in an OU/Google apps domain?).

    I’m not sure if a workaround would be to launch a preview window in the browser from Google docs containing a copy of the OU XML version of a document, highlighting the XML, then using a bookmarklet to post the highlighted XML to the OU preview service URL within the browser context using a browser where the user has logged in to the OU web domain? Alternatively, could a Chrome application access both content from Google Drive and then post to the authenticated OU preview URL using browser permissions? (That is, can a Chrome app access both Google Drive using a user’s Google permissions, or machine access permissions, and then post content grabbed from that source to the OU URL using permissions granted to the browser?) As ever, my ignorance about browser security policies on the one hand, and the Google Apps/Chrome apps security model on the other, make it hard to know what workarounds might be possible.

    If any members of OU staff would like to try out the exporter, please get in touch for hints, or let me know how you get on:-) In addition, if any members of OU staff are using Google docs for course production, I’d love to know how you’re using them and how you’re getting on:-)

    PS via @mahawksey, I see that we can associate metadata with a doc using PropertiesService.getDocumentProperties(). Could be handy for adding things like course code, author metadata, publishing route, template etc to a doc. I’m not sure if we can also associate metadata with a folder, though I guess we could also include a file in a folder that contains metadata relating to files held more generally within the same folder?

    Author: Tony Hirst

    I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: