Fragments – Wikipedia to Markdown

I’ve been sketching some ideas, pondering the ethics of doing an F1 review style book blending (openly licensed) content from Wikipedia race reports with some of my own f1datajunkie charts, and also wondering about the extent to which I could automatically generate Wikipedia style race report sentences from the data; I think the sentence generation, in general should be quite easy – the harder part would be identifying the “interesting” sentences (that is, the ones that make it into the report, rather than than the totality of ones that could be generated).

So far, my sketches have been based around just grabbing the content from Wikipedia, and transforming to markdown, the markup language used in the Leanpub workflow:

In Python 3.x at least, I came across some encoding issues, and couldn’t seem to identify Wikipedia page sections. For what it’s worth, a minimal scribble looks something like this:

!pip3 install wikipedia
import wikipedia

#Search for page titles on Wikipedia
wikipedia.search('2014 Australian grand prix')

#Load a page
f1=wikipedia.page('2014 Australian Grand Prix')

#Preview page content
f1.content

#Preview a section's content by section header
f1.section('Qualifying')
##For some reason, f1.sections shows an empty list for me?


#pandoc supports Wikimedia to markdown conversion
!apt-get -y install pandoc
!pip3 install pypandoc
import pypandoc

#To work round encoding issues, write the content to a file and then convert it...
f = open('delme1.txt', 'w', encoding='utf8')
f.write(f1.content)
f.close()

md=pypandoc.convert('delme1.txt', 'md', format='mediawiki')

If the Formula One race report pages follow similar templates and use similar headings, then it should be straightforward enough to pull down sections of the reports and interleave them with charts and tables. (As well as issues parsing out section headers to fill the sections list, the tables on the page don’t appear to be grabbed into the .content field (assuming the API wrapper does manage to grab that content down? However, I can easily recreate those from things like the ergast API).

Looking at the construction of sentences in the race reports, many of them are formulaic. However, as noted above, generating sentences is one thing, but generating interesting sentences is another. For that, I think we need to identify sets of rules that mark data features out as interesting or not before generating sentences from them.