
OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education and data journalism. Snarky and sweary to anyone who emails to offer me content for the site.

Category: Radical Syndication

Using GetTheData to Organise Your Data/API FAQs?

It’s generally taken as read that folk hate doing documentation*. This is as true of documenting data and APIs as it is of code. I’m not sure if anyone has yet done a review of “what folk want from published datasets” (JISC? It’s probably worth a quick tender call…?), but there have certainly been a few reports around what developers are perceived to expect of an API and its associated documentation and community support (e.g. UKOLN’s JISC Good APIs Management Report and API Good Practice reports, and their briefing docs on APIs).

* this is one reason why I think bloggers such as myself, Martin Hawksey and Liam Green Hughes offer a useful service: we do quick demos and getting-started walkthroughs of newly launched services, demonstrating their application in a “real” context…

At a recent technical advisory group meeting in support of the Resource Discovery Taskforce UK Discovery initiative (which is aiming to improve the discoverability of information resources through the publication of appropriate metadata, and hopefully a bit of thought towards practical SEO…) I suggested that a Q&A site might be in order to support developer activities: content is likely to be relevant, pre-SEO’d (blending naive-language questions with technical answers), and maintained and refreshed by the community :-)

In much the same way that JISCPress arose organically from the ad hoc initiative between myself and Joss Winn that was WriteToReply, I suggested that the question and answer site with a focus on data that I set up with Rufus Pollock might provide a running start for a UK Discovery Q&A site: GetTheData.

OSQA, the codebase that underpins GetTheData, still lacks an API, but there are mechanisms for syndicating content via RSS feeds (for example, it’s easy enough to get a feed of tagged questions, or of questions and answers relating to a particular search query); which is to say, we could pull ukdiscovery-tagged questions and answers into the UK Discovery website developers’ area.
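As a quick sketch of the sort of glue that syndication step needs (the feed URL pattern shown in the comment is hypothetical, not necessarily OSQA’s actual route), pulling tagged questions out of an RSS feed with nothing but the Python standard library might look like:

```python
import urllib.request
import xml.etree.ElementTree as ET

def parse_rss_items(rss_xml):
    """Pull title/link pairs out of an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]

def fetch_feed_items(feed_url):
    """Fetch a feed over HTTP and parse out its items."""
    with urllib.request.urlopen(feed_url) as resp:
        return parse_rss_items(resp.read())

# e.g. (hypothetical URL pattern):
# items = fetch_feed_items("https://getthedata.org/feeds/rss/?tags=ukdiscovery")
```

The parsed items could then be templated into the developers’ area of a third-party site, or cached and re-rendered on a schedule.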

Another issue relates to whether or not developers would actually engage in asking and answering questions around UK Discovery technical issues. Something I’ve been mulling over is the extent to which GetTheData could actually be used to provide Q&A-style support documentation for published data or data APIs, concentrating a wide range of data-related Q&A content on GetTheData (and hence helping to build community/activity through regularly refreshed content and a critical mass of active users) and then syndicating specific content to a publisher’s site.

So for example: if a data/api publisher wants to use GetTheData as a way of supporting their documentation/FAQ effort, we could set them up as an admin and allow them rights over the posting and moderation of questions and answers on the site. (Under the current permissions model, I think we’d have to take it on trust that they wouldn’t mess with other bits of the site in a reckless or malevolent way…;-)

API/data publishers could post FAQ-style questions on GetTheData and provide canned, accepted (“official”) answers. Of course, the community could also submit additional answers to the FAQs, and answers that improve on the official one could be promoted to accepted status. Through syndication feeds, maybe using a controlled tag filtered through a question-submitter filter (i.e. filtering questions by virtue of who posted them), it would be possible to get a “maintained” list of questions out of GetTheData that could then be pulled via an RSS feed into a third-party site – such as the FAQ area of a data/API publisher’s website.
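The submitter filter is simple enough to sketch. The item records below are hypothetical examples of what a feed parse might return (in RSS 2.0 the poster typically appears in each item’s author or dc:creator element):

```python
def official_faq(items, official_submitters):
    """Filter feed items down to those posted by 'official' accounts."""
    return [i for i in items if i.get("author") in official_submitters]

# Hypothetical parsed feed items
items = [
    {"title": "How do I request the data as CSV?", "author": "publisher-team"},
    {"title": "Why is field X sometimes empty?", "author": "some-other-user"},
]

faqs = official_faq(items, {"publisher-team"})
# faqs now contains only the publisher-posted, "official" FAQ item
```

Combined with a controlled tag, this gives a publisher a maintained FAQ list without having to host the Q&A machinery themselves.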

Additional activity (i.e. community-sourced questions and answers) around the data/API on GetTheData could also be selectively pulled in to the official support site. (We may also be able to pull out lists of people who are active around a particular tag?) In the medium term, it might also be possible to find a way of supporting remote question submission that could be embedded on the API/data site…

If any data/API publishers would like to explore how they might be able to use GetTheData to power FAQ areas of their developer/documentation sites, please get in touch:-)

And if anyone has comments about the extent to which GetTheData, or OSQA, either is or isn’t appropriate for discovery.ac.uk, please feel free to air them below…:-)

Author: Tony Hirst | Posted on June 20, 2011 | Categories: Anything you want, Radical Syndication | Tags: getthedata, osqa, rdtf, ukdiscovery | 2 Comments on Using GetTheData to Organise Your Data/API FAQs?

An R-chitecture for Reproducible Research/Reporting/Data Journalism

It’s all very well publishing a research paper that describes the method for, and results of, analysing a dataset in a particular way, or a news story that contains a visualisation of an open dataset, but how can you do so transparently and reproducibly? Wouldn’t it be handy if you could “View Source” on the report to see how the analysis was actually done, or how the visualisation was actually created from an original dataset? And furthermore, how about if the actual chart or analysis results were created directly as a result of executing the script that “documents” the process used?

As regular readers will know, I’ve been dabbling with R – and the RStudio environment – for some time, so here’s a quick review of how I think it might fit into a reproducible research, data journalism or even enterprise reporting process.

The first thing I want to introduce is one of my favourite apps at the moment, RStudio (source on github). This cross-platform application provides a reasonably friendly environment for working with R. More interestingly, it integrates with several other applications:

  1. RStudio offers support for the git version control system. This means you can save R projects and their associated files to a local, git controlled directory, as well as managing the synchronisation of the local directory with a shared project on Github. Library support also makes it a breeze to load in R libraries directly from github.
  2. R/RStudio can pull in data from a wide variety of sources, mediated by a variety of community developed R libraries. So for example, CSV and XML files can be sourced from a local directory, or a URL; the RSQLite library provides an interface to SQLite; RJSONIO makes it easy to work with JSON files; wrappers also exist for many online APIs (for example, twitteR for Twitter, RGoogleAnalytics for Google Analytics, and so on).
  3. RStudio provides built-in support for two “literate programming” style workflows. Sweave allows you to embed R scripts in LaTeX documents and then compile the documents to a final PDF that includes the outputs from executing the embedded scripts. (So if the script produces a table of statistical results based on an analysis of an imported data set, the results table will appear in the final document. If the script is used to generate a visual chart, the chart image will appear in the final document.) The raw script “source code” that is executed by Sweave can also be embedded explicitly in the final PDF, so you can see the exact script that was used to create the reported output (stats tables of results, or chart images, etc). If writing LaTeX is not really your thing, RMarkdown allows you to write Markdown scripts and again embed executable R code, along with any outputs directly derived from executing that code. Using the knitr library, the RMarkdown+embedded R code can be processed to produce an HTML output bundle (HTML page + supporting files (image files, javascript files, etc)). Note that if the R code uses something like the googleVis R library to generate interactive Google Visualisation Charts, knitr will package up the required code into the HTML bundle for you. And if you’d rather generate an HTML5 slidedeck from your Rmarkdown, there’s always Slidify (eg check out Christopher Gandrud’s course “Introduction to Social Science Data Analysis” – Slidify: Things are coming together fast, example slides and “source code”).
  4. A recent addition, RStudio now integrates with RPubs.com, which means 1-click publishing of RMarkdown/knitr’d HTML to a hosted website is possible. Presumably, it wouldn’t be too hard to extend RStudio so that publication to other online environments could be supported. (Hmm, thinks… could RStudio support publication using Github pages maybe, or something more general, such as SWORD/Atom Publishing?!) Other publication routes have also been demonstrated – for example, here’s a recipe for publishing to WordPress from R.
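By way of a minimal sketch of that RMarkdown workflow (the dataset URL is just a placeholder): a source document mixes Markdown narrative with executable R chunks, and knitting it runs the chunks and embeds their output in the generated HTML:

````markdown
---
title: "A minimal reproducible report"
output: html_document
---

Some narrative, followed by an executable chunk; when the document
is knitted, the chunk is run and its output (the summary table and
the chart) appears in the rendered report:

```{r}
df <- read.csv("http://example.com/some-open-dataset.csv")
summary(df)
plot(df)
```
````

Because the chunk is the thing that actually produces the reported numbers and chart, the document is its own “View Source” for the analysis.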

Oh, and did I mention that, as well as running cross-platform on the desktop, RStudio can also be run as a service and accessed via a web browser? So for example, I can log into a version of RStudio running on one of OU/KMi’s servers and access it through my browser…

Here’s a quick doodle of how I see some of the pieces hanging together. I had intended to work on this a little more, but I’ve just noticed the day’s nearly over, and I’m starting to flag… But as I might not get a chance to work on this any more for a few days, here it is anyway…

PS I guess I should really have written and rendered the above diagram using R, and done a bit of dogfooding by writing this post in RMarkdown to demonstrate the process, but I didn’t… The graph was actually rendered from a .dot source file using Graphviz. Here’s the source, so if you want to change the model, you can… (I’ve also popped the script up as a gist):

digraph G {

	subgraph cluster_1 {
		Rscript -> localDir;
		localDir -> Rscript;
		Rscript -> Sweave;
		Sweave -> TeX;
		TeX -> PDF [ label = "LaTeX" ];
		Rscript -> Rmarkdown;
		RCurl -> Rscript;
		Rmarkdown -> HTML [ label = "knitr" ];
		Rmarkdown -> localDir;
		Sweave -> localDir;
		label = "Local machine/\nServer";
		
		RJSONIO -> Rscript;
		XML -> Rscript;
		RSQLite -> Rscript;
		SQLite -> RSQLite;
		subgraph cluster_2 {
			XML;
			RJSONIO;
			RCurl;
			RSQLite;
			label = "Data sourcing";
		}
	}
	OnlineCSV -> RCurl;
	
	ThirdPartyAPI -> RJSONIO;
	ThirdPartyAPI -> XML;
	ThirdPartyAPI -> RCurl;
	
	
	localDir -> github [ label = "git" ];
	github -> localDir;
	HTML -> RPubs;
}

PS This is related, and very relevant – Melbourne R user group presentation: Video: knitr, R Markdown, and R Studio: Introduction to Reproducible Analysis. And this: New Tools for Reproducible Research with R

PPS See also: Data Reporting with knitr and Open Research Data Processes: KMi Crunch – Hosted RStudio Analytics Environment

Author: Tony Hirst | Posted on July 15, 2012 (updated September 12, 2012) | Categories: Infoskills, OU2.0, Radical Syndication, Rstats | 4 Comments on An R-chitecture for Reproducible Research/Reporting/Data Journalism

Course Management and Collaborative Jupyter Notebooks via SageMathCloud (now CoCalc)

Prompted by a joint course module team to look at options surrounding a “virtual computing lab” to support a couple of new level 1 (first year equivalent) IT and computing courses (they should know better?!;-), I had another scout around and came across SageMathCloud, which at first glance looks to be just magical :-)

It’s an open source, cloud hosted system [code]; the free plan allows users to log in with social media credentials and create their own account space:

[Screenshot: SageMathCloud sign-in]

Once you’re in, you have a project area in which you can define different projects:

I’m guessing that projects could be used by learners to split out different projects within a course, or perhaps use a project as the basis for a range of activities within a course.

Within a project, you have a file manager:

[Screenshot: project file manager]

The file manager provides a basis for creating application-linked files; of particular interest to me is the ability to create Jupyter notebooks…

[Screenshot: creating a new Jupyter notebook file]

Jupyter Notebooks

Notebook files are opened into a tab. Multiple notebooks can be open in multiple tabs at the same time (though this may start to hit performance on the server? pandas dataframes, for example, are held in memory, and the SMC default plan could mean memory limits get hit if you try to hold too much data in memory at once?)

[Screenshot: Jupyter notebook open in a project tab]
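A quick way to get a feel for how much of the plan’s memory a dataframe is actually holding (assuming pandas is available in the notebook kernel):

```python
import numpy as np
import pandas as pd

# A dataframe lives wholly in the kernel's memory, so a few open
# notebooks each holding frames like this soon add up
df = pd.DataFrame(np.random.random((100_000, 10)))

# memory_usage(deep=True) reports bytes per column; sum for the total
total_bytes = df.memory_usage(deep=True).sum()
print(round(total_bytes / 1e6, 1), "MB")  # ~8 MB: 100k rows x 10 float64 columns
```

Multiply that by several notebooks per project and a memory-capped free plan starts to look like a real constraint.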

Notebooks are autosaved regularly – and a time slider that allows you to replay and revert to a particular version is available, which could be really useful for learners? (I’m not sure how this works – I don’t think it’s a standard Jupyter offering? I also imagine that the state of the underlying Python process gets dislocated from the notebook view if you revert? So cells would need to be rerun?)

[Screenshot: notebook autosave history slider]

Collaboration

Several users can collaborate on a project. I created another me by creating an account using a different authentication scheme (which leads to a name clash – and I think an email clash – but SMC manages to disambiguate the different identities).

[Screenshot: adding a collaborator to a project]

As soon as a collaborator is added to a project, they share the project and the files associated with the project.

[Screenshot: shared project as seen by both accounts]

Live collaborative editing is also possible. If one me updates a notebook, the other me can see the changes happening – so a common notebook file is being updated by each client/user (I was typing in the browser on the right with one account, and watching the live update in the browser on the left, authenticated using a different account).

[Screenshot: live collaborative editing across two accounts]

Real-time chatrooms can also be created and associated with a project – they look as if they might persist the chat history too?

[Screenshot: project chatroom]

Courses

The SageMathCloud environment seems to have been designed by educators for educators. A project owner can create a course around a project and assign students to it.

(It looks as if students can’t be collaborators on a project, so when I created a test course, I uncollaborated with my other me and then added my other me as a student.)

[Screenshot: course management view]

A course folder appears in the project area of the student’s account when they are enrolled on a course. A student can add their own files to this folder, which can then be inspected by the course administrator.

[Screenshot: course folder in the student’s project area]

A course administrator can also add one or more of their other project folders, by name, as assignment folders. When an assignment folder is added to a course and assigned to a student, the student can see that folder, and its contents, in their corresponding course folder, where they can then work on the assignment.

[Screenshot: assignment folder as seen by the student]

The course administrator can then collect a copy of the student’s assignment folder and its contents for grading.

[Screenshot: collecting a student’s assignment folder]

The marker opens the folder collected from the student, marks it, and may add feedback as annotations to the notebook files, returning the marked assignment to the student – where it appears in another “graded” folder, along with the grade.

[Screenshot: graded assignment returned to the student]

Summary

At first glance, I have to say I find this whole thing pretty compelling.

In an OU context, it’s easy enough imagining that we might sign up a cohort of students to a course, and then get them to add their tutor as a collaborator who can then comment – in real time – on a notebook.

A tutor might also hold a group tutorial by creating their own project and then adding their tutor group students to it as collaborators, working through a shared notebook in real time as students watch on in their own notebooks, with students perhaps contributing back in response to a question from the tutor.

(I don’t think there is an audio channel available within SMC, so that would have to be managed separately? [UPDATE: seems there is some audio support – via William Stein, “if you click on the chat to the right of most file types (e.g., make a .md file), then there is a video camera, and if you click on that, you can broadcast yourself to other viewers of the file”.])

Wishlist

So what else would be nice? I’ve already mentioned audio collaboration, though that’s not essential and could be easily managed by other means.

For a course like TM351, it would be nice to be able to create a composition of linked applications within a project – for example, it would be nice to be able to start a PostgreSQL or MongoDB server linked to the Jupyter server so that notebooks could interact directly with a DBMS within a project or course setting. I also note that the IPython kernel being used appears to be the 2.7 version, and wonder how easy it is to tweak the settings on the back-end, or via an administration panel somewhere, to enable other Jupyter kernels?
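On the kernel question, one quick check (run from a notebook cell) reports which Python the kernel is actually providing:

```python
import sys

# On a Python 2.7 kernel this reports (2, 7); a Python 3 kernel
# reports (3, x)
print(sys.version_info[:2])
```

That at least confirms whether a back-end tweak is needed before course material written for Python 3 can run.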

I also wonder how easy it would be to add in other applications that are viewable through a browser, such as OpenRefine or RStudio?

In terms of how the backend works, I wonder if the Sandstorm.io encapsulation would be useful (eg in context of Why doesn’t Sandstorm just run Docker apps?) compared to a simpler docker container model, if that indeed is what is being used?

Author: Tony Hirst | Posted on November 24, 2015 (updated October 18, 2018) | Categories: OU2.0, Radical Syndication, Rstats | 1 Comment on Course Management and Collaborative Jupyter Notebooks via SageMathCloud (now CoCalc)

© AJ Hirst 2008-2021
Creative Commons License
Attribution: Tony Hirst.
