Idle Reflections on Sensemaking Around Sporting Events, Part 1: Three Phases of Sports Event Journalism

Tinkering with motorsport data again has, as is the way of these things, also got me thinking about (sports) journalism again. In particular, a portion of what I’m tinkering with relates to ideas associated with "automated journalism" (aka "robot journalism"), a topic that I haven’t been tracking so much over the last couple of years and should probably revisit (for a previous consideration, see Notes on Robot Churnalism, Part I – Robot Writers).

But as well as that, it’s also got me thinking more widely about what sort of a thing sports journalism is, the sensemaking that goes on around it, and how automation might be used to support that sensemaking.

My current topic of interest is rallying, most notably the FIA World Rally Championship (WRC), but also rallying in more general terms, including, but not limited to, the Dakar Rally, the FIA European Rally Championship (ERC), and various British rallies that I follow, whether as a fan, spectator or marshal.

This post is the first in what I suspect will be an ad hoc series of posts following a riff on the idea of a sporting event as a crisis situation in which fans want to make sense of the event and journalists mediate, concentrate and curate information release and help to interpret the event. In an actual crisis, the public might want to make sense of an event in order to moderate their own behaviour or inform actions they should take, or they may purely be watching events unfold without any requirement to modify their behaviour.

So how does the reporting and sensemaking unfold?

Three Phases of Sports Event Journalism

I imagine that "event" journalism is a well categorised thing amongst communications and journalism researchers, and I should probably look up some scholarly references around it, but it seems to me that there are several different ways in which a sports journalist could cover a rally event and the sporting context it is situated in, such as a championship, or even a wider historical context ("best rallies ever", "career history" and so on).

In Seven characteristics defining online news formats: Towards a typology of online news and live blogs, Digital Journalism, 6(7), pp.847-868, 2018, Thorsen, E. & Jackson, D. characterise live event coverage in terms of "the vernacular interaction audiences would experience when attending a sporting event (including build-up banter, anticipation, commentary of the event, and emotive post-event analysis)".

More generally, it seems to me that there are three phases of reporting: pre-event, on-event, and post-event. And it also seems to me that each one of them has access to, and calls on, different sorts of dataset.

In the run up to an event, a journalist may want to set the championship and historical context, reviewing what has happened in the season to date, what changes might result to the championship standings, and how a manufacturer or crew have performed on the same rally in previous years; they may want to provide a technical context, in terms of recent updates to a car, or a review of how the environment may affect performance (for example, How very low ambient temperatures impact on the aero of WRC cars); or they may want to set the scene for the sporting challenge likely to be provided by the upcoming event — in the case of rallying, this is likely to include a preview of each of the stages (for example, Route preview: WRC Arctic Rally, 2021), as well as the anticipated weather! (A journalist covering an international event may also consider a wider social or political view around, or potential economic impact on, the event location or host country, but that is out-of-scope for my current consideration.)

Once the event starts, the sports journalist may move into live coverage as well as rapid analysis and, for multi-day events, backward-looking session, daily or previous-day reviews and forward-looking previews of the next day or later today.

For WRC rallies, live timing gives updates to timing and results data as stages run, with split times appearing on a particular stage as they are recorded, along with current stage rankings and time gaps. Stage level timing and results data from a large range of international and national rallies is more generally available, in near real-time, from the rally results database. For large international rallies, live GPS traces, refreshed every few seconds on the WRC+ live tracker map, also provide a source of near real-time location data. In some cases, "championship predictions" will be available, showing what the championship status would be if the event were to finish with the competitors in their current positions.

One other feature of WRC and ERC events is that drivers often give short, to-camera interviews at the end of each stage, as well as more formal "media zone" interviews after each loop. Often, the drivers or co-drivers themselves, or their social media teams, will post social media updates, as will the official teams. Fans on-stage may also post social media footage and commentary in near real-time.

The event structure also allows for review and preview opportunities throughout the event. Each day of a stage rally tends to be segmented into loops, each typically of three or four stages. Loops are often repeated, typically with a service or other form of regroup (including tyre and light fitting regroups) in between. This means that the same stages are often run twice, although in many cases the state of the surface may have changed significantly between loops.
(Gravel roads start off looking like tarmac; they end up being completely shredded, with twelve inch deep and twelve inch wide ruts carved into what looks like a black pebble beach…)
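
As an aside, the "championship predictions" idea mentioned above is simple enough to sketch in a few lines of Python. What follows is a minimal sketch rather than how any official prediction is actually calculated: the points table is an assumption (a 25-18-15-12-10-8-6-4-2-1 scheme for the top ten, ignoring any Power Stage bonus points), and the driver names and points totals are made up.

```python
# Points for finishing positions 1..10.
# ASSUMPTION: a simple top-ten scheme; bonus points (e.g. Power Stage) are ignored.
POINTS = [25, 18, 15, 12, 10, 8, 6, 4, 2, 1]


def predicted_standings(standings, current_order):
    """Project championship points if the rally finished in the current order.

    standings: dict mapping driver -> championship points before the event.
    current_order: list of drivers in their current overall rally positions.
    Returns a list of (driver, points) pairs, highest points first.
    """
    projected = dict(standings)
    for position, driver in enumerate(current_order):
        event_points = POINTS[position] if position < len(POINTS) else 0
        projected[driver] = projected.get(driver, 0) + event_points
    # Sort by points (descending), then by name so that ties are stable
    return sorted(projected.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical pre-event standings and current rally order
standings = {"Driver A": 30, "Driver B": 22, "Driver C": 18}
current_order = ["Driver C", "Driver A", "Driver B"]

prediction = predicted_standings(standings, current_order)
for driver, points in prediction:
    print(driver, points)
```

Re-running the projection each time the overall classification changes gives a running "as it stands" championship table, which is the sort of thing that might feed a mid-event talking point.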

In the immediate aftermath of the event, a complete set of timing and results data will be available, along with crew and team boss interviews and updated championship standings. At this point, there is an opportunity for a quick to press event review (in Formula One, the Grand Prix + magazine is published within a few short hours of the end of the race), followed by more leisurely analysis of what happened during the event, along with counterfactual speculation about what could have happened if things had gone differently or different choices had been made, in the days following the event.

Throughout each phase, explainer articles may also be used as fillers to raise general background understanding of the sport, as well as specific understanding of the generics of the sport that may be relevant to an actual event (for example, for a winter rally, an explainer article on studded snow tyres).

Fractal Reporting and the Macroscopic View

One thing that is worth noting is that the same reporting structures may appear at different scales in a multi-day event. The review-preview-live-review model works at the overall event level (previous event, upcoming event, on-event, review event), day level (previous day, upcoming day, on-day, review day), intra-day level (previous loop, upcoming loop, on-loop, review loop), intra-session level (previous stage, upcoming stage, on-stage, review stage) and intra-stage level (previous driver, upcoming driver, on-driver, review driver).

One of the graphical approaches I value for exploring datasets is the ability to take a macroscopic view, where you can zoom out to get an overall view of an event as well as being able to zoom in to a particular part of the event.

My own tinkerings with rally timing and results information are intended not only to present the information as a glanceable summary, but also to present the material in a way that supports story discovery using macroscope style tools that work at different levels.

By making certain things pictorial, a sports journalist may scan the results table for potential story points, or even story lines: what happened to driver X in stage Y? See how driver Z made steady progress from a miserable start to end up finishing well? And so on.

Rally timing and stage results review chartable.

The above chart summarises timing data at an event level, with the evolution of the rally positions tracked at the stage level. Where split times exist within a stage, a similar sort of chartable can be used to summarise evolution within a stage by tracking times at the splits level.

These "fractal" views thus provide the same sort of view over an event but at different levels of scale.
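
The position tracking that sits behind this sort of view can be sketched quite simply. The following is a minimal sketch, assuming stage times are available as a list of seconds per driver (the drivers and times are made up, and penalties, retirements and ties are all ignored): accumulate the stage times, then rank the running totals after each stage. The same function could be applied to split times within a stage to give the intra-stage view.

```python
from itertools import accumulate


def overall_positions(stage_times):
    """Track overall rally position after each stage.

    stage_times: dict mapping driver -> list of stage times in seconds.
    Returns a dict mapping driver -> list of overall positions (1 = leader),
    one entry per completed stage.
    """
    if not stage_times:
        return {}
    # Running total of rally time after each stage
    cumulative = {d: list(accumulate(times)) for d, times in stage_times.items()}
    # Only consider stages that every driver has a time for
    n_stages = min(len(totals) for totals in cumulative.values())
    positions = {d: [] for d in stage_times}
    for stage in range(n_stages):
        ranked = sorted(cumulative, key=lambda d: cumulative[d][stage])
        for position, driver in enumerate(ranked, start=1):
            positions[driver].append(position)
    return positions


# Hypothetical stage times, in seconds
stage_times = {
    "Driver X": [610.2, 432.0, 601.5],
    "Driver Y": [605.0, 440.1, 600.0],
    "Driver Z": [620.4, 425.3, 590.9],
}

positions = overall_positions(stage_times)
for driver, track in positions.items():
    print(driver, track)
```

Scanning the position tracks is one way of spotting candidate story lines, such as a driver who starts at the back and ends up leading.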

What Next?

Such are the reporting phases available to the sports journalist; but as I hope to explore in future posts, I believe there is also a potential for crossover in the research or preparation that journalists, event organisers, competitors and fans alike might indulge in, or benefit from when trying to make sense of an event.

In the next post in this series, I’ll explore in more detail some of the practices involved in each phase, and start to consider how techniques used for collaborative sensemaking and developing situational awareness in a crisis might relate to making sense of a sporting event.

Idle Reflections on Sensemaking Around Sporting Events, Part 2: The Practice of Event Reporting

In the first post of this series, Idle Reflections on Sensemaking Around Sporting Events, Part 1: Three Phases of Sports Event Journalism, I introduced a simplistic three phase model for sports event based reporting. In this post, I’ll explore in a little more detail some of the practices you might find associated with each phase.

A Note on Referencing and Quotes

The referencing style used in this series of posts is non-standard but hopefully complete. Many quotes are provided in a "raw" form and may be considered to be the equivalent of text that I might otherwise have highlighted for myself in a paper I am skim-reading. Page numbers for quotes are typically not provided: the assumption is that an electronic copy of the original paper will be available within which sentences can be searched for. Note that the order in which quotes are provided may not be the order in which they appeared in the original and quotes that appear close together in my notes may be far separated in the original.

Pre-Event Practices

One of the things I’ve been tinkering with recently is a set of sketches around the topic of Visualising WRC Rally Stage Routes. These started off as a quick attempt at generating 3D renderings of stage routes, but quickly evolved into an exploration of what I have started to term route metrics, where I start to explore things like the twistiness of a stage, as well as ways of viewing stages in a macroscopic way that still gives the reader a chance to make sense of the route. One trivial example of this is to chunk a route into 1km segments, perhaps annotated with additional detail, and then review each of those in turn:

Stage route chunked into 1km sections
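
By way of illustration, here is a minimal sketch of the chunking step, assuming the route is available as an ordered list of (latitude, longitude) points. This is not the code behind the image above: the function names are my own, the coordinates are made up, and each point is simply assigned to a segment by its distance along the route, with no attempt to interpolate exact segment boundaries.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius, in metres


def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def chunk_route(points, chunk_m=1000.0):
    """Group route points into consecutive segments of chunk_m metres.

    Each point is placed in the segment given by its distance into the route.
    """
    chunks = []
    distance = 0.0
    for i, point in enumerate(points):
        if i > 0:
            distance += haversine_m(points[i - 1], point)
        index = int(distance // chunk_m)
        while len(chunks) <= index:
            chunks.append([])
        chunks[index].append(point)
    return chunks


# Hypothetical route: 11 points heading due north, roughly 278 m apart
route = [(60.0 + i * 0.0025, 25.0) for i in range(11)]
chunks = chunk_route(route)
for km, chunk in enumerate(chunks):
    print(f"km {km}-{km + 1}: {len(chunk)} points")
```

Each chunk can then be plotted or annotated separately, for example with its own elevation profile.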

Another approach might be to look at the route as a whole and try to identify some of the tight corners at various locations on stage. Whilst this can be done by eye, generating various references into the stage, for example using a corner index number or distance into route to identify a particular location, makes it easier to locate (literally!) the focus of interest.

Tight corners on a stage.
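
One way of finding candidate tight corners automatically is to estimate the radius of curvature at each point on the route and flag anything below a threshold. The following is a minimal sketch of that idea rather than the method used to produce the image above. It assumes the route has already been projected into planar x, y coordinates in metres, it uses the radius of the circle through each set of three consecutive points as the curvature estimate, and the 20 m threshold is an arbitrary choice.

```python
import math


def circumradius(p, q, r):
    """Radius of the circle through three planar points (inf if collinear)."""
    a = math.dist(p, q)
    b = math.dist(q, r)
    c = math.dist(p, r)
    # The cross product magnitude is twice the area of the triangle
    twice_area = abs((q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0]))
    if twice_area == 0:
        return math.inf
    # R = abc / (4 * area)
    return (a * b * c) / (2 * twice_area)


def tight_corners(points, max_radius_m=20.0):
    """Return the index, distance into route and radius of each tight corner."""
    corners = []
    distance = 0.0
    for i in range(1, len(points) - 1):
        distance += math.dist(points[i - 1], points[i])
        radius = circumradius(points[i - 1], points[i], points[i + 1])
        if radius <= max_radius_m:
            corners.append({"index": i, "distance_m": distance, "radius_m": radius})
    return corners


# Hypothetical route in metres: a 100 m straight, a hairpin, then back again
route_xy = [(0, 0), (50, 0), (100, 0), (105, 5), (100, 10), (50, 10), (0, 10)]
corners = tight_corners(route_xy)
for corner in corners:
    print(corner)
```

The index and distance into route reported for each corner provide exactly the sort of reference into the stage that makes it easier to talk about a particular location.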

Finding tricky sections of a stage may give previews something to talk about in advance, and may also prime a journalist to watch out for things happening at that location. For rally fans, finding accessible and potentially interesting viewpoints in advance of going on to a stage is often essential…

In exploring route metrics for the purposes of identifying points of interest for fans, or for journalists to pick up on in their stage previews, a couple of observations jumped to my attention: firstly, crews describe stages at length in their pace notes, so by looking at how crews describe a route, I could try to automate the detection of some of those features through a static analysis of a stage route in order to draw attention to them; secondly, stage route planners on the lookout for new stages need to be able to identify routes with certain properties, and by looking at what they do I might learn about features they need to be on the lookout for (safety regulations may place certain restrictions that rule out particular routes, for example).

In turn, stage planners may learn from journalists what sorts of features journalists like writing about, which can be useful when trying to gain column inches. (Rally reporting in UK national newspapers is all but non-existent…)

On-Event Practices

Whilst an event is running, being able to interpret results quickly, as well as looking for story features, is important for journalists wanting to: a) report a result; and b) mention something interesting about how it came to be.

My own interests in sensemaking around rally events focus on the production of artefacts that can be used to summarise events and support the identification of storypoints within them, as well as the automated discovery and communication of "points of interest" that may be associated with the staging of the event as well as its unfolding and post-climax review.

These artefacts and points-of-interest might feed in to sensemaking around an event from a variety of audiences: fans, journalists, and even competitors. So what processes of sensemaking are already evident within these groups and how are they communicated?

The Sports Commentator or Commentary Team

For many live event sports audiences, access to live broadcast sporting events is mediated by a sports commentator or commentary team.

According to Commentary as a Substitute for Action, Comisky, P., Bryant, J. & Zillmann, D., 1977, Journal of Communication, 27(3), pp. 150–153, "viewers in the stadium perceive the event as is, [but] home viewers are exposed to a “media event” that is the product of a team of professional gatekeepers and embellishers". In one sense, "the director is certainly an editor of the game" with their ability to "choos[e] from among the several close-ups, long shots, replays, cutaways and various segments of action at his disposal". But equally important, if not more so, is "yet another crew, the sportscasters, [who are] in charge of embellishing the drama of the affair, thereby making it more palatable to the action-hungry audience".

In stadium sports at least, Comisky et al. suggest that "[i]t is commonly assumed that sports commentary serves to compensate for imperfections of the visual modality of the medium in creating the live game: much sports commentary indeed serves this function". For example, for anyone attending a motorsport event, it is not uncommon to not see accidents that happen literally right in front of you because your gaze continually flits across a 180 degree (or more) field of view. Events can also happen so quickly that it can be hard to comprehend what actually happened. A view from a more remote distance can often provide a better perspective.

That a commentary can add value even for an audience physically present at the live sports event is recognised by Comisky et al.: "the announcer’s description and analysis are often so useful that there are numerous accounts of fans in the stadium monitoring broadcast transmissions to verify that what they had just seen was really what they thought they saw".

But what of the case where the commentator only has access to the same view of the event, the same video feed as the remote audience? In that case too, the wider knowledge of the commentator may help interpret what just happened, is happening, or looks like it is about to happen.

As well as adding clarification, context, insight and explanation, the commentator may also play a more significant role in how the sporting event is perceived:

> The role of the contemporary sports commentator has expanded to include the responsibility of dramatizing the event, of creating suspense, sustaining tension, and enabling the viewers to feel that they have participated in an important and fiercely contested event the fate of which was determined only in the climactic closing seconds of play.

I will review the role of the commentator in creating the dramatic in the fifth post in this series.

When it comes to looking a little more deeply at the verbal craft of the commentator, one well cited reference is Sports Announcer Talk: Syntactic Aspects of Register Variation, Ferguson, C., Language in Society, 12(2), 153-172, 1983. The paper focusses on an analysis of radio sports commentary, but some of the insights are more general than that.

According to Ferguson:

> [radio] sportscasting is a monolog or dialog-on-stage directed at an unknown, unseen, heterogeneous mass audience who voluntarily choose to listen, do not see the activity being reported, and provide no feedback to the speaker. This location differentiates the register from such other related varieties as television sportscasting, the reporting of a game in progress to a blind friend, or the patter of the announcer at a circus, all of which would be included in the first approximation.

The notion of register comes from the field of sociolinguistics and refers to a "language variety viewed with respect to its context of use" (p.4 in the introduction to Sociolinguistic perspectives on register, Biber, D. and Finegan, E. (eds.), 1994).

As Ferguson describes it in his chapter of the same book, Dialect, register, and genre: Working assumptions about conventionalization, a foundational principle associated with the notion of conventionalisation of register is that:

> [a] communication situation that occurs regularly in a society (in terms of participants, setting, communicative functions, and so forth) will tend over time to develop identifying markers of language structure and language use, different from the language of other communication situations.

For Ferguson, "[a]s a first approximation, sportscasting is the oral reporting of an ongoing activity, combined with provision of background information and interpretation".

> This location differentiates the presumed register from such other related varieties as the oral reporting of completed activities or the written reporting of either ongoing or completed activities.

Note that Ferguson was writing in 1983, many years before the advent of live, online text based reporting such as live blogs, which I shall review in a later section.

Specialist knowledge may also be assumed amongst the audience in the way that specialist vocabulary may be used without comment or clarification, although such specialist vocabulary does also invite the opportunity for explainers as part of the commentary:

> The radio announcer uses the technical jargon of the activity being reported, including numerous idioms and slang terms suitable for informal conversation; he also interprets events in terms of an established set of values about what constitutes good playing, moments of risk, significant points of heightened competition, players’ career goals, and the like.

That there is a distinction between live reporting and other forms of news reporting is made clear, as is the rationale for identifying live commentary as distinct from other forms of reporting:

> the very considerable amount of discourse analysis of narration is primarily focused on reporting after the fact (or fictive narration in which the events never took place) rather than on reporting events taking place at the time of the discourse, and, accordingly, the various schemata, story grammars, and the like available in the research literature are not helpful here. The two phases of discourse in this register, the announcing and the commentary, are characterized by somewhat different linguistic features and are even recognized in the folk taxonomy, as the "play-by-play" and the "color commentary."

The distinction between "play-by-play" (announcing) and "color commentary" (commentary) seems to appear in a lot of the literature, and as such would appear to be a useful one to bear in mind when considering what sorts of communication might appear in a sports commentary or live blog.

Ferguson remarks that the audience is presumably interested not just in listening to live announcements and general commentary when tuning in to a sportscast associated with a live event, but also updates:

> The speaker assumes that members of the audience will want periodic updates on the course of the game, whether because they have just tuned in, are listening to the radio in addition to doing other things, or have simply lost track of the score or the place in the game.

Picking apart the distinction between announcing and commentary further, we learn that if there is a commentary team rather than just a single commentator, different roles may emerge:

> The allocation of the two phases (announcing and commentary) is related to the choice between monolog and dialog: if there is only one announcer he does both, but if there are two, one typically does the announcing and the other the commentary, with interesting boundary phenomena between the two roles and the two kinds of talk.

Whether a monolog or a dialog, silence may also play an important role in the commentary, although in radio at least it is often avoided:

> Incidentally, the lack of listener feedback may have been the original source of the avoidance of silence in sportscasting and other kinds of radio talk. The announcer’s maximum stretch of silence is very short and the time between moves in the reported activity must be filled with speech, the counterpart of ordinary conversation where the addressee is expected to emit signals of attention at frequent intervals. The dread of silence seems less severe in some other kinds of broadcasting, and in sportscasting it is less in some other countries.

It might be interesting to identify what might count as "silence" in an online live text medium, such as a live blog setting: what period of time needs to elapse without any journalistic contribution for the audience to perceive the live blog as having "finished" or stalled? (Many live blogs end with a farewell notice that explicitly signals the live blogging activity has come to an end.)
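
If the timestamps of the posts in a live blog are available, one simple way of starting to explore this question is to look for gaps between consecutive posts that exceed some threshold. The following is a minimal sketch, not a worked analysis: the ten minute threshold is an arbitrary assumption (what actually counts as "silence" is exactly the open question), and the timestamps are made up.

```python
from datetime import datetime, timedelta


def silences(timestamps, threshold=timedelta(minutes=10)):
    """Find gaps between consecutive posts that are at least `threshold` long.

    timestamps: iterable of datetime objects, one per live blog post.
    Returns a list of (start_of_gap, gap_duration) pairs in time order.
    """
    ordered = sorted(timestamps)
    return [
        (earlier, later - earlier)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier >= threshold
    ]


# Hypothetical post times from a live blog
posts = [
    datetime(2021, 2, 27, 12, 0),
    datetime(2021, 2, 27, 12, 3),
    datetime(2021, 2, 27, 12, 20),
    datetime(2021, 2, 27, 12, 24),
]

gaps = silences(posts)
for start, duration in gaps:
    print(f"No posts for {duration} after {start:%H:%M}")
```

Comparing the distribution of such gaps across live blogs, or against the timing of events in the sport being covered, might give a first indication of what readers are likely to experience as a stall.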

Ferguson goes on to identify the role of the dramatic in sports commentary:

> As a third contribution to the location, sportscasting is a variety of discourse in which the level of arousal or excitement varies significantly during the discourse, and the course of this level, as well as other features of the variety is determined by quite specific bodies of knowledge and values assumed to be shared by speaker and addressees.

As suggested previously, I will explore the role of drama in storytelling around sporting events in the fifth part of this series.

Live Studio Based Commentary Webcasts

One of the key features of television style commentaries is that a live video feed provides real time visual footage of the sporting event. However, another televisual mode is possible where live video footage is not available, for example due to rights issues.

An example of such a production is described in More than a hashtag: Producers’ and users’ co-creation of a loving “we” in a second screen TV sports production, Kroon, Å., 2017, Television & New Media, 18(7), pp.670-688, in the form of a webcast, PVTM, produced and distributed by the Swedish evening newspaper Expressen, which "provides an alternative to the live broadcasts of the matches on regular TV with one major communicative obstacle: Expressen do not have the rights to show any audiovisual material from the matches. The setup must therefore try and attract audiences without the main component, the live match".

In the run up to the sporting event, everything is as you might expect: "[t]he pre-talk setup looks like any other broadcast equivalent where upcoming matches are discussed between the host and a guest":

> The audience is oriented to as an ordinary viewer of a football game who is positioned in a quasi-interactional way (Thompson 1995) as in most broadcast talk. They are recognized with a “welcome” and looks-to-camera from a conventional-looking studio but are not integrated as an active part of the interaction.

However, when the live sporting event begins the producers "fundamentally break with conventions during match commentary … when it comes to visual orientation in relation to the audience".

Rather than presenting a live video stream of the event, a view is presented of the host, a resident expert and a pundit, each with their own laptop on which they can see the live event footage, even though the audience can’t.

In many sports, some audience members may prefer to listen to radio commentary whilst watching a live video stream with the sound turned off, although at times there may be a mismatch between the video focus and the audio commentary focus. The PVTM model potentially replaces that mode of event consumption with one where the visual-free radio commentary is replaced by a second screen view, with its own audio commentary, to complement the live feed. (This in turn suggests another model, where a radio commentary makes available a visual second screen feed that perhaps includes live blog / infographic style static visual updates.)

A social media side channel on Twitter provides a means for audience members to communicate with each other and with the commentary team via a particular hashtag. Audience members are invited "to help find particular statistics or additional information relating to the ongoing game commentary". Appropriately tagged social media messages may then be referred to by the commentary team "in discourse and orient to users as co-commentators with equal authority to predict match events".

As the same audience members engage over repeated sessions, they are "often recognized as individuals with names by the studio participants, and [as] recurring contributors are welcomed back with affection".


> [s]econd screening, as it is practiced in this case, is distinctly sociable in character, and has little to do with instrumental technological usage (e.g., finding facts). The producers work at showing that they not only include users but treat them as co-fans, co-commentators, and close friends, and, by doing so, bridge the communicative distance which characterizes the quasi-interaction of ordinary broadcasting

As well as radio and visual (video) media, the internet also provides a medium for the transmission of textual content and social activity mediated by the written word, shared object, and shared link. In the next two sections, I will review the evolution of "live text broadcasts" and how they might interact with the social.

"Online Live Text Commentary"

It is easy to forget that before the year 2000, access to "the internet" was still far from the near universal status we often assume it to have today, notwithstanding the "digital divide" that still exists when accessing digital services and promoting digital reach.

A 2004 paper, Technological evolution or revolution? Sport online live internet commentary as postmodern cultural form, Sandvoss, C., 2004, Convergence, 10(3), pp.39-54, takes us back to the emergence of the form that was "online live text commentary", developed by "public service and private broadcasters, national and international sporting bodies and text specialist sports websites", in which "the game event is reported in short, telegram-style sentences, summarising the key action, on occasion supported by statistics summarising the mathematically verifiable aspects of the event" and which provides "[a] minimal representation of the game event through facts and figures [that] moves beyond the visual spectacle that has coined the televisual representation of sports, and thus requires modes of readership based on fan identification".

The availability of online text commentary allowed "[t]hose with the most intense interest in a particular object of fandom [to] shift their attention from mass media such as radio and television to niche media catering for their particular fan interest".

In terms of style, "[o]nline sports commentary draw[s] on the stylistic conventions of the presentation of sports results in newspapers and specialist sports papers" although a difference "between online commentary and the reporting of sports results in the press of course lies in newspapers’ necessarily retrospective position in comparison to the rolling nature of online live coverage". Sandvoss also notes that "[a] more direct predecessor of online sports live commentary then can be found in the use of teletext as a popular medium of sports information including live events".

In terms of how the live blogging approach (next section) has evolved in recent years, it is interesting to note how Sandvoss saw the veracity of online sports commentary assert itself in 2004:

> To grant liveness in the absence of sound and vision the representation of live sporting events online takes place in minimal textual form reducing the event to pure, verifiable information. … In sports where the flow of the game can be even more accurately summarised in statistical terms, such as baseball, written language takes a back seat to tables and figures

As we shall see, in the live blog view of the world, uncertainty and even incorrectness is acceptable.

As to the "liveness" of the commentary:

> online text commentary, which can incur [long] delays in reporting the game event, is experienced as ’live’ as long the reader has no access to alternative, faster means of communication in his or her given space of consumption, either because the particular sporting event is not carried on television or radio within the territorial space of consumption, or because the internet is the only accessible medium in the given social space of sports consumption.

At this point then, online text commentary is seen as an alternative to radio or television live broadcasts. But how has online live text commentary evolved since those early days?

Live Blogging

To a certain extent, the rise of online and social media has broadly expanded the horizons even further for print (i.e. text-based) journalists. Live reporting is no longer limited to phoned in half time match reports or live broadcast commentaries, either televisual or radio based.

From the earliest manifestations of online text commentary, online models have continued to evolve and real-time sports reporting now draws heavily on the live blog. As with the "live studio commentary webcast", live blogs cannot guarantee that the viewer has access to live video footage of the covered event, although many viewers may have access to such a video stream.

As Simon McEnnis describes in Following the Action: How live bloggers are reimagining the professional ideology of sports journalism, Journalism Practice, 10:8, 967-982, DOI: 10.1080/17512786.2015.1068130, 2016, "[l]ive blogs were initially devised for an audience without access to the live sports event as a way of providing live updates relating to an ongoing event". More recently, "content is now accessible across first-screen (television), second-screen (desktop and laptop computing) and third-screen (mobile phones and tablets) platforms", which may give the live blog a second screen role to complement actual live footage and other forms of live commentary.

According to The epistemology of live blogging, Matheson, D. and Wahl-Jorgensen, K., 2020, New Media & Society, 22(2), pp.300-316, live blogs comprise "brief posts in reverse chronological order that may include a number of elements, from statements of news to comments curated from social media to authorial observations". This results in a "[fragmented structure], relaying information as it becomes available, rather than presenting a neatly organised news story" that "makes for an open text" whose "contingent, temporary and fragmentary coverage … is assembled somewhere between the blog editor and the reader". Importantly, the live blog "gains its coherence partly from what lies outside the text, to which it stands as a response".

Matheson & Wahl-Jorgensen further stress that "whereas the conventional news story – particularly in its print and broadcast formats – represents a finished product, the live blog emphasises the news story as an ever-evolving, incomplete process". They also note that:

> [o]ntologically, the text aligns itself with the changing character of the event, rather than standing outside it as a report. The blog has no set duration, limit to the number of posts or restriction to posting just one kind of content.

Further differences between live blogs and the perhaps more familiar traditional news article are explored in Seven characteristics defining online news formats: Towards a typology of online news and live blogs, Thorsen, E. and Jackson, D., 2018, Digital Journalism, 6(7), pp.847-868. They suggest that "[l]ive blogs are distinguished from other online news by a number of characteristics":

> The first is temporal: they are a type of news specifically designed for live and unfolding events and their format needs to reflect this liveness. Accordingly, the live blog is characterised by a series of timestamped, short updates that represent the latest development of the live event or emerging story. In most cases, these short updates convey the fluid, incomplete and unpredictable nature of the story
>
> The second distinguishing feature of live blogs is their tone of voice, which is often playful, light, personal and informal, acknowledging the presence of readers as participants,
>
> A third feature: interactivity. As they cover unfolding events, live bloggers often reach out to their audiences for eyewitness reports.
>
> Finally, live blogs are notable for their intertextuality and polyvocality
>
> Aside from multimedia, live blogs are furnished with other external materials such as direct and indirect versions of official announcements, elite quotes, reports and eyewitness accounts

Thorsen and Jackson identify live blogs as particularly relevant to politics (e.g. elections), crises and sport with their ability to act as a "key platform for breaking news and following episodic events, or to keep audiences updated on a general subject theme, … feed[ing] off and into [an] unfolding story as it happens across a range of media". I will explore some of the similarities in terms of sensemaking behaviours between crises and live sporting events in a later post.

In his analysis of interviews with several live sports bloggers, McEnnis focused on several aspects of professional ideology including objectivity/subjectivity, immediacy, public service and editorial autonomy associated with live blogs. A blend of subjective and objective reporting was recognised, and immediacy identified as key, with "the need to update the live blog regularly within the space of minutes otherwise the audience would lose interest", setting the expectation that "the live blogger is in a state of constant production".

An emergent topic related to accuracy and correctness within a live blog, with McEnnis observing remarks to the effect that:

> the audience was accepting of mistakes in a way that would not apply to traditional media because they understood the intensive demands placed on the live blogger. The constant adding of information to the live blog also created transience where mistakes quickly belong to the past although participants would demonstrate transparency and openness in these instances. … However, it should be emphasised that participants still valued accuracy as important to their professional practice.

How then does this relate to what Matheson and Wahl-Jorgensen identify as "the epistemic authority of journalism – or its power to ‘define, describe and explain bounded domains of reality’ (Gieryn, 1999: 1)" which relies on "its claims to provide a truthful account of reality"?

The Epistemology of the Live Text Commentary and the Blog

Part of my focus for this series of posts is the way in which audiences come to make sense of what is happening during an extended live sporting event, and how journalists support them in this sensemaking task.

On the one hand, one way the journalists might contribute is by being a provider of trusted information, so how might that manifest itself in a live blogging context, particularly in situations where the audience has access to the same information as the live blogger in terms of access to the same live event feed?

In terms of trying to make sense of how mediated sports may help audiences make sense of a sporting event, Epistemologies of TV journalism: A theoretical framework, Ekström, M., 2002, Journalism, 3(3), pp.259-282, provides a framework that in part seeks to explain the epistemology of television journalism in a very particular way, contrasting the philosophical inquiry sense of "‘epistemology’ [as] theories of the nature of knowledge and of the possibilities and the principal foundations of truth in science" with the sociological sense of epistemology as:

> the study of knowledge producing practices [in the form of] the rules, routines and institutionalized procedures that operate within a social setting and decide the form of the knowledge produced and the knowledge claims expressed (or implied)[, as well as] the question of how these claims are justified, both within the organizations and vis-a-vis the public and other social institutions.

Typically, "the epistemology of news-reporting includes strategies for dealing with potential problems relating to truth". Ideally, "[t]ruth is largely reduced to a matter of the accuracy of individual facts, …, that quotes are accurate, etc." although as we have seen this may be difficult to guarantee in a live blog setting. "Meanwhile", Ekström notes:

> [news] reporters seldom have time to do their own investigations or reflect on the reliability of various pieces of information. Nor is this expected of them. Instead, the reporter makes use of an established network of sources who deliver information that is assumed, a priori, to be justified.

In the live blog context, these sources may be news reports or updates from other "pre-justified knowledge" news sources as well as from live data feeds, such as live timing or results feeds, or items retrieved from a sports results database. Many sports commentary teams have access to a statistician well versed in the history of the sport and with access to historical databases. Such specialist commentary may also be provided via social channels.

Of particular relevance to helping us understand what might make for a trusted live sporting event commentary of whatever form, are two of Ekström’s considerations:

> the production of knowledge (What rules, routines, institutionalized procedures and systems of classification guide the production of knowledge and how do journalists decide what is sufficiently true and authoritative?); and
>
> public acceptance of knowledge claims (What conditions are decisive for the public’s acceptance or rejection of the knowledge claims of television journalism?)

We have already seen how sports commentators may influence and shape, to a very significant extent, an audience’s perception of a sporting event. So in what sense are the actions of a commentator associated with communicating "truth" and to what extent do commentators operate within a journalistic context that aims to communicate "truth"? As Ekström puts it:

> The legitimacy of journalism is intimately bound up with claims to knowledge and truth. It is thanks to its claim of being able to offer the citizenry important and reliable knowledge that journalism justifies its position as a constitutive institution in a democratic society. Knowledge claims are justified and legitimated within the framework of epistemologies.

Having sound processes and an understanding of how to manage a "flow" or "stream" of content is also important. Ekström again:

> Another important ingredient in the epistemology of news-reporting is the set of discursive techniques that the experienced news journalist has learned to use in constructing texts, all of which are designed to underline the objectivity and formally neutral position of journalism.

He goes on to give examples of several such techniques:

> Letting two people carry on a dialogue in a news item without evaluating either of them; avoiding the first person as a grammatical form in voice over and news interviews; using quotes to shift the responsibility for the truth onto someone else; these are but a few examples

before noting that "[t]he techniques are institutionalized and they are applied to deal with expectations of the news as a specific form of neutral knowledge".

Coping With Partial Information

Writing back in 2004 on the still nascent form of online live text commentary, Sandvoss observed:

> the rise of online live text commentary, then, does not mark a return to a pre-electronic age in which the written word regains the power of description and imagination, but instead of a post-visual age in which experience has largely been subsumed within the category of vision, which in its omnipresence has itself become superfluous.

For Sandvoss, it appears that because the audience member is familiar with the visual from past experience:

> online live text commentary thus constitutes the ultimate rationalisation of vision and images which have become internalised to the degree that they are no longer required in their external, mediated shape.

Fans may still prefer to watch the live footage itself ("given the choice, most of us will still prefer the moving image whether displayed on a television or a computer screen to the minimal representation of the event in hypertext letters") but notwithstanding that:

> the very occurrence of such post-visual fan consumption in online live coverage indicates the degree to which spectator sports has merged with its technologies of representation and distribution.
>
> …
>
> The particular mode of reading required in the consumption of online live text commentary is necessarily one of fandom and fan identification as only the reader sufficiently familiarised with its context and ultimately focused on the event’s result will be able to construct a meaningful and potentially enjoyable experience from its semiotically minimal representation.

One particular concern Sandvoss explored was the "partial" and incomplete nature of the live online text commentary, drawing on the work of two other theorists, Marshall McLuhan and Wolfgang Iser.

> According to McLuhan, the less clearly-defined the message, the easier the consumer can make this message his or her own in a process of learning and appropriation.

This allows the reader to engage in creating their own understanding of the current state of the event from their own personal knowledge and expectations, and from the partial information provided to them. In social media settings, the potentially incomplete picture presumably also invites the reader to contribute their own opinions and insights into the social channel.

The potential "multiplicity of meaning" that can be derived from the partial news feed ("what McLuhan calls ’low definition’ broadly equals [twenty years on] the textual condition we have later [2004] come to describe as ’polysemy’") has been:

> identified in fan and audience studies, often drawing on de Certeau’s notion of textual poaching, as the very basis of media fandom as it allows media consumers to appropriate the standardised and mass mediated popular text to their own needs, desires and understanding of self, and has hence even been ascribed an empowering and emancipatory potential.

Indeed, it seems that less may be more:

> The more minimal the semiotic content of any given mediated representation, the more polysemic it becomes. Fandom thus constitutes a practice dependent on audience activity — in Jenkins’s terms, it, ’celebrates not exceptional texts but rather exceptional readings’.

And here Sandvoss perhaps foreshadows the social contributions that help drive many of today’s live blogs:

> Precisely this emphasis on the audience’s contribution in the construction of meaning is heightened in the reading of online live text commentary. The facts and statistics presented in most condensed form are in themselves close to the pure, non-negotiable information that Eco describes in his discussion of the limits of interpretation.

For example, if you are following a live event, "it is hard to imagine more than one reading of information such as ’Team A scores’ or ’Player X is sent off’". However:

> these snippets of information still carry the implicit claim of the representation of a sports event, and only create meaning when set in relation to this absent event. The polysemic momentum of internet live text commentary thus lies in its textual absences and omissions. To give meaning to this representation the reader is required to fill the gaps left by the skeleton of minimal information with the flesh of his or her own imagination.

Thus did Sandvoss channel McLuhan, before turning to Iser with similar intent in his attempt to reconcile the partial fragmentary nature of the live text commentary with the reader’s construction of their understanding of the state and dynamics of the live event:

> According to [Wolfgang Iser’s theory of aesthetic response] the ability of readers to construct different meanings derives from gaps or blanks within texts, which require the reader to participate in the construction of meaning.

which builds on the understanding we take from the McLuhan-style treatment, before developing the idea in a related direction:

> How this takes place, however, is a question of the semiotic quality and richness of any given text. The greater the descriptive density of the text – which Iser describes as a multiplicity of schematised views – the more any given text will change the reader’s preconceptions and force a reflexive engagement with the text in order for it to be understood and eventually ’normalised’.

Again, less may be more:

> The lesser this density, as in the case of the minimal information provided to sports fans through online live commentary, the more readers have to rely on their own experiences in order to fill in such blanks. Thus the interaction between reader and text shifts from a dialogical to a self-reflective construction of meaning.

And as before, it is up to the fan to contribute their own understanding and experience from the past to construct the presumed reality of the present:

> The consumption of a given sporting event through its online representation can therefore only be verified before the background of his or her own experience, or what Jauss has described as the congruence of reader’s horizons of experience and expectation with the text in entertainment art.

The sense, indeed, the meaningfulness, of the event is further constructed by the fan for themselves:

> It is hence not the sporting event in and for itself that matters, but only the relationship of the event’s outcome to the reader’s position to the text, in other words the reader’s identification with one of the teams or athletes involved in the event as an object of fandom, coinciding with the self-reflective nature of sports fandom.

And furthermore:

> In the absence of the actual event and its visual representation, the user’s mode of reading is thus one of fanship and pure, if elliptical identification that cannot be compared to the universal address of the visual, ’hot’ spectacle of televised sport.

And if we add a social feedback loop, then the potential arises for individual fans to feed into the co-creation of the event as understood not by the expert commentator, but by an informed audience who can adopt the appropriate register when engaging in the social channel. And whilst fans may identify with one competitor or another, they may also be expert enough to be able to put personal preferences aside when contributing or acknowledging "pure" facts, as well as informed critique, in a channel populated by individuals with competing affiliations or affinities.

Temporality and Errors

Three particular issues come to mind relating to potential errors in live text commentary, and more recently, live blogs.

Firstly, journalists (like educators) are typically not comfortable with relaying incorrect information or being seen to be wrong (pundits, on the other hand, may relish it!). Secondly, what is the effect on the audience if they realise something is incorrect, or if they are informed that something previously shared was incorrect, and how do they update their own understanding in the presence of potentially conflicting, or even incorrect, information? Thirdly, how are errors corrected and understanding updated?

Matheson & Wahl-Jorgensen observed that "[i]n news flash updates in situations characterised by crisis and immediacy, Rom and Reich (2017: 14) found that journalists are willing to make substantially wider use of measures that lower their own voice, distance themselves from full responsibility for the published content, and minimize their knowledge claims".

The journalist thus appears to adopt an attitude of "best effort" in delivering timely updates even if the information is provisional and has not been as thoroughly checked as it ordinarily might be.

As Matheson & Wahl-Jorgensen put it:

> This suggests that the distinctive temporality of immediacy-oriented journalistic genres has significant consequences for the ways in which these genres present truth claims, and thereby also shape journalistic authority and voice.

In particular:

> [l]iveness produces a claim to reference the real, even if challenging the narrative textual apparatus of news journalism, by referring to a reality that is distinct from the incomplete telling.

In a typical news article:

> [n]arrative structure organises the temporality of an account, both in the sense of ordering elements in a sequence, constructing a causative structure (so that one moment arises out of preceding ones) and aligning the audience with the narrator’s telling of the story.

However, in a live blog, "fragmentation of story elements means that":

> the blog’s account of time becomes fragmented as well, leading at times to a text with multiple overlapping temporalities. This distinguishes the temporal nature of the blog’s constitution of knowledge compared to the ordered account of time characterising conventional news narratives.


> the distance of live blogging from other news textual practices means the live news blog produces distinctive journalistic claims to knowledge, particularly when it comes to claiming to know what has just happened. In live blogs, updates may contradict previous posts and a clear overall narrative of the news event becomes less likely.

Furthermore, the "live blog genre tends to place the locus of news as outside the moment of the blog text itself, in the network of texts".

This results from the form of the live blog output:

> structurally, [the live blogging process] results in texts that are frequently made up of layers, in which authority is left in originating texts, without the blog authoring them. Those originating texts must therefore have some authority.

Importantly for Matheson & Wahl-Jorgensen "[t]his dynamic form accommodates journalists making mistakes or giving partial accounts, correcting and updating."

For Ekström, an essential characteristic of journalism "is its claim to present, on a regular basis, reliable, neutral and current factual information that is important and valuable for the citizens in a democracy".

> Regularity, reliability, neutrality, currency and value are key concepts. This is not to say that all journalism lives up to these claims. In the present context I am not at all interested in that question. What I am saying is that these knowledge claims are sine qua non to the institution of journalism (at least in western democracies). And the question is: Under what conditions can these claims be accepted by the public?

Sports journalism may not be "hard news", but the trust that is developed by continually and repeatedly providing a good quality live sports blog feed presumably helps contribute to the trust levels associated with it. Again:

> Institutions that produce and communicate statements about reality on a regular basis (schools, the scientific community, applied research, advertising, courts of law, news journalism) legitimate their knowledge claims within the framework of partly distinguishing social structures and via sets of specific mechanisms.

However, news organisations must earn the trust associated with them:

> The knowledge claims of journalism are not primarily legitimated through official declarations, policy documents or other metadiscourses but through concrete texts/text constructions. > > Ultimately, it is by communicating within the framework of established genres, making use of a set of discursive and rhetorical techniques, that one can persuade the public that the news stories are neutral accounts, that the facts are facts, that reportage is truthful, that the experts are reliable, that investigative journalism is important, etc.

The institutional backing of the live blogger, and the professional processes and context within which they are operating, and the accountability to which they can be held by managing editors and audience alike also have a bearing on the trust placed in the information being shared.

Matheson & Wahl-Jorgensen again: "[we understand] the epistemology of journalism more broadly in terms of ‘the rules, routines and institutionalized procedures that operate within a social setting and decide the form of the knowledge produced and the knowledge claims expressed (or implied)’ (Ekström, 2002: 260, emphasis in original)".

It seems that if the platform and its process are trusted, trust in what is shared is also developed, even if the shared items are community sourced:

> audience and journalistic frames interact and compete in live blogs, wherein a space of potential co-production, journalists still reframe amateur contributions by appending their own frames onto them.

Not quite curation

At one level, it may seem as if the live blogger is acting as a curator of found and suggested resources. Matheson & Wahl-Jorgensen suggest a slightly different perspective. Firstly, they note that "[the] layering practice [of content inclusion] operates according to the logic of passing on: it is not curation on the analogy of an art curator telling a story about a body of work".

Notably, the authorial voice may be elusive. In the live blogs studied by Matheson & Wahl-Jorgensen, "there was minimal voicing, in the sense of either a personal or authoritative narrator, suggesting this is not a required element in the genre". Rather:

> The authorial dimension is better described as a human presence, a sense that the material has been posted by someone and is intended as of value for blog readers.

Even so, there is some semblance of the unvoiced author embodying journalistic ideals. In particular, Matheson and Wahl-Jorgensen introduce the notion of networked balance:

> In the live blog, the journalist moves away from the objectivity of conventional news stories to what we refer to as ‘networked balance’, …, a practice where balance is sought amongst the available perspectives on the news event and distance is maintained through the [layered] practice of offering.

What other processes might breed trust in this form of journalism, and how might they relate to live blogging around sports events? According to Ekström:

> A central idea within the sociology of knowledge is that social practices include and reproduce classifications of reality. Social practices are classification activities. Individuals orient themselves in the world around them by means of collective, deeply rooted, but not immutable distinctions.

It is important to note that "[c]lassifications are never neutral". Rather:

> They are normative; they include assumptions about what is good and bad and how people should act in different situations. Classifications have consequences; they make a difference. Classifications serve various social functions: they allow people to meet and master concrete situations; and they reproduce and legitimize the social order (Douglas, 1986).

In the live blog tradition, where the live blogger assembles a range of resources from potentially diverse sources, Ekström would presumably see classification by a trusted agent of a trusted organisation as playing an important role:

> As a knowledge-producing institution, journalism bears a dual relationship to such classification activities. First, journalism actively contributes to producing, reproducing and naturalizing collective conceptions of reality. Second, journalistic work is based on classifications that serve more or less as tacit points of departure for the production of knowledge.

So for Ekström, what sorts of classification might be involved?

> The epistemologies of journalism include many different (constant or variable) classifications. For example, there is the classification of news sources, whereby what some sources say needs verification but others not.
>
> Another example is the classification of simplifications and dramatizations in headlines and presentations; some are acceptable, others not.
>
> Yet another classification determines who may be referred to as an ‘expert’.

To a certain extent, live blogging can be seen as curation and embedded resharing of "found" resources, as well as insertion of templated "points of interest" into the live blog stream. As the form has evolved, Matheson & Wahl-Jorgensen suggest that live bloggers now "need to be able to multi-task because of the multimedia nature of live blogs. Participants described their sources as being snippets of television and radio commentary, tweets, reader e-mails and texts, YouTube clips, statistics and photographs". It is also a creative process where editorial autonomy is essential:

> [t]he autonomy afforded to live blogging beyond stylistic issues was considered to have emerged from a time when news executives’ focus was still on traditional platforms such as newspaper and television. Participants perceived live blogging to be agency led. They pointed to minor influence from senior managers providing the core live blogging ideologies of providing regular updates and a multimedia and interactive experience were met.

Notwithstanding Thorsen & Jackson’s observation that "[s]ports live blogs are distinct from all other genres and formats, with a narrative constructed around a conversational vernacular that also uniquely includes a large proportion of audience contributions as commentary within the main text", are audience contributions a necessary component of a live blog associated with a sporting event?

Whilst live blogs certainly can include found resources, I think it is arguable whether a blog must support social commentary or embed found social objects or contributions, or whether the earlier, simpler approach to online live commentary is still viable. For example, in a live rally blog, a graphic showing tyre selection amongst the current front runners, or the gaps between the first and second and second and third placed drivers, can be: a) predicted as a point of interest that can be dropped into a feed at an opportune moment; b) generated from current data on demand.
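As a rough illustration of the "generated from current data on demand" idea, here is a minimal Python sketch that turns a table of accumulated rally times into "gap to the car ahead" lines of the sort that might be dropped into a feed. The function name, the driver codes and the times are all made up for the example; a real feed would presumably pull the times from a live timing source.

```python
def gaps_to_car_ahead(overall_times):
    """Return (position, driver, gap_to_car_ahead_in_seconds) tuples.

    overall_times maps a driver code to an accumulated rally time in seconds.
    This is a hypothetical helper, not part of any timing API.
    """
    # Rank drivers by accumulated time, fastest first
    ranked = sorted(overall_times.items(), key=lambda item: item[1])
    rows = []
    for position, (driver, total) in enumerate(ranked, start=1):
        # The leader has no car ahead, so their gap is reported as 0.0
        if position == 1:
            gap = 0.0
        else:
            gap = round(total - ranked[position - 2][1], 1)
        rows.append((position, driver, gap))
    return rows


# Made-up accumulated times (seconds), for illustration only
overall = {"AAA": 3605.2, "BBB": 3601.0, "CCC": 3612.9}

for position, driver, gap in gaps_to_car_ahead(overall):
    print(f"P{position} {driver} +{gap:.1f}s")
```

The same rows could just as easily feed a small chart as a line of text; the point is only that this sort of "point of interest" can be templated in advance and filled in when needed.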

Rather than necessarily incorporating a conversational component in the form of embedding found social objects and/or acknowledging recent social comment, a live blog could just be a live news feed in the sense of earlier live online text commentaries, publishing author originated content and on demand author created/embedded objects; however, the apparent lack of a two way conversational channel may make the audience less accepting of both errors and subjectivity.

For some audiences, however, the more "factual", fewer voices feed may be more valuable.

The Social Aspect of Live Blogging

We have already seen how live blogs may in part be constructed around socially sourced objects or commentary.

As Thorsen and Jackson remarked on participatory communities around live blogs, "sport was the one genre where demotic ["everyday", vernacular] voices were allowed into live blogs as commentators and (amateur) pundits, allowing some citizens the ability to offer normative frames to the unfolding events". They further added:

> [s]ports blogs recreate a distinctive spectator experience, and incorporates episodic publics into this space. In doing so it seeks to replicate the type of vernacular interaction audiences would experience when attending a sporting event (including build-up banter, anticipation, commentary of the event, and emotive post-event analysis). We define this episodic public as “spectator-inclusion”, which has a highly community-driven component.

For Matheson and Wahl-Jorgensen, "live blogs can be characterised as open communication, helping the reader follow developing news and multiple threads of information and discussion". In terms of construction, they suggest "[t]he narrative structure of live blogs is characterised by the fragmentation of the text and its lack of textual coherence, which serve to open the text but also weaken its claims to knowledge. The overlapping temporality of the live blog allows it to represent moments in time". But perhaps more importantly:

> [The live blog] challenges power relations constituted by conventional journalistic texts, allowing for layered texts made up of multiple voices. It therefore holds the potential to destabilise the ways in which power relations structure the constitution of knowledge. The role of the journalist is decentred as a curator; the text circulates rather than performs the news; and the boundary between front stage and back stage weakens as the authorial stance aligns more with the audience through the practice of networked balance and the construction of co-presence.

That a live blog may continue to be publicly available following the conclusion of the live event it was reporting on, albeit under the tacit understanding that it was produced as a live work, also colours how we might regard it after the event: "[a]bove all, the status of the live blog as a text differs from much news practice in enabling temporary, contingent sites of publicness to emerge, which may endure only for the time of the event being reported on".

Influencing the Event

Finally, in terms of power relations, it is worth noting that in some circumstances, journalists and commentators may even be seen as participating in the event itself.

Crews will be looking for certain sorts of information, so knowing what stories crews are looking for in the timing data, and how they find them, may also provide a useful lens for reporters looking for an interesting angle (a particular driver’s angle, for example).

Competitively, crews and teams may be looking at timing data with a view to developing strategy — or tactics… Knowing how the teams and crews read the data and the sorts of trends they are looking for can also provide journalists and fans alike with an opportunity to play along with the strategy game.

By summarising and reviewing information in a timely way, that information may be consumed by actual competitors in the midst of competing within the event.

One example of this is in live stage video feeds broadcast, with commentary, via the WRC+ subscription service. These feeds may be viewed by competitors who have not yet entered the stage. (Whether they watch the footage with the "expert" commentary muted, I don’t know!)

Another example might be competitors referring to timing summaries that present information in a different, perhaps value adding, way to the web-accessible official live timing screens.

A couple of chart types I picked up on by cribbing from, and in conversation with, @WRCStan focus on the idea of pace on a stage.

For example, the following chart shows a pace map that depicts differences in pace between a specified driver and each other driver, by stage, along with a colour highlighted comparison with another specified driver:

Pace map showing pace delta between a specified driver and each other driver, by stage, and a comparison with another particular driver highlighted.

A slightly different approach is to look at the accumulated time delta between a specified driver and the other drivers. The stage width sizing is proportional to stage distance, and so the gradient (i.e. the slope) in each stage gives an impression of pace difference.

Accumulated time delta and stage pace comparison for a rebased driver.

From Live Summary to News Article

It has been claimed (I forget where…) that "if journalism is the first draft of history, the live blog is the first draft of journalism".

So by what processes might live sports event news go from live blog to news article and what differences might there be between the two forms?

Thorsen and Jackson observed that:

> Live blogs are prone to using sources of information from other new media as the story unfolds, which are then replaced by the news organisation’s own source once the news article is written up. However, our evidence shows that this reliance on other media sources for facts in live blogs, does not translate to a similar dominance of remediation of elite political or elite sports sources in the same format – with live blogs instead sourcing directly from social media


> [T]he use of social media as a type of content within online news was format-dependant, not genre-dependant. That is, social media was used frequently across all live blogs, but infrequently or not at all in news articles.

Matheson & Wahl-Jorgensen also commented on the differences between live blogs and news articles:

> the live blog being less textually coherent than other kinds of news text and instead cohering largely in terms of the moments it signifies. A conventional news text coheres textually in terms of its intro or lead. What makes the paragraphs a coherent text rather than a list of fact claims is the centre-and-satellite logic governing them, whereby the opening sentence summarises the story in terms of the most newsworthy element and other paragraphs support and fill out its claims (White, 2000) – a narrative structure commonly referred to as the ‘inverted pyramid’ lead.

The following table, taken from Matheson & Wahl-Jorgensen, usefully summarises a typology of the epistemology of the live blog compared to the news story:

Taken without permission from “The epistemology of live blogging”, Matheson, D. and Wahl-Jorgensen, 2020.

In going from the live event, and live commentaries on the event, to more structured news articles, a certain narrative or dramatic structure may already have been imposed upon the event from the live coverage that can be hard to escape from. As Ekström notes:

> news journalists can be rather cavalier in constructing the news story and sensational events, without critically reflecting on how true or accurate the news story as a whole may be. There is a strong tendency to overlook the influence journalists exert over the meanings created when facts are incorporated in given text constructions

Post-Event Practices

Following an event, the full set of results is in and the resulting changes to championship standings are known. A more leisurely take on analysis, as well as whole event reviews, often appears in the few days following an event.

In some cases, it may be that a narrative initially imposed on the event has become contested, or even denied, and as such post-event analysis provides an opportunity to present a revised narrative against the original one.

What Next?

In the next post in this series I will consider some of the ways in which sensemaking and reporting around sporting events may be likened to sensemaking and news behaviours associated with handling crises.

It Starts With a Wondering: Hmm, How Would I Do That?

Via @MSportLtd on the twitterz:

Amazing stages, but this hasn’t been our weekend. Having analysed the data, our times are strong on the more technical sections and when the conditions are tricky we’re right up there – so that’s a positive to focus on

Which makes me wonder: hmm, how would I ascertain that?

From the linked blog post:

Showing their speed through the narrow, more technical sections, both crews drove well and delivered some competitive split times. But the high-speed sections that characterised much of the day didn’t fully suit the Fiesta.

When the conditions were at their trickiest, the Fiestas were able to demonstrate their pace and ability to challenge at the sharp end – Suninen setting the fourth fastest time despite a spin through a challenging second pass of Mustalampi (SS6).

So that’s something else to add to my ideas-to-play-with list: how would I: a) verify that, and b) detect it…?

PS here’s another, from @AnttiL_WRC…

#ArcticRallyFinland SS4 Kaihuavaara had a very fast section which @ElfynEvans took 4:20 to complete with average speed of 154 km/h (timed from onboards). For @TeemuSuninenRac it was 12 seconds slower. During the remaining twisty sections of the stage he only lost ~1 s to Evans.

Thinks: Symbolic Dynamics for Categorising Rally Stage Wiggliness?

Many years ago, I had the privilege of attending a month-long complex systems summer school organised by the Santa Fe Institute. One of the lecture series presented was by Michael Jordan and from it I remember a couple of really powerful concepts, if not the detail. One was the Bayes Ball, and the other was symbolic dynamics.

I’ve briefly tinkered with a very simple symbolic dynamics representation before in an attempt to come up with signatures for identifying different sorts of simple dynamics for summarising a driver’s performance in rally stages (e.g. Detecting Features in Data Using Symbolic Coding and Regular Expression Pattern Matching) and I’ve started wondering again about whether the approach might also be useful in trying to capture something of the wiggliness of rally stage routes.

To this end, the following quote looks relevant, even if it does come from a paper on heart rate dynamics in rats:

Symbolic Dynamics

The symbolic dynamics method, proposed by Porta, aims to convert the CI and SAP series in a sequence of symbols and evaluates the dynamics of each three consecutive symbols (words). First, a procedure known as uniform quantization is applied to the CI or SAP series, where the full range of values is divided into six equal levels. Each quantization level is represented by a symbol (0 to 5) and all points within the same level will be assigned the same symbol. Next, sequences of three consecutive symbols (words) are evaluated and classified according to its variation pattern: zero variation (0V), one variation (1V), two like variations (2LV) or two unlike variations (2UV).

The 0V family comprises words where there is no variation between symbols, i.e., all symbols are equal. The sequences {0,0,0} and {3,3,3} are examples of sequences from this class. The 1V family represents words that have only one variation from one symbol to another, i.e. sequences with two consecutive equal symbols and one different. Examples of sequences of this family are {5,2,2} and {0,0,1}. The 2LV family is composed of words containing three different symbols but with the same variations direction, i.e. in ascending or descending order. Examples of sequences of this family are {1,2,5} and {3,2,1}. Lastly, 2UV family comprises sequences that form a peak or a valley, i.e. with two different variations, in opposite directions. The sequences {2,4,2} and {3,0,1} are examples of this family.

Once this classification is made for the entire series, the percentage of patterns classified in each family is used for analysis.

Silva, L.E.V., Geraldini, V.R., de Oliveira, B.P. et al. Comparison between spectral analysis and symbolic dynamics for heart rate variability analysis in the rat. Sci Rep 7, 8428 (2017).

So, something to play with there: three tuple sequences and the changes within them, which could perhaps be useful for identifying right-left-right / left-right-left sections in a route etc. Hmm…
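As a starting point for playing with that, here is a minimal Python sketch of the quantise-then-classify steps described in the quote. The function names (`quantize`, `classify_word`, `family_percentages`) are made up for illustration and are not taken from the paper; the input series could be any signal sampled along a route, such as the turn angle at each step.

```python
from collections import Counter

def quantize(series, levels=6):
    """Uniform quantisation: map each value to a symbol 0..levels-1."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0] * len(series)
    # The maximum value would land on `levels`, so clip it into the top level
    return [min(int((x - lo) * levels / (hi - lo)), levels - 1) for x in series]

def classify_word(a, b, c):
    """Classify a three-symbol word by its variation pattern."""
    d1, d2 = b - a, c - b
    if d1 == 0 and d2 == 0:
        return "0V"    # no variation, e.g. {3,3,3}
    if d1 == 0 or d2 == 0:
        return "1V"    # one variation, e.g. {5,2,2}
    if d1 * d2 > 0:
        return "2LV"   # two like variations, e.g. {1,2,5}
    return "2UV"       # two unlike variations (peak or valley), e.g. {2,4,2}

def family_percentages(series, levels=6):
    """Percentage of overlapping three-symbol words falling in each family."""
    symbols = quantize(series, levels)
    words = list(zip(symbols, symbols[1:], symbols[2:]))
    if not words:
        return {}
    counts = Counter(classify_word(*w) for w in words)
    return {fam: 100 * counts[fam] / len(words)
            for fam in ("0V", "1V", "2LV", "2UV")}

# Made-up signal, purely for illustration
print(family_percentages([0, 1, 2, 1, 0, 0, 0]))
```

For a route, a high 2UV percentage would hint at lots of quick changes of direction, though whether six levels and three-symbol words are sensible choices for stage routes is an open question.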

Thinks Another: Using Spectrograms to Identify Stage Wiggliness?

Last night I started wondering about ways in which I might be able to use signal processing (Fourier analysis) or symbolic dynamics (eg Thinks: Symbolic Dynamics for Categorising Rally Stage Wiggliness?) to help categorise the nature of rally stage twistiness.

Over a morning coffee break, I reminded myself of spectrograms, graphical devices that chunk a time series into a sequence of steps, and then display a frequency plot of each part. Which got me wondering: could I use a spectrogram to segment a stage route and analyse the spectrum of some signal taken along the route to identify wiggliness at that part of the stage?

If I’m reading it right [I wasn’t… the distances were wrong for a start: note to self – check the default parameter settings!], I think the following spectrogram does show some possible differences in wiggliness for different segments along the stage?


The question then becomes: what signal (as a function of distance along line) to use? The above spectrogram is based on the perpendicular distance of the route from the straight line connecting the start and end points of the route.

library(sf)
library(trajr)
library(magrittr)  # for %>%

# trj is a trajr route
# route_crs is an assumed name here: the projected (UTM) CRS of the original route
straight = st_linestring(data.matrix(rbind(head(trj[,c('x','y')], 1),
                                           tail(trj[,c('x','y')], 1))))

straight_sf = st_sfc(straight, crs = route_crs)

# Resample the route at 10m steps
trj_d = TrajRediscretize(trj, 10)
utm_discretised = trj_d %>% 
                    sf::st_as_sf(coords = c("x","y")) %>% 
                    sf::st_set_crs(route_crs)

# Get the rectified distance from the midline
# Can we also get whether it's to left or right?
perp_distances = data.frame(d_ = st_distance(utm_discretised, straight_sf))
# Returned distance is given as units
perp_distances$d = as.integer(perp_distances$d_)

perp_distances$i = 10 * (1:nrow(perp_distances))
#perp_distances$i = units::set_units(10 * (1:nrow(perp_distances)), 'm')
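
On the "left or right?" question in the comments above, one possible approach is to use the sign of a 2D cross product. The following is a minimal Python sketch rather than a translation of the R code: the function name is hypothetical, it assumes the route is a list of projected (x, y) points in metres, and it measures distance from the infinite line through the start and end points rather than from the line segment.

```python
import math

def signed_perp_distances(points):
    """Signed distance of each point from the start-to-end straight line.

    Positive values lie to the left of the direction of travel from the
    first point to the last point; negative values lie to the right.
    """
    (x0, y0), (x1, y1) = points[0], points[-1]
    dx, dy = x1 - x0, y1 - y0
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("start and end points coincide")
    # 2D cross product of the line direction with the vector to each point
    return [(dx * (y - y0) - dy * (x - x0)) / length for x, y in points]

# Made-up route: wanders left, then right, of the straight line
print(signed_perp_distances([(0, 0), (1, 1), (2, -1), (3, 0)]))
# → [0.0, 1.0, -1.0, 0.0]
```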

We can then do something like a high pass filter:


# High pass filter
# butter() and filter() are from the signal package
library(signal)
bf <- butter(2, 0.9, type="high")
perp_distances$d_hi <- filter(bf, perp_distances$d)

and generate the spectrogram shown above:

# We could just plot this direct
spec = specgram(perp_distances$d_hi)

# Or make pretty
# Via:
# discard phase information
P = abs(spec$S)

# normalize
P = P/max(P)

# convert to dB
P = 10*log10(P)

# config time axis
t = spec$t

# plot spectrogram
# imagep() and oce.colorsViridis are from the oce package
library(oce)
imagep(x = t,
       y = spec$f,
       z = t(P),
       col = oce.colorsViridis,
       ylab = 'Frequency [Hz]',
       xlab = 'Time [s]',
       drawPalette = T,
       decimate = F)
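
The chunk-and-transform idea behind a spectrogram can also be sketched without any signal processing package. The following Python sketch is an illustration only, not what specgram does internally: it applies no window function, the function names are made up, and the default `window` and `step` values are arbitrary assumptions.

```python
import cmath
import math

def dft_magnitudes(chunk):
    """Magnitude of the discrete Fourier transform for bins 0..n/2."""
    n = len(chunk)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(chunk)))
            for k in range(n // 2 + 1)]

def spectrogram(signal, window=32, step=16):
    """One spectrum per overlapping chunk taken along the signal."""
    return [dft_magnitudes(signal[s:s + window])
            for s in range(0, len(signal) - window + 1, step)]

def dominant_bin(frame):
    """Index of the strongest frequency bin in a single spectrum."""
    return max(range(len(frame)), key=frame.__getitem__)

# Made-up signal: a gently weaving section followed by a twisty one
gentle = [math.sin(2 * math.pi * i / 16) for i in range(64)]
twisty = [math.sin(2 * math.pi * i / 4) for i in range(64)]
frames = spectrogram(gentle + twisty)
print(dominant_bin(frames[0]), dominant_bin(frames[-1]))
# → 2 8
```

If the signal is sampled every 10m along the route, the "frequency" is really cycles per unit distance: bin k of a 32 sample window corresponds to a wavelength of 320/k metres, so the axes are distance and spatial frequency rather than seconds and Hz.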

However, it would possibly make more sense to use something like the angle of turn, convexity index, or radius of curvature at each 10m step as the signal…
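
As a rough sketch of the first of those alternatives, the angle of turn at each step can be found from the change in heading between consecutive segments. This is Python rather than R, the function name is hypothetical, and it assumes the route has already been resampled to evenly spaced (x, y) points in a projected co-ordinate system.

```python
import math

def turn_angles(points):
    """Signed change of heading, in degrees, at each interior point.

    Positive angles are left turns, negative angles are right turns.
    """
    angles = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        ax, ay = x1 - x0, y1 - y0   # incoming segment
        bx, by = x2 - x1, y2 - y1   # outgoing segment
        cross = ax * by - ay * bx
        dot = ax * bx + ay * by
        angles.append(math.degrees(math.atan2(cross, dot)))
    return angles

# Made-up route: head east, turn left (north), then turn right (east again)
print(turn_angles([(0, 0), (1, 0), (1, 1), (2, 1)]))
```

Unlike the perpendicular distance signal, this one is signed by construction, so it also says something about whether a corner is a left or a right.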


Related: Rapid ipywidgets Prototyping Using Third Party Javascript Packages in Jupyter Notebooks With jp_proxy_widget (example of a wavesurfer.js spectrogram js app widgetised for use in Jupyter notebooks).

If you listen to that track it’s really interesting seeing how the imagery maps onto the sound. Eg in the above image you can see a lag in an edge between right and left channels towards the end of the trace, which translates to hearing an effect in the left channel echoed a moment later in the right.

Which makes me think: could I use telemetry from two drivers as left and right stereo tracks and try to sonify the telemetry differences between them using distance along stage as the x axis value and some mapping of different telemetry channels onto frequency…? For example, brake on the bass, throttle at the top, and lateral acceleration in the mid-range?

The Analytics Trap

Via the twitterz, a caution: is data analysis bad for rallying?

I’ve been tinkering with bits and bobs of rally related data for the last few weeks, and I can see how easy it might be for teams and sports data analysts to fall into the trap of always looking to the data to find fixes for a poor performance or a poor result, and perhaps losing sight of the human element and challenge that provides the basis of any sport.

The particular context I’m interested in is using data to support human storytelling around, or explanation of, what happened in a particular event. A part of this might involve using natural language generation (“data to text”) strategies to generate summaries. But my intention would not be to automate out human reporters. It would be to provide possible storypoints or observations much in the same way a not totally reliable or judicious witness might feed observations in that may, or may not, be useful to a journalist creating a report.

Similarly, any automatically generated reports or commentary (such as it might be) might also be viewed by fans as an enthusiastic, well meaning, or even pub bore level champion of “the data”: fine, in moderation, and perhaps okay to spend some time with for a bit of inside baseball level of (potentially misunderstood) geekiness, but not a replacement for a well crafted piece of sportswriting from a specialist sports journalist.

Even the visual reports I produce are not intended to mean anything in and of themselves. They are throwaway sketches over the data intended as visual cribs that provide a stylised macroscopic view of a large amount of data to help the reader spot stories that might be hinted at by the data and act as a starting point for a more considered human level interpretation.

“Because the data…” is not what I’m looking for; “ooh, does that mean…?” and then looking for corroboration elsewhere, and at a far more human(e) level of interpretation, is much more what I’m after…

Punk is an aesthetic I never really subscribed to…

… the mohicans, the fashion sense and the apparently nihilistic attitude, the appearance of potentially looming violence, the drugs of preference — needles have no place in recreation other than knitting (?!) — and that particular subculture…

…which isn’t to say I didn’t know folk who other people classed as punks, but we were not quite goths, not quite ‘evvy metal, sort of grebo, not quite hippy, not quite rock and not quite blues, not crusties, not travellers (hitchers, yes, most definitely), and definitely not ravers (though some probably were).

Three or four years ago I started hitting the road again, trekking round the country following Hands Off Gretel, who tend towards the punk aesthetic with dayglo overtones, but the tunes I like are the poppy ones of Nevermind era Nirvana, crossed with Pink, and voice to match in both respects.

Over the last few years, I’d seen Ferocious Dog t-shirts, hoodies, caps and more getting ever more prevalent at the festivals we frequent, but from the look of the stage photos they were “punk” so not my thing…

…till I heard them, of course, and the folk punk rock melodies and social political nature, the family feel of an FD gig and the merch you can’t not get a habit for once you get the habit means they are hugely habit forming…

…and despite the lockdown and the many tickets to gigs that keep getting rolled back, we had Thosdis to look forward to, and Red Ken’s lockdown sessions (what could possibly go wrong…) and amongst the classics (Ken only plays classics), some new bands to me I’d not really heard before, done as solo acoustics, from punk named bands but with melodies and rhythms to die for…

…so enter Rancid and Social Distortion to my regular listening mix….

…and a thought that maybe, maybe, I need to start listening a bit more widely to the punk rock back catalogue, because there are some cracking tunes out there, and even if the aesthetic isn’t your thing, the melodies may be…

… and some fantastically singalong-a-lyrics, particularly in the choruses…

Punk? Not me, not never, ever… But maybe, maybe, I need to rethink what I thought I thought I thought I understood by punk rock.

FWIW, I always thought of the Sex Pistols as a rock band (at least as far as Bollocks goes…); and Green Day; and Dog’s d’Amour (whom Social Distortion keep reminding me of….). And Iggy & the Stooges; and The Ramones. And the mother of all rock and roll bands: Motörhead.

From Visual Impressions to Visual Opinions

In The Analytics Trap I scribbled some notes on how I like using data not as a source of "truth", but as a lens, or a perspective, from a particular viewpoint.

One idea I’ve increasingly noticed being talked about explicitly across various software projects I follow is the idea of opinionated software and opinionated design.

According to the Basecamp bible, Getting Real, [th]e best software takes sides. … [Apps should] have an attitude. This seems to lie at the heart of opinionated design.

A blog post from 2015, The Rise of Opinionated Software, presents a widely shared definition: Opinionated Software is a software product that believes a certain way of approaching a business process is inherently better and provides software crafted around that approach. Other widely shared views relate to software design: opinionated software should have "a view" on how things are done and should enforce that view.

So this idea of opinion is perhaps one we can riff on.

I’ve been playing with data for years, and one of the things I’ve believed, throughout, in my opinionated way, is that it’s an unreliable and opinionated witness.

In the liminal space between wake and sleep this morning, I started wondering about how visualisations in particular could range from providing visual impressions to visual opinions.

For example, here’s a view of a rally stage, overlaid onto a map:

This sort of thing is widely recognisable to anybody who has used an online map, and anyone who has seen a printed map and drawn a route on it.

Example interactive map view

Here’s a visual impression of just the route:

View of route

Even this view is opinionated because the co-ordinates are projected to a particular co-ordinate system, albeit the one we are most familiar with when viewing online maps; but other projections are available.

Now here’s a more opinionated view of the route, with it cut into approximately 1km segments:

Or the chart can express an opinion about where it thinks significant left and right hand corners are:

The following view has strong opinions about how to display each kilometer section: not only does it make claims about where it thinks significant right and left corners are, it also rotates each segment so the start and end points of the section lie on the same horizontal line:

Another viewpoint brings in another dimension: elevation. It also transforms the flat 2D co-ordinates of each point along the route to a 1-D distance-along-route measure allowing us to plot the elevation against a 1-D representation of the route in a 2D (1D!) line chart.

Again, the chart expresses an opinion about where the significant right and left corners are. The chart also chooses not to be more helpful than it could be: if vertical grid lines corresponded to the start and end distance-into-stage values for the segmented plots, it would be easier to see how this chart relates to the 1km segmented sections.

At this point, you may say that the points are "facts" from the data, but again, they really aren’t. There are various ways of trying to define the intensity of a turn, and there may be various ways of calculating any particular measure that give slightly different results. Many definitions rely on particular parameter settings (for example, if you measure radius of curvature from three points on a route, how far should those points be apart? 1m? 10m? 20m? 50m?).
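
To make the parameter sensitivity concrete, here is a small Python sketch that estimates radius of curvature as the radius of the circle through three route points. It is not the calculation used for the charts in this post: the route is a made-up right-angled corner sampled every metre, and the function names are hypothetical.

```python
import math

def circumradius(p1, p2, p3):
    """Radius of the circle through three points (inf if collinear)."""
    a, b, c = math.dist(p1, p2), math.dist(p2, p3), math.dist(p1, p3)
    twice_area = abs((p2[0] - p1[0]) * (p3[1] - p1[1])
                     - (p3[0] - p1[0]) * (p2[1] - p1[1]))
    if twice_area == 0:
        return math.inf
    # R = abc / (4 * area)
    return a * b * c / (2 * twice_area)

def radius_at(route, i, spacing):
    """Radius estimate at point i, using points `spacing` steps either side."""
    return circumradius(route[i - spacing], route[i], route[i + spacing])

# Made-up route sampled every 1m: 50m east, then a sharp left, then 50m north
route = ([(float(x), 0.0) for x in range(51)]
         + [(50.0, float(y)) for y in range(1, 51)])
corner = 50  # index of the corner point

for spacing in (1, 10, 20, 50):
    print(spacing, round(radius_at(route, corner, spacing), 2))
# → 1 0.71
# → 10 7.07
# → 20 14.14
# → 50 35.36
```

The same corner is a hairpin or a fast sweeper depending on nothing more than how far apart the three points are.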

The "result" is only a "fact" insofar as it represents the output of a particular calculation of a particular measure using a particular set of parameters, things that are typically not disclosed in chart labels, often aren’t mentioned in chart captions, and may or may not be disclosed in the surrounding text.

On the surface, the chart is simply expressing an opinion about how tight any of the particular corners are. If we take it at face value, and trust its opinion is based on reasonable foundations, then we can accept (or not accept) the chart’s opinion about where the significant turns are.

If we were really motivated to understand the chart’s opinion further, if we had access to the code that generated it we could start to probe its definition of "significant curvature" to see if we agree with the principles on which the chart has based its opinion. But in most cases, we don’t do that. We take the chart for what it is, typically accept it for what it appears to say, and ascribe some sort of truth to it.

But at the end of the day, it’s just an opinion.

The charts were generated using R based on ideas inspired by Visualising WRC Rally Stages With rayshader and R [repo].

When Less is More: Data Tables That Make a Difference

In the previous post, From Visual Impressions to Visual Opinions, I gave various examples of charts that express opinions. In this post, I’ll share a few examples of how we can take a simple data table and derive multiple views from it that each provide a different take on the same story (or does that mean, tells different stories from the same set of "facts"?)

Here’s the original, base table, showing the recorded split times from a single rally stage. The time is the accumulated stage time to each split point (i.e. the elapsed stage time you see for a driver as they reach each split point):

From this, we immediately note the ordering (more on this in another post) which seems not useful. It is, in fact, the road order (i.e. the order in which each driver started the stage).

We also note that the final split is not the actual final stage time: the final split in this case was a kilometer or so before the stage end. So from the table, we can’t actually determine who won the stage.

Making a Difference

The times presented are the actual split times. But one thing we may be more interested in is the differences between them, to see how far ahead of, or behind, one driver another driver was at a particular point. We can subtract one driver’s time from another’s to find this difference. For example, how did the times at each split compare to first on road Ogier’s (OGI)?

Note that we can “rebase” the table relative to any driver by subtracting the required driver’s row from every other row in the original table.

From this “rebased” table, which has fewer digits (less ink) in it than the original, we can perhaps more easily see who was in the lead at each split, specifically, the person with the minimum relative time. The minimum value is trivially the most negative value in a column (i.e. at each split), or, if there are no negative values, the minimum zero value.
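
A minimal sketch of the rebasing step, in plain Python rather than whatever was used to produce the tables in this post. The split times are made up, the driver codes are just labels, and the function names are hypothetical.

```python
def rebase(times, driver):
    """Subtract the chosen driver's row from every row of the table."""
    base = times[driver]
    return {d: [round(t - b, 3) for t, b in zip(row, base)]
            for d, row in times.items()}

def leader_at_each_split(table):
    """Driver with the minimum time in each column."""
    n_splits = len(next(iter(table.values())))
    return [min(table, key=lambda d: table[d][i]) for i in range(n_splits)]

# Made-up elapsed times (in seconds) at three split points, in road order
times = {
    "OGI": [100.0, 205.0, 330.0],
    "TAN": [98.5, 204.0, 331.0],
    "SOL": [101.0, 203.5, 327.0],
}

rebased = rebase(times, "OGI")
print(rebased["TAN"])                 # → [-1.5, -1.0, 1.0]
print(leader_at_each_split(rebased))  # → ['TAN', 'SOL', 'SOL']
```

Whichever driver the table is rebased against, the leader at each split comes out the same, because subtracting a constant from a column does not change which row holds the minimum.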

As well as subtracting one row from every other row to find the differences relative to a specified driver, we can also subtract the first column from the second, the second from the third etc to find the time it took to get from one split point to the next (we subtract 0 from the first split point time since the elapsed time into stage at the start of the stage is 0 seconds).

The above table shows the time taken to traverse the distance from one split point to the next; the extra split_N column is based on the final stage time. Once again, we could subtract one row from all the other rows to rebase these times relative to a particular driver to see the difference in time it took each driver to traverse a split section, relative to a specified driver.
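
The column differencing can be sketched in the same way. Again this is an illustrative Python sketch with made-up numbers and a hypothetical function name; the final stage time is simply appended to the row to give the extra section from the last split to the finish.

```python
def section_times(elapsed):
    """Time taken for each section, given elapsed times at each split.

    The elapsed time at the start of the stage is taken to be 0 seconds.
    """
    previous = [0.0] + list(elapsed[:-1])
    return [round(t - p, 3) for p, t in zip(previous, elapsed)]

# Made-up elapsed times at three splits, plus a final stage time
splits = [100.0, 205.0, 330.0]
stage_time = 362.5

print(section_times(splits))                 # → [100.0, 105.0, 125.0]
print(section_times(splits + [stage_time]))  # → [100.0, 105.0, 125.0, 32.5]
```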

As well as rebasing relative to an actual driver, we can also rebase relative to variously defined “ultimate” drivers. For example, if we find the minimum of each of the “split traverse” table columns, we create a dummy driver whose split section times represent the ultimate quickest times taken to get from one split to the next. We can then subtract this dummy row from every row of the split section times table:

In this case, the 0 in the first split tells us who got to the first split first, but then we lose information (without further calculation) about anything other than relative performance on each split section traverse. Zeroes in the other columns tell us who completed that particular split section traverse in the quickest time.

Another class of ultimate time dummy driver is the accumulated ultimate section time driver. That is, take the ultimate split sections then find the cumulative sum of them. These times then represent the dummy elapsed stage times of an ultimate driver who completed each split in the fastest split section time. If we rebase against that dummy driver:

In this case, there may be only a single 0, specifically at the first split.

A third possible ultimate dummy driver is the one who “as if” recorded the minimum actual elapsed time at each split. Again, we can rebase according to that driver:

In this case, there will be at least one zero in each column (for the driver who recorded that particular elapsed time at each split).
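
The three "ultimate" dummy drivers described above can be sketched as follows. This is a plain Python illustration using made-up times; the helper names are hypothetical and are not taken from the code behind the tables.

```python
from itertools import accumulate

def section_times(elapsed):
    """Section times from elapsed times (the stage starts at 0 seconds)."""
    return [round(t - p, 3) for p, t in zip([0.0] + list(elapsed[:-1]), elapsed)]

def column_min(table):
    """Minimum value in each column of a driver -> row table."""
    return [min(col) for col in zip(*table.values())]

def rebase_to(table, base):
    """Subtract a (dummy) driver's row from every row of the table."""
    return {d: [round(t - b, 3) for t, b in zip(row, base)]
            for d, row in table.items()}

# Made-up elapsed times (in seconds) at three split points
times = {
    "OGI": [100.0, 205.0, 330.0],
    "TAN": [98.5, 204.0, 331.0],
    "SOL": [101.0, 203.5, 327.0],
}
sections = {d: section_times(row) for d, row in times.items()}

# 1. Quickest time recorded on each split section
ultimate_sections = column_min(sections)
# 2. Elapsed times of a driver who set every one of those quickest sections
accumulated_ultimate = list(accumulate(ultimate_sections))
# 3. Quickest actual elapsed time recorded at each split
ultimate_elapsed = column_min(times)

print(rebase_to(sections, ultimate_sections)["SOL"])  # → [2.5, 0.0, 0.0]
print(rebase_to(times, accumulated_ultimate)["TAN"])  # → [0.0, 3.0, 6.5]
print(rebase_to(times, ultimate_elapsed)["SOL"])      # → [2.5, 0.0, 0.0]
```

With these particular numbers the first and third views happen to give the same row for SOL, but the rows for the other drivers differ, which is rather the point: each dummy driver asks a slightly different question of the same table.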

Visualising the Difference

Viewing the above tables as purely numerical tables is fine as far as it goes, but we can also add visual cues to help us spot patterns, and different stories, more readily.

For example, looking at times rebased to the ultimate split section dummy driver, we get the following:

We see that SOL was flying from the second split onwards, getting from one split to another in pretty much the fastest time after a relatively poor start.

The variation in columns may also have something interesting to say. SOL somehow made time against pretty much everyone between split 4 and 5, but in the other sections (apart from the short last section to finish), there is quite a lot of variability. Checking this view against a split sectioned route map might help us understand whether there were particular features of the route that might explain these differences.

How about if we visualise the accumulated ultimate split section time dummy driver?

Here, we see that TAN was recording the best time compared to the ultimate time as calculated against the sum of best split section times, but was still off the ultimate pace: it was his first split that made the difference.

How about if we rebase against the dummy driver that represents the driver with the fastest actual recorded accumulated time at each split:

Here, we see that TAN led the stage at each split point based on actual accumulated time.

Remember, all these stories were available in the original data table, but sometimes it takes a bit of differencing to see them clearly…

Spellchecking Jupyter Notebooks with pyspelling

One of the things I failed to do at the end of last year was put together a spellchecking pipeline to try to pick up typos across several dozen Jupyter notebooks used as course materials.

I’d bookmarked pyspelling as a possible solution, but didn’t have the drive to do anything with it.

So with a need to try to correct typos for the next presentation (some students on the last presentation posted about typos but didn’t actually point out where they thought they were so we could fix them) I thought I’d have a look at whether pyspelling could actually help, having spotted a Github spellcheck action, rojopolis/spellcheck-github-actions, that reminded me of it (and that also happens to use pyspelling).

The pyspelling package uses matrix and pipeline ideas. The matrix lets you define and run separate pipelines; the pipelines let you sequence a series of filter steps. Available filters include markdown, html and python filters that preprocess files and pass text elements for spellchecking to the spellchecker. The Python filter allows you to extract things like comments and docstrings and run spell checks over those; the markdown and HTML filters can work together so you can transform markdown to HTML, then ignore the content of code, pre and tt tags, for example, and spell check the rest of the content. A url filter lets you remove URLs before spellchecking.

By default, there is no Jupyter notebook / ipynb filter, so I started off by running the spellchecker against Jupytext markdown files generated from my notebooks. A filter to strip out the YAML header at the start of the jupytext-md file was there to help minimise false positive typos from the spell checker report.

In passing, I often use a Jupytext pre-commit filter to commit a markdown version of Git committed notebooks to a hidden .md directory. For example, in .git/hooks/pre-commit, add the line: jupytext --from ipynb --to .md//markdown --pre-commit [docs]. Whenever you commit a notebook, a Jupytext markdown version of the notebook (ex- of the code cell output content) will also be added and committed into a .md hidden directory in the same directory as the notebook.

Here’s the first attempt at a pyspelling config file:

# -- .pyspelling.yml --

matrix:
- name: Markdown
  aspell:
    lang: en
  dictionary:
    wordlists:
    - .wordlist.txt
    encoding: utf-8
  pipeline:
  - pyspelling.filters.context:
      # Cribbed from pyspelling docs
      context_visible_first: true
      # Ignore YAML at the top of jupytext-md file
      # (but may also exclude other content?)
      delimiters:
        - open: '(?s)^(?P<open> *-{3,})$'
          close: '^(?P=open)$'
  - pyspelling.filters.url:
  - pyspelling.filters.markdown:
      markdown_extensions:
        - pymdownx.superfences:
  - pyspelling.filters.html:
      comments: false
      ignores:
        - code
        - pre
        - tt
  sources:
    - '**/.md/*.md'
  default_encoding: utf-8

Note that the config also includes a reference to a custom wordlist in .wordlist.txt that includes additional whitelist terms over the default dictionary.

Running pyspelling using the above configuration runs the spell checker over the desired files in the desired way: pyspelling > typos.txt

The output typos.txt file then has the form:

Misspelled words:
<htmlcontent> content/02. Getting started with robot and Python programming/02.1 Robot programming constructs.ipynb: html>body>p

Misspelled words:
<htmlcontent> content/02. Getting started with robot and Python programming/02.1 Robot programming constructs.ipynb: html>body>p

We can create a simple pandas script to parse the result and generate a report that counts the prevalence of particular typos. For example, something of the form:

datalog          37
dataset          32
pre              31
convolutional    19
RGB              17
pathologies       1
Microsfot         1
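
As a rough indication of the sort of script involved, here is a sketch that uses the standard library's Counter rather than pandas. It assumes that, in the report, the misspelled words for each record are listed one per line between two dashed rule lines; that layout is an assumption on my part and should be checked against a real typos.txt file. The sample text and the function name are made up.

```python
from collections import Counter

def count_typos(report_text):
    """Count how often each reported word appears across all records."""
    counts = Counter()
    in_words = False
    for line in report_text.splitlines():
        line = line.strip()
        # Assumed layout: a rule line of dashes opens and closes each word list
        if line and set(line) == {"-"}:
            in_words = not in_words
            continue
        if in_words and line:
            counts[line] += 1
    return counts

# Made-up report fragment in the assumed layout
sample = """Misspelled words:
<htmlcontent> notebook_a.md: html>body>p
--------------------
datalog
dataset
--------------------

Misspelled words:
<htmlcontent> notebook_b.md: html>body>p
--------------------
dataset
Microsfot
--------------------
"""

for word, n in count_typos(sample).most_common():
    print(f"{word:<15}{n}")
```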

One possible way of using that information is to identify terms that maybe aren’t in the dictionary but should be added to the whitelist. Another way of using that information might be to identify jargon or potential glossary terms. Reverse ordering the list is more likely to give you occasional typos; middling prevalence items might be common typos; and so on.

That recipe works okay, and could be used to support spell checking over a wide range of literate programming file formats (Jupyter notebooks, Rmd, various structured Python and markdown formats, for example). Basing the process around a format Jupytext exports into allows us to then have a Jupytext step at the front of a small-pieces-lightly-joined text file pipeline that takes a literate programming document, converts it to eg Jupytext-md, and then passes it to the pyspelling pipeline.

But a problem with that approach is that we are throwing away perfectly good structure in the original document. One of the nice things about the ipynb JSON format is that it separates code and markdown in a very clean way (and by so doing makes things like my innovationOUtside/nb_quality_profile notebook quality profiler relatively easy to put together). So can we create our own ipynb filter for pyspelling?
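
To show how clean that separation is, independently of pyspelling, here is a small sketch that pulls the text out of cells of a given type using only the standard library's json module. The function name and the `tags_exclude` parameter are made up for illustration, and the tiny notebook is invented.

```python
import json

def cell_sources(nb_json, cell_type="markdown", tags_exclude=()):
    """Return the source text of each cell of the requested type."""
    nb = json.loads(nb_json)
    exclude = set(tags_exclude)
    texts = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != cell_type:
            continue
        # Skip cells carrying any of the excluded tags
        if exclude & set(cell.get("metadata", {}).get("tags", [])):
            continue
        source = cell.get("source", "")
        # On disk, source may be a single string or a list of lines
        texts.append("".join(source) if isinstance(source, list) else source)
    return texts

# Invented three cell notebook
nb_json = json.dumps({
    "nbformat": 4, "nbformat_minor": 5, "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# A title\n", "Some text."]},
        {"cell_type": "code", "metadata": {}, "source": "print('hello')",
         "outputs": [], "execution_count": None},
        {"cell_type": "markdown", "metadata": {"tags": ["code-fails"]},
         "source": "Skipped text."},
    ],
})

print(cell_sources(nb_json, tags_exclude=["code-fails"]))
# → ['# A title\nSome text.']
```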

Cribbing the markdown filter definition, it was quite straightforward to hack a first pass attempt at an ipynb filter that lets you extract the content of code or markdown cells into the spell checking pipeline:

# -- --

"""Jupyter ipynb document format filter."""

from .. import filters
import codecs
import markdown
import nbformat

class IpynbFilter(filters.Filter):
    """Spellchecking Jupyter notebook ipynb cells."""

    def __init__(self, options, default_encoding='utf-8'):

        super().__init__(options, default_encoding)

    def get_default_config(self):
        """Get default configuration."""

        return {
            'cell_type': 'markdown', # Cell type to filter
            'language': '', # This is the code language for the notebook
            # Optionally specify whether code cell outputs should be spell checked
            'output': False, # TO DO
            # Allow tagged cells to be excluded
            'tags-exclude': ['code-fails']
        }

    def setup(self):

        self.cell_type = self.config['cell_type'] if self.config['cell_type'] in ['markdown', 'code'] else 'markdown'
        self.language = self.config['language'].upper()
        self.tags_exclude = set(self.config['tags-exclude'])

    def filter(self, source_file, encoding):  # noqa A001
        """Parse ipynb file."""

        nb = nbformat.read(source_file, as_version=4)
        self.lang = nb.metadata['language_info']['name'].upper() if 'language_info' in nb.metadata else None
        # Allow possibility to ignore code cells if language is set and
        # does not match parameter specified language? E.g. in extreme case:
        #if self.cell_type=='code' and self.config['language'] and self.config['language']!=self.lang:
        #    nb=nbformat.v4.new_notebook()
        # Or maybe better to just exclude code cells and retain other cells?

        encoding = 'utf-8'

        return [filters.SourceText(self._filter(nb), source_file, encoding, 'ipynb')]

    def _filter(self, nb):
        """Filter ipynb."""

        text_list = []
        for cell in nb.cells:
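            # Skip cells carrying any of the excluded tags
            if 'tags' in cell['metadata'] and \
                    self.tags_exclude.intersection(cell['metadata']['tags']):
                continue
            # Keep the source of cells of the requested type
            if cell['cell_type']==self.cell_type:
                text_list.append(cell['source'])

        return '\n'.join(text_list)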

    def sfilter(self, source):

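        # source.text is a string, so parse it into a notebook before filtering
        nb = nbformat.reads(source.text, as_version=4)
        return [filters.SourceText(self._filter(nb), source.context, source.encoding, 'ipynb')]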

def get_plugin():
    """Return the filter."""

    return IpynbFilter

We can then create a config file to run a couple of matrix pipelines: one over notebook markdown cells, one over code cells:

# -- ipyspell.yml --

matrix:
- name: Markdown
  aspell:
    lang: en
  dictionary:
    wordlists:
    - .wordlist.txt
    encoding: utf-8
  pipeline:
  - pyspelling.filters.ipynb:
      cell_type: markdown
  - pyspelling.filters.url:
  - pyspelling.filters.markdown:
      markdown_extensions:
        - pymdownx.superfences:
  - pyspelling.filters.html:
      comments: false
      #  - '*|*:not(script,style,code)'
      #  - 'code > *:not(.c1)'
      ignores:
        - code
        - pre
        - tt
  sources:
    - 'content/*/*.ipynb'
    #- '**/.md/*.md'
  default_encoding: utf-8
- name: Python
  aspell:
    lang: en
  dictionary:
    wordlists:
    - .wordlist.txt
    encoding: utf-8
  pipeline:
  - pyspelling.filters.ipynb:
      cell_type: code
  - pyspelling.filters.url:
  - pyspelling.filters.python:
  sources:
    - 'content/*/*.ipynb'
    #- '**/.md/*.md'
  default_encoding: utf-8

We can then run that config as: pyspelling -c ipyspell.yml > typos.txt

The following Python code then generates a crude dataframe of the results:

import pandas as pd

fn = 'typos.txt'
with open(fn,'r') as f:
    txt = f.readlines()

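# Parse the aspell report into rows of (filename, cell_type, typo)
rows = []

currfile = ''
cell_type = ''

for t in txt:
    t = t.strip()
    # Skip blank lines, report headings and separator rules
    if not t or t in ['Misspelled words:', '!!!Spelling check failed!!!'] or t.startswith('-----'):
        continue
    # Context lines identify the notebook and the cell type being reported
    if t.startswith('<htmlcontent>') or t.startswith('<py-'):
        if t.startswith('<html'):
            cell_type = 'md'
        elif t.startswith('<py-'):
            cell_type = 'code'

        currfile = t.split('/')[-1].split('.ipynb')[0]#+'.ipynb'
        continue
    # Anything else is a misspelled word
    rows.append({'filename': currfile, 'cell_type': cell_type, 'typo': t})

# DataFrame.append was removed in pandas 2.0, so build the frame in one go
df = pd.DataFrame(rows, columns=['filename', 'cell_type', 'typo'])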

The resulting dataframe lets us filter by code or markdown cell:
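For example, a minimal sketch using a hand-made dataframe with the same columns (the filenames and typos below are invented for illustration):

```python
import pandas as pd

# Hypothetical rows in the same shape as the parsed report
df = pd.DataFrame([
    {'filename': '01.1 Intro', 'cell_type': 'md', 'typo': 'teh'},
    {'filename': '01.1 Intro', 'cell_type': 'code', 'typo': 'idx'},
    {'filename': '02.1 Robots', 'cell_type': 'md', 'typo': 'datalog'},
])

# Boolean masks select the rows for one cell type at a time
md_typos = df[df['cell_type'] == 'md']
code_typos = df[df['cell_type'] == 'code']

print(md_typos)
print(code_typos)
```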

We can also generate reports over the typos found in markdown cells, grouped by notebook:

df_group = df[(df['filename'].str.startswith('0')) & (df['cell_type']=='md')][['filename','typo']].groupby('filename')
for key, item in df_group:
    # item is the sub-frame for one notebook; count each (filename, typo) pair
    print(item.value_counts(), "\n\n")

This gives basic results of the form:

Something that might be worth exploring is a tool that presents a user with a form that lets them enter (or select from a list of options?) a corrected version and that will then automatically fix the typo in the original file. To reduce the chance of false positives, it might also be worth showing the typo in its original context using the sort of display that is typical in a search engine results snippet, for example (e.g. ouseful-testing/nbsearch).
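As a rough sketch of the "typo in context" idea, using only the standard library: the `typo_in_context` helper, its `width` parameter and the `**...**` highlighting are all assumptions made for this example, not part of pyspelling or nbsearch.

```python
import re

def typo_in_context(text, typo, width=30):
    """Return a snippet for each whole-word occurrence of typo in text."""
    snippets = []
    for m in re.finditer(r'\b' + re.escape(typo) + r'\b', text):
        # Take up to `width` characters either side of the match
        start = max(m.start() - width, 0)
        end = min(m.end() + width, len(text))
        # Mark truncation so the reader knows the snippet is partial
        prefix = '...' if start > 0 else ''
        suffix = '...' if end < len(text) else ''
        snippets.append(prefix + text[start:m.start()]
                        + '**' + m.group(0) + '**'
                        + text[m.end():end] + suffix)
    return snippets

text = "Another way of using that infomation might be to identify jargon."
snips = typo_in_context(text, 'infomation', width=15)
print(snips)
```

Matching on word boundaries means a flagged word is not highlighted where it only appears inside a longer word, which may or may not be what is wanted when reviewing typos in code cells.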