Notes on Narrative Science and Automated Insights
In October 2009, the New York Times Media Decoder blog picked up on a story that had been doing the rounds about a research project called Stats Monkey from the Intelligent Information Laboratory at Northwestern University. The Robots Are Coming!, it declared, with the immediate rejoinder, Oh, They’re Here. Using play-by-play baseball data, Stats Monkey produced human-readable reports of a baseball game: formulaic, admittedly, but good enough, particularly when complemented by quotes from a post-match press conference report. Mechanical churnalism complementing data-driven analysis, cast into prose. (It’s worth noting that the Media Decoder post itself is little more than a restatement of what was presumably the Stats Monkey website blurb at the time.)
In April 2010, Bloomberg Businessweek Magazine asked Are Sportswriters Really Necessary?, describing how Narrative Science, a company incorporated at the start of that year as a spin-out from the Stats Monkey project, had teamed up with the Big Ten Network to produce automatically generated sports reports, a relationship that presumably continues to this day.
A year later, in June 2011, Forbes magazine published a report about GameChanger and Narrative Science: Fulfilling the Heretofore Unrealized Demand for Stilted Stories About Children’s Games, describing a tie-up between Narrative Science and GameChanger, a company that produces a scorekeeping app that allows sports fans, parents and coaches to capture data about a match.
(What other companies/apps are out there for crowdsourcing sports analytics in this way, I wonder?)
Using GameChanger data and Narrative Science story generation tools, it was possible to automate the creation of match reports for very small audiences. I don’t know if these stories used to be freely accessible, but today the match reports appear to take the form of paywalled recap stories.
Paywall aside, examples of other stories generated by Narrative Science using GameChanger data can be found using a simple web search on the phrase “Powered by Narrative Science and GameChanger Media”.
You can also just search for the byline, as for example it appears in this report:
In passing, it’ll be interesting to see how automatically generated stories start to feed into the glitch aesthetic (h/t @danmcquillan for introducing me to this phrase and the related notion of the new aesthetic in his presentation at #opentech last week).
September 2011 saw a media outlook report from Mediabistro’s Media Jobs Daily noting that Narrative Science’s ‘Robot Journalists’ Now Tackling Real Estate. The story links through to a page on Builder Online that provides a summary report of housing data for various US cities.
What this example, and the GameChanger example, show is how the generation of timely text stories can be automated on top of regularly updated datasets. The use of natural language interpretive text to describe patterns observed in the underlying data presumably also has SEO benefits.
That same month, September 2011, saw another stats-to-insight company, again emerging from the automated interpretation of sports data, rename itself from StatSheet to Automated Insights. Today, StatSheet continues to publish game recaps combining short natural language summaries with statistical charts, all of which are presumably automatically generated. Within a year, the parent company, Automated Insights, had scaled up and begun publishing recaps for Yahoo!’s fantasy sports matches.
More recently, Automated Insights have started producing realtime content feeds to support sports commentators – Real-time Insights for MLB – as well as feeding consumers via the stat.us powered Twitter feeds.
(See also: yseop, a French company that generates automated reports from data. [Any more?])
Fast forward to the start of 2013, and Narrative Science started publishing human readable prose reports based on US schools data (ProPublica: How To Edit 52,000 Stories at Once). They’re also doing a lot more work with financial reporting, for example with Forbes as well as for financial services clients, as this interview with Narrative Science’s Stuart Frankel describes.
Generating human readable reports from Google Analytics data and dashboards also appears to be a hot topic, with both Narrative Science (Automated Insight From Google Analytics With Quill) and Automated Insights (With Site Ai, Automated Insights Provides A Cliffs Notes Version Of Your Web Analytics) recently developing tools around this topic.
What I thought was particularly interesting about the ProPublica example was how it suggests a possible widespread future use of “automatically generated insight” pulling out headline interpretations from open data sets, as touched on in this great introductory technical presentation by Narrative Science’s Larry Adams (which also happens to mention the possibility of Narrative Science offering platform services via an API…? It also mentions work with the NHS?):
At one point during that presentation, Larry Adams suggests that Narrative Science use a small set of narrative templates or story types (“the horserace” for example, or “top 10”) to frame the construction of their stories, as well as mentioning the sorts of features that they look for within a data set (trends and changes in trends, for example, or outliers). Another presentation, this time by Narrative Science’s Kris Hammond, also hints at some of the features they look for in data: “inflexion points, trends, correlations”.
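To make the idea of a story type plus a feature detector a little more concrete, here is a minimal Python sketch. To be clear, this is my own guess at the general shape of the approach, not anything Narrative Science has described: the function names, the sentence template and the z-score threshold are all hypothetical, and the scores are made up for illustration.

```python
from statistics import mean, stdev


def top_n_sentence(scores, n=3, label="teams"):
    """Fill a hypothetical "top N" sentence template from a name -> value mapping."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
    names = ", ".join(f"{name} ({value})" for name, value in ranked)
    return f"The top {len(ranked)} {label} were {names}."


def outliers(scores, threshold=2.0):
    """Return names whose value lies more than `threshold` sample standard
    deviations from the mean (a crude z-score test, assumed for illustration)."""
    values = list(scores.values())
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [name for name, v in scores.items() if abs(v - mu) / sigma > threshold]


# Made-up example data
scores = {"A": 10, "B": 12, "C": 11, "D": 9, "E": 10, "F": 11, "G": 10, "H": 40}
print(top_n_sentence(scores, n=2))
print(outliers(scores))
```

The design choice here is simply to keep the "what is interesting?" step (ranking, outlier spotting) separate from the "how do we say it?" step (the template), so that either can be swapped out independently.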
So what sorts of techniques might we use ourselves to start generating the insights that we might be able to work up into simple narrative sentences, at least for starters?
Top 10, bottom 5 are easy pickings if we can rank the data somehow. I thought this trick for detecting inflexions by coding a time series symbolically and then using a regular expression to detect features was really interesting: Finding patterns in time series using regular expressions. And I wonder, how does the OpenSecrets anomaly tracker define the anomalies it detects?
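As a rough illustration of the symbolic coding trick, the sketch below encodes each step of a time series as U (up), D (down) or F (flat), then uses a regular expression to find sustained rises followed by sustained falls. This is my own minimal reading of the general idea rather than the code from the linked post; the symbol alphabet, the `encode` and `find_peaks` names and the `min_run` parameter are all assumptions.

```python
import re


def encode(series, tolerance=0.0):
    """Code each step between consecutive values as U (up), D (down) or F (flat)."""
    symbols = []
    for prev, curr in zip(series, series[1:]):
        delta = curr - prev
        if delta > tolerance:
            symbols.append("U")
        elif delta < -tolerance:
            symbols.append("D")
        else:
            symbols.append("F")
    return "".join(symbols)


def find_peaks(series, min_run=2):
    """Return indices in `series` where a run of at least `min_run` rises
    is immediately followed by a run of at least `min_run` falls."""
    coded = encode(series)
    pattern = re.compile(r"U{%d,}(?=D{%d,})" % (min_run, min_run))
    # Symbol i describes the step from series[i] to series[i + 1], so the
    # end of the matched run of U's is the index of the turning point itself.
    return [match.end() for match in pattern.finditer(coded)]


series = [1, 2, 3, 2, 1, 1, 2, 4, 3, 2]
print(encode(series))      # UUDDFUUDD
print(find_peaks(series))  # [2, 7]
```

Swapping the pattern for something like `D{2,}(?=U{2,})` would pick out troughs instead, and each detected index could then be handed to a sentence template ("sales peaked in week 7 before falling back").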
PS It seems that generating text summaries from data may be something the intelligence services are also interested in: The CIA Invests in Narrative Science and Its Automated Writers
Other posts you might be interested in:
- The Tesco Data Business – Notes on “Scoring Points”
- More Remarks on the Tesco Data Play
PPS I note that Narrative Science have picked up some more funding… Narrative Science raises $11.5M in equity funding
PPPS See also Data2Text, a start-up spun out of a natural language generation research group at the University of Aberdeen.